Posted to issues@flink.apache.org by "Yun Gao (Jira)" <ji...@apache.org> on 2020/08/03 09:35:00 UTC

[jira] [Commented] (FLINK-10981) Add or modify metrics to show the maximum usage of InputBufferPool/OutputBufferPool to help debugging back pressure

    [ https://issues.apache.org/jira/browse/FLINK-10981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17169853#comment-17169853 ] 

Yun Gao commented on FLINK-10981:
---------------------------------

Sorry for the late response; I was on holiday. I also agree that we could close this issue for now.

> Add or modify metrics to show the maximum usage of InputBufferPool/OutputBufferPool to help debugging back pressure
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-10981
>                 URL: https://issues.apache.org/jira/browse/FLINK-10981
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Metrics, Runtime / Network
>            Reporter: Yun Gao
>            Assignee: Yun Gao
>            Priority: Major
>
> Currently the network layer provides two metrics, _InputBufferPoolUsageGauge_ and _OutputBufferPoolUsageGauge_, which show the usage of the input buffer pool and the output buffer pool. When a task has multiple inputs (SingleInputGate) or outputs (ResultPartition), the two metrics show the average usage across them. 
>  
> However, we found that the maximum usage across all InputBufferPools or OutputBufferPools of a task is also useful when debugging back pressure. Suppose we have a job with the following job graph:
>  
> {code:java}
>           F     
>            \
>             \
>             _\/      
> A ---> B ----> C ---> D
>        \
>         \
>          \-> E 
>          {code}
> Suppose also that D is very slow and thus causes back pressure, while E is very fast and F outputs few records, so the usage of the input/output buffer pools corresponding to E and F is almost 0.
>  
> Then the average input/output buffer usage of each task will be:
>  
> {code:java}
> A(100%) --> (100%) B (50%) --> (50%) C (100%) --> (100%) D
> {code}
>  
>  
> But the maximum input/output buffer usage of each task will be:
>  
> {code:java}
> A(100%) --> (100%) B (100%) --> (100%) C (100%) --> (100%) D
> {code}
> Users will be able to find the slowest task by finding the first task whose input buffer usage is 100% but output usage is less than 100%.
>  
>  
> If it is reasonable to show the maximum input/output buffer usage, I think there are three options:
>  # Modify the current computation logic of _InputBufferPoolUsageGauge_ and _OutputBufferPoolUsageGauge_.
>  # Add two new metrics, _InputBufferPoolMaxUsageGauge_ and _OutputBufferPoolMaxUsageGauge_.
>  # Show distinct usage for each input/output buffer pool.
> Of these, I think the second option is preferable. 
>  
> What do you think?
>  
>  
>  
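
The average-versus-maximum comparison in the quoted description can be sketched roughly as follows. This is only an illustration under stated assumptions, not Flink code: the class BufferUsageSketch and its methods are hypothetical names, usage is modelled as a fraction between 0.0 and 1.0, the source A is assumed to report 0 input usage, and the sink D is assumed to report 0 output usage.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; none of these names are Flink APIs.
final class BufferUsageSketch {

    /** Average usage across a task's buffer pools (what the current gauges report). */
    static double average(double... usages) {
        return Arrays.stream(usages).average().orElse(0.0);
    }

    /** Maximum usage across a task's buffer pools (the proposed additional metric). */
    static double max(double... usages) {
        return Arrays.stream(usages).max().orElse(0.0);
    }

    /**
     * Returns the first task, in map iteration order, whose input usage is 100%
     * while its output usage is below 100%, or null if there is none.
     * Each value is {inputUsage, outputUsage}.
     */
    static String firstBottleneck(Map<String, double[]> usageByTask) {
        for (Map.Entry<String, double[]> entry : usageByTask.entrySet()) {
            double input = entry.getValue()[0];
            double output = entry.getValue()[1];
            if (input >= 1.0 && output < 1.0) {
                return entry.getKey();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        double[] bOutputs = {1.0, 0.0}; // B -> C is full, B -> E is almost empty
        double[] cInputs = {1.0, 0.0};  // B -> C is full, F -> C is almost empty

        // Tasks along the path A -> B -> C -> D, using the averaged metrics.
        Map<String, double[]> byAverage = new LinkedHashMap<>();
        byAverage.put("A", new double[] {0.0, 1.0});
        byAverage.put("B", new double[] {1.0, average(bOutputs)});
        byAverage.put("C", new double[] {average(cInputs), 1.0});
        byAverage.put("D", new double[] {1.0, 0.0});

        // The same tasks, using the maximum over each task's pools.
        Map<String, double[]> byMax = new LinkedHashMap<>();
        byMax.put("A", new double[] {0.0, 1.0});
        byMax.put("B", new double[] {1.0, max(bOutputs)});
        byMax.put("C", new double[] {max(cInputs), 1.0});
        byMax.put("D", new double[] {1.0, 0.0});

        System.out.println("Bottleneck by average: " + firstBottleneck(byAverage));
        System.out.println("Bottleneck by maximum: " + firstBottleneck(byMax));
    }
}
```

With the averaged values, the "input at 100%, output below 100%" rule stops at B, because B's idle output to E drags its average down. With the maximum values, the same rule skips B and C and lands on D, the task that is actually slow.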



--
This message was sent by Atlassian Jira
(v8.3.4#803005)