Posted to issues@flink.apache.org by "Piotr Nowojski (JIRA)" <ji...@apache.org> on 2019/01/03 09:54:00 UTC
[jira] [Commented] (FLINK-10981) Add or modify metrics to show the maximum usage of InputBufferPool/OutputBufferPool to help debugging back pressure

    [ https://issues.apache.org/jira/browse/FLINK-10981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16732826#comment-16732826 ] 

Piotr Nowojski commented on FLINK-10981:
----------------------------------------

[~gaoyunhaii] aren't the existing metrics sufficient, and essentially equivalent to what you are proposing?

* {{totalQueueLen}}: Total number of queued buffers in all input/output channels.
* {{minQueueLen}}: Minimum number of queued buffers in all input/output channels.
* {{maxQueueLen}}: Maximum number of queued buffers in all input/output channels.
* {{avgQueueLen}}: Average number of queued buffers in all input/output channels.

https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/metrics.html#network

Actually, what I missed more was aggregation of the metrics. Metrics could be collected at the lowest possible level (some at the operator level, others at the task level, etc.) and then aggregated upwards: operator -> operator chain (?) -> task -> stage (all tasks doing the same thing across multiple task managers) -> job. Each level could present the aggregated stats AND define some new custom ones. I especially missed something like this when analysing where the bottleneck was on a large cluster running a huge job that had ~50 tasks with parallelism ~200.
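
To make the roll-up idea concrete, here is a minimal sketch in plain Java. It is not Flink's metrics API: the class {{QueueStats}} and its methods are hypothetical names chosen for illustration. It only shows that min/max/total/count can be merged level by level, so that a stage or job value can be derived from task values without revisiting individual channels.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical mergeable queue-length statistics; an illustration, not a Flink class. */
final class QueueStats {
    final long total; // sum of queued buffers over all folded-in channels
    final long min;   // smallest per-channel queue length seen
    final long max;   // largest per-channel queue length seen
    final long count; // number of channels folded into this value

    QueueStats(long total, long min, long max, long count) {
        this.total = total;
        this.min = min;
        this.max = max;
        this.count = count;
    }

    /** Statistics for a single channel with the given queue length. */
    static QueueStats ofChannel(long queuedBuffers) {
        return new QueueStats(queuedBuffers, queuedBuffers, queuedBuffers, 1);
    }

    /** Combines two levels; merging is associative, so it can be applied at every level. */
    QueueStats merge(QueueStats other) {
        return new QueueStats(
                total + other.total,
                Math.min(min, other.min),
                Math.max(max, other.max),
                count + other.count);
    }

    /** The average is derived from total and count, so it stays correct after merging. */
    double avg() {
        return count == 0 ? 0.0 : (double) total / count;
    }

    /** Rolls up the children of one level (e.g. tasks of a stage); expects a non-empty list. */
    static QueueStats rollUp(List<QueueStats> children) {
        QueueStats acc = children.get(0);
        for (int i = 1; i < children.size(); i++) {
            acc = acc.merge(children.get(i));
        }
        return acc;
    }

    public static void main(String[] args) {
        QueueStats task1 = rollUp(Arrays.asList(ofChannel(0), ofChannel(10)));
        QueueStats task2 = rollUp(Arrays.asList(ofChannel(4)));
        QueueStats stage = rollUp(Arrays.asList(task1, task2));
        System.out.println(stage.min + " " + stage.max + " " + stage.total + " " + stage.avg());
    }
}
```

Carrying {{total}} and {{count}} instead of a precomputed average is deliberate: averaging the averages of the lower level would weight a task with one channel the same as a task with many.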

> Add or modify metrics to show the maximum usage of InputBufferPool/OutputBufferPool to help debugging back pressure
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-10981
>                 URL: https://issues.apache.org/jira/browse/FLINK-10981
>             Project: Flink
>          Issue Type: Improvement
>          Components: Metrics, Network
>            Reporter: Yun Gao
>            Assignee: Yun Gao
>            Priority: Major
>
> Currently the network layer provides two metrics, namely _InputBufferPoolUsageGauge_ and _OutputBufferPoolUsageGauge_, to show the usage of the input buffer pool and the output buffer pool. When there are multiple inputs (SingleInputGate) or outputs (ResultPartition), the two metrics show their average usage.
>  
> However, we found that the maximum usage across all the InputBufferPools or OutputBufferPools is also useful when debugging back pressure. Suppose we have a job with the following job graph:
>  
> {code:java}
>           F     
>            \
>             \
>             _\/      
> A ---> B ----> C ---> D
>        \
>         \
>          \-> E 
>          {code}
> Also suppose that D is very slow and thus causes back pressure, while E is very fast and F outputs few records, so the usage of their corresponding input/output buffer pools is almost 0.
>  
> Then the average input/output buffer usage of each task will be:
>  
> {code:java}
> A(100%) --> (100%) B (50%) --> (50%) C (100%) --> (100%) D
> {code}
>  
>  
> But the maximum input/output buffer usage of each task will be:
>  
> {code:java}
> A(100%) --> (100%) B (100%) --> (100%) C (100%) --> (100%) D
> {code}
> Users will then be able to find the slowest task by finding the first task whose input buffer usage is 100% but whose output buffer usage is less than 100%.
>  
>  
> If it is reasonable to show the maximum input/output buffer usage, I think there are three options:
>  # Modify the current computation logic of _InputBufferPoolUsageGauge_ and _OutputBufferPoolUsageGauge_.
>  # Add two new metrics, _InputBufferPoolMaxUsageGauge_ and _OutputBufferPoolMaxUsageGauge_.
>  # Try to show distinct usage for each input/output buffer pool.
> I think the second option is probably preferable.
>  
> What do you think about that?
>  
>  
>  
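
The difference between the two aggregations in the quoted example can be checked with a small plain-Java sketch. It does not use Flink's gauge classes; {{BufferUsage}}, {{Task}} and {{firstBottleneck}} are hypothetical names, and it assumes that a task with no input (A) or no output (D) reports a usage of 0. It applies the rule from the description (first task with input usage 100% and output usage below 100%) to both the averaged and the maximum values.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper comparing average and maximum buffer pool usage; not a Flink class. */
final class BufferUsage {

    /** A task with one aggregated input usage and one aggregated output usage (0.0 to 1.0). */
    static final class Task {
        final String name;
        final double in;
        final double out;

        Task(String name, double in, double out) {
            this.name = name;
            this.in = in;
            this.out = out;
        }
    }

    /** Mean usage over all pools of a task; 0.0 when the task has no pool (assumption). */
    static double average(double... usages) {
        if (usages.length == 0) {
            return 0.0;
        }
        double sum = 0.0;
        for (double u : usages) {
            sum += u;
        }
        return sum / usages.length;
    }

    /** Highest usage over all pools of a task; 0.0 when the task has no pool (assumption). */
    static double max(double... usages) {
        double m = 0.0;
        for (double u : usages) {
            m = Math.max(m, u);
        }
        return m;
    }

    /** First task, in upstream-to-downstream order, with a full input and a non-full output. */
    static String firstBottleneck(List<Task> tasksInOrder) {
        for (Task t : tasksInOrder) {
            if (t.in >= 1.0 && t.out < 1.0) {
                return t.name;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // B writes to C (100%) and E (0%); C reads from B (100%) and F (0%).
        List<Task> byAverage = Arrays.asList(
                new Task("A", average(), average(1.0)),
                new Task("B", average(1.0), average(1.0, 0.0)),
                new Task("C", average(1.0, 0.0), average(1.0)),
                new Task("D", average(1.0), average()));
        List<Task> byMax = Arrays.asList(
                new Task("A", max(), max(1.0)),
                new Task("B", max(1.0), max(1.0, 0.0)),
                new Task("C", max(1.0, 0.0), max(1.0)),
                new Task("D", max(1.0), max()));
        System.out.println("average: " + firstBottleneck(byAverage));
        System.out.println("maximum: " + firstBottleneck(byMax));
    }
}
```

With averaged values the rule stops at B, because the idle edge to E pulls B's output usage down to 50%; with maximum values it reaches D, which is the slow task in the example.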



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)