Posted to issues@flink.apache.org by "Piotr Nowojski (JIRA)" <ji...@apache.org> on 2019/01/03 09:54:00 UTC
[jira] [Commented] (FLINK-10981) Add or modify metrics to show the
maximum usage of InputBufferPool/OutputBufferPool to help debugging back
pressure
[ https://issues.apache.org/jira/browse/FLINK-10981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16732826#comment-16732826 ]
Piotr Nowojski commented on FLINK-10981:
----------------------------------------
[~gaoyunhaii] aren't the existing metrics already enough, and basically equivalent to what you are proposing?
* {{totalQueueLen}} Total number of queued buffers in all input/output channels.
* {{minQueueLen}} Minimum number of queued buffers in all input/output channels.
* {{maxQueueLen}} Maximum number of queued buffers in all input/output channels.
* {{avgQueueLen}} Average number of queued buffers in all input/output channels.
https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/metrics.html#network
Actually, one thing that I was missing more is aggregation of the metrics: metrics could be collected at the lowest possible level (some at the operator level, others at the task level, etc.) and then aggregated up: operator -> operator chain (?) -> task -> stage (all tasks doing the same thing across multiple task managers) -> job. Each level could both present aggregated stats AND define some new custom ones. I was especially missing something like this when analysing where the bottleneck was on a large cluster with a huge job that had ~50 tasks with parallelism ~200.
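A minimal sketch of what such a roll-up could look like, purely for illustration. The {{Scope}} class and its methods are hypothetical names, not existing Flink APIs; the sketch only assumes that each level can hold its own samples and ask its children for theirs.

```java
import java.util.ArrayList;
import java.util.List;

public class MetricRollupSketch {

    /** Hypothetical scope node: a job, stage, task, operator chain, or operator. */
    static final class Scope {
        final String name;
        final List<Scope> children = new ArrayList<>();
        // Raw queue-length samples collected directly at this level.
        final List<Integer> queueLengths = new ArrayList<>();

        Scope(String name) {
            this.name = name;
        }

        Scope add(Scope child) {
            children.add(child);
            return this;
        }

        Scope sample(int queueLength) {
            queueLengths.add(queueLength);
            return this;
        }

        /** Largest queue length seen at this scope or anywhere below it. */
        int maxQueueLen() {
            int max = 0;
            for (int q : queueLengths) {
                max = Math.max(max, q);
            }
            for (Scope child : children) {
                max = Math.max(max, child.maxQueueLen());
            }
            return max;
        }

        /** Sum of all queue lengths at this scope and below it. */
        int totalQueueLen() {
            int total = 0;
            for (int q : queueLengths) {
                total += q;
            }
            for (Scope child : children) {
                total += child.totalQueueLen();
            }
            return total;
        }
    }

    public static void main(String[] args) {
        // job -> stage -> two parallel task instances (names are made up)
        Scope stage = new Scope("stage-map")
                .add(new Scope("task-0").sample(3))
                .add(new Scope("task-1").sample(7));
        Scope job = new Scope("job").add(stage);

        System.out.println("job maxQueueLen   = " + job.maxQueueLen());
        System.out.println("job totalQueueLen = " + job.totalQueueLen());
    }
}
```

With something along these lines, one could look at the job-level value first and then drill down stage by stage, instead of scanning ~50 x ~200 individual task metrics.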
> Add or modify metrics to show the maximum usage of InputBufferPool/OutputBufferPool to help debugging back pressure
> -------------------------------------------------------------------------------------------------------------------
>
> Key: FLINK-10981
> URL: https://issues.apache.org/jira/browse/FLINK-10981
> Project: Flink
> Issue Type: Improvement
> Components: Metrics, Network
> Reporter: Yun Gao
> Assignee: Yun Gao
> Priority: Major
>
> Currently the network layer provides two metrics items, namely _InputBufferPoolUsageGauge_ and _OutputBufferPoolUsageGauge_, to show the usage of the input buffer pool and the output buffer pool. When there are multiple inputs (SingleInputGate) or outputs (ResultPartition), the two metrics items show their average usage.
>
> However, we found that the maximum usage of all the InputBufferPool or OutputBufferPool is also useful in debugging back pressure. Suppose we have a job with the following job graph:
>
> {code:java}
>           F
>            \
>             \
>             _\/
> A ---> B ----> C ---> D
>         \
>          \
>           \-> E
> {code}
> Also suppose that D is very slow and thus causes back pressure, while E is very fast and F outputs few records, so the usage of their corresponding input/output buffer pools is almost 0.
>
> Then the average input/output buffer usage of each task will be:
>
> {code:java}
> A(100%) --> (100%) B (50%) --> (50%) C (100%) --> (100%) D
> {code}
>
>
> But the maximum input/output buffer usage of each task will be:
>
> {code:java}
> A(100%) --> (100%) B (100%) --> (100%) C (100%) --> (100%) D
> {code}
> Users will be able to find the slowest task by finding the first task whose input buffer usage is 100% but output usage is less than 100%.
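The difference between the two views can be reproduced with a few lines. This is only a sketch of the arithmetic in the example above, not Flink's gauge implementation: the class and method names are hypothetical, and it assumes equally sized buffer pools so that a simple mean of the per-pool usages stands in for the "average usage".

```java
import java.util.Arrays;

public class BufferPoolUsageSketch {

    /** Mean usage across the buffer pools on one side of a task (the averaged view). */
    static double averageUsage(double... poolUsages) {
        return Arrays.stream(poolUsages).average().orElse(0.0);
    }

    /** Highest usage among the buffer pools (the proposed maximum view). */
    static double maxUsage(double... poolUsages) {
        return Arrays.stream(poolUsages).max().orElse(0.0);
    }

    public static void main(String[] args) {
        // Task B has two output pools: towards C (full) and towards E (almost idle).
        double[] outputsOfB = {1.0, 0.0};
        // Task C has two input pools: from B (full) and from F (almost idle).
        double[] inputsOfC = {1.0, 0.0};

        System.out.println("B output: avg=" + averageUsage(outputsOfB)
                + " max=" + maxUsage(outputsOfB));
        System.out.println("C input:  avg=" + averageUsage(inputsOfC)
                + " max=" + maxUsage(inputsOfC));
    }
}
```

Under these assumptions the averaged view reports 50% for B's outputs and C's inputs, which hides the fully used B -> C path, while the maximum view reports 100% for both.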
>
>
> If it is reasonable to show the maximum input/output buffer usage, I think there may be three options:
> # Modify the current computation logic of _InputBufferPoolUsageGauge_ and _OutputBufferPoolUsageGauge_.
> # Add two new metrics items, _InputBufferPoolMaxUsageGauge_ and _OutputBufferPoolMaxUsageGauge_.
> # Try to show distinct usage for each input/output buffer pool.
> I think the second option is probably the preferred one.
>
> What do you think about that?
>
>
>