Posted to issues@flink.apache.org by "Yun Gao (Jira)" <ji...@apache.org> on 2020/08/03 09:35:00 UTC
[jira] [Commented] (FLINK-10981) Add or modify metrics to show the
maximum usage of InputBufferPool/OutputBufferPool to help debugging back
pressure
[ https://issues.apache.org/jira/browse/FLINK-10981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17169853#comment-17169853 ]
Yun Gao commented on FLINK-10981:
---------------------------------
Sorry for the late response, I was on holiday. I also agree that we could close this issue for now.
> Add or modify metrics to show the maximum usage of InputBufferPool/OutputBufferPool to help debugging back pressure
> -------------------------------------------------------------------------------------------------------------------
>
> Key: FLINK-10981
> URL: https://issues.apache.org/jira/browse/FLINK-10981
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Metrics, Runtime / Network
> Reporter: Yun Gao
> Assignee: Yun Gao
> Priority: Major
>
> Currently the network layer provides two metrics, namely _InputBufferPoolUsageGauge_ and _OutputBufferPoolUsageGauge_, which show the usage of the input buffer pool and the output buffer pool. When there are multiple inputs (SingleInputGate) or outputs (ResultPartition), the two metrics show their average usage.
>
> However, we found that the maximum usage across all the InputBufferPools or OutputBufferPools is also useful for debugging back pressure. Suppose we have a job with the following job graph:
>
> {code:java}
>            F
>             \
>              \
>              _\/
> A ---> B ----> C ---> D
>        \
>         \
>          \-> E
> {code}
> Also suppose D is very slow and thus causes back pressure, while E is very fast and F outputs few records, so the usage of their corresponding input/output buffer pools is almost 0.
>
> Then the average input/output buffer usage of each task will be:
>
> {code:java}
> A(100%) --> (100%) B (50%) --> (50%) C (100%) --> (100%) D
> {code}
>
>
> But the maximum input/output buffer usage of each task will be:
>
> {code:java}
> A(100%) --> (100%) B (100%) --> (100%) C (100%) --> (100%) D
> {code}
> Users will be able to find the slowest task by finding the first task whose input buffer usage is 100% but output usage is less than 100%.
>
>
> If it is reasonable to show the maximum input/output buffer usage, I think there may be three options:
> # Modify the current computation logic of _InputBufferPoolUsageGauge_ and _OutputBufferPoolUsageGauge_.
> # Add two new metrics, _InputBufferPoolMaxUsageGauge_ and _OutputBufferPoolMaxUsageGauge_.
> # Try to show distinct usage for each input/output buffer pool.
> I think the second option is probably the preferable one.
>
> What do you think about that?
>
>
>
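The average-versus-maximum comparison in the quoted description can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not Flink code: the class BufferUsageSketch, the TaskUsage holder and all method names are hypothetical, usages are modelled as fractions between 0.0 and 1.0, A is treated as a source with no input usage, D as a sink with no output usage, and "almost 0" is rounded to 0.0.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch; none of these names exist in Flink.
final class BufferUsageSketch {

    /** Aggregated input/output buffer usage of one task, as a fraction in [0.0, 1.0]. */
    static final class TaskUsage {
        final String name;
        final double input;
        final double output;

        TaskUsage(String name, double input, double output) {
            this.name = name;
            this.input = input;
            this.output = output;
        }
    }

    /** Mean usage over a task's buffer pools, as the description says the current gauges report. */
    static double averageUsage(double... poolUsages) {
        return Arrays.stream(poolUsages).average().orElse(0.0);
    }

    /** Highest usage over a task's buffer pools, as the proposed max gauges would report. */
    static double maxUsage(double... poolUsages) {
        return Arrays.stream(poolUsages).max().orElse(0.0);
    }

    /** Builds the chain A -> B -> C -> D from the example, aggregated by average or by maximum. */
    static List<TaskUsage> exampleChain(boolean useMax) {
        double[] outputsOfB = {1.0, 0.0}; // B -> C is full, B -> E is almost empty
        double[] inputsOfC = {1.0, 0.0};  // B -> C is full, F -> C is almost empty
        double bOut = useMax ? maxUsage(outputsOfB) : averageUsage(outputsOfB);
        double cIn = useMax ? maxUsage(inputsOfC) : averageUsage(inputsOfC);
        return List.of(
                new TaskUsage("A", 0.0, 1.0),  // assumption: A is a source
                new TaskUsage("B", 1.0, bOut),
                new TaskUsage("C", cIn, 1.0),
                new TaskUsage("D", 1.0, 0.0)); // assumption: D is a sink
    }

    /** First task (in the given order) whose input usage is 100% but whose output usage is below 100%. */
    static Optional<String> firstBottleneck(List<TaskUsage> tasks) {
        return tasks.stream()
                .filter(t -> t.input >= 1.0 && t.output < 1.0)
                .map(t -> t.name)
                .findFirst();
    }

    public static void main(String[] args) {
        System.out.println("average: " + firstBottleneck(exampleChain(false)).orElse("none"));
        System.out.println("maximum: " + firstBottleneck(exampleChain(true)).orElse("none"));
    }
}
```

Under these assumptions the heuristic picks B when the usages are averaged, because B's output drops to 50%, and picks D when the maximum is used, which matches the slow task in the example.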
--
This message was sent by Atlassian Jira
(v8.3.4#803005)