You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@flink.apache.org by "Zhanghao Chen (Jira)" <ji...@apache.org> on 2023/04/11 08:05:00 UTC
[jira] [Created] (FLINK-31769) Add percentiles to aggregated metrics
Zhanghao Chen created FLINK-31769:
-------------------------------------
Summary: Add percentiles to aggregated metrics
Key: FLINK-31769
URL: https://issues.apache.org/jira/browse/FLINK-31769
Project: Flink
Issue Type: Improvement
Components: Autoscaler, Runtime / Metrics
Reporter: Zhanghao Chen
Attachments: image-2023-04-11-15-11-51-471.png
*Background*
Currently only min/avg/max of metrics are exposed via REST API. Flink Autoscaler relies on these aggregated metrics to make predictions, and the type of aggregation plays an import role. [FLINK-30652] Use max busytime instead of average to compute true processing rate - ASF JIRA (apache.org) suggests that using max aggregator instead of avg of busy time can handle data skew more robustly. However, we found that for large-scale jobs, using max aggregation may be too sensitive. As a result, the true processing rate is underestimated with severe turbulence.
The graph below is the true processing rate estimated with different aggregators of a real production data transmission job with a parallelism of 750.
!image-2023-04-11-15-11-51-471.png!
*Proposal*
Add percentiles (p50, p90, p99) to aggregated metrics. Apache common maths can be used for computing that.
A follow up would be making Flink autoscaler make use of the new aggregators.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)