You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@flink.apache.org by "Zhanghao Chen (Jira)" <ji...@apache.org> on 2023/04/11 08:05:00 UTC

[jira] [Created] (FLINK-31769) Add percentiles to aggregated metrics

Zhanghao Chen created FLINK-31769:
-------------------------------------

             Summary: Add percentiles to aggregated metrics
                 Key: FLINK-31769
                 URL: https://issues.apache.org/jira/browse/FLINK-31769
             Project: Flink
          Issue Type: Improvement
          Components: Autoscaler, Runtime / Metrics
            Reporter: Zhanghao Chen
         Attachments: image-2023-04-11-15-11-51-471.png

*Background*

Currently only min/avg/max of metrics are exposed via REST API. Flink Autoscaler relies on these aggregated metrics to make predictions, and the type of aggregation plays an import role. [FLINK-30652] Use max busytime instead of average to compute true processing rate - ASF JIRA (apache.org) suggests that using max aggregator instead of avg of busy time can handle data skew more robustly. However, we found that for large-scale jobs, using max aggregation may be too sensitive. As a result, the true processing rate is underestimated with severe turbulence.

The graph below is the true processing rate estimated with different aggregators of a real production data transmission job with a parallelism of 750.

!image-2023-04-11-15-11-51-471.png!

*Proposal*

Add percentiles (p50, p90, p99) to aggregated metrics. Apache common maths can be used for computing that.

A follow up would be making Flink autoscaler make use of the new aggregators.

 

 

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)