Posted to issues@flink.apache.org by "Zhilong Hong (Jira)" <ji...@apache.org> on 2020/12/17 03:39:00 UTC

[jira] [Comment Edited] (FLINK-20612) Add benchmarks for scheduler

    [ https://issues.apache.org/jira/browse/FLINK-20612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17250775#comment-17250775 ] 

Zhilong Hong edited comment on FLINK-20612 at 12/17/20, 3:38 AM:
-----------------------------------------------------------------

Thank you for your reply, [~trohrmann]. We implemented two versions: unit tests that run in CI and benchmarks that run in flink-benchmark. We think both have pros and cons.

For unit tests:
 * Pros: While implementing a POC of an optimization, we can validate it by running the tests in the IDE, rather than compiling the entire Flink project and running the benchmarks. This saves time.
 * Cons: If we set the parallelism to a large number like 8k or more, the unit tests require a large amount of heap memory, which may make the test cases unstable.

For the flink-benchmark version (a rough JMH sketch of what such a benchmark could look like follows this list):
 * Pros: With flink-benchmark, we can run the benchmarks daily and trace how the performance changes over time.
 * Cons: Evaluating the effect of an optimization takes a long time: the scheduler benchmarks alone take about 1.5 hours to run. For reference, the current flink-benchmark suite takes about 2 hours on my computer.
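To make this more concrete, below is a minimal JMH sketch of what a scheduler benchmark in flink-benchmark could look like. It is only a sketch: the class and method names are illustrative, it relies on the 1.12-era runtime test APIs ({{JobVertex}}, {{NoOpInvokable}}, the varargs {{JobGraph}} constructor), and it only times building the two-vertex JobGraph; the real benchmarks would additionally build the ExecutionGraph so that the procedures listed in the issue (e.g. {{ExecutionGraph#attachJobGraph}}) are what gets measured.

{code:java}
import java.util.concurrent.TimeUnit;

import org.apache.flink.runtime.io.network.partition.ResultPartitionType;
import org.apache.flink.runtime.jobgraph.DistributionPattern;
import org.apache.flink.runtime.jobgraph.JobGraph;
import org.apache.flink.runtime.jobgraph.JobVertex;
import org.apache.flink.runtime.testtasks.NoOpInvokable;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
public class SchedulerTopologyBenchmark {

    // Parallelism of both vertices, matching the 8000 used in the issue description.
    private static final int PARALLELISM = 8000;

    @Benchmark
    public JobGraph buildJobGraph() {
        // Two vertices connected with an all-to-all blocking edge.
        JobVertex source = new JobVertex("source");
        JobVertex sink = new JobVertex("sink");
        source.setInvokableClass(NoOpInvokable.class);
        sink.setInvokableClass(NoOpInvokable.class);
        source.setParallelism(PARALLELISM);
        sink.setParallelism(PARALLELISM);
        sink.connectNewDataSetAsInput(source, DistributionPattern.ALL_TO_ALL, ResultPartitionType.BLOCKING);

        // A real benchmark would continue from here and build the ExecutionGraph,
        // so that ExecutionGraph#attachJobGraph is what gets timed.
        return new JobGraph("scheduler-benchmark", source, sink);
    }
}
{code}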

As a result, we prefer the flink-benchmark version. It's more stable and helps us trace the performance over time.

We will keep the unit tests locally and use them to test our optimizations in the future.


> Add benchmarks for scheduler
> ----------------------------
>
>                 Key: FLINK-20612
>                 URL: https://issues.apache.org/jira/browse/FLINK-20612
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.13.0
>            Reporter: Zhilong Hong
>            Priority: Major
>
> With Flink 1.12, we failed to run large-scale jobs on our cluster. When we tried to run them, we hit exceptions such as out-of-heap-memory errors and TaskManager heartbeat timeouts. Even after we increased the heap size and extended the heartbeat timeout, the jobs still failed. After troubleshooting, we found several performance bottlenecks in the JobMaster. These bottlenecks are closely related to the complexity of the topology.
> We implemented several benchmarks for these bottlenecks based on flink-benchmark. The topology of the benchmarks is a simple graph consisting of only two vertices, one source vertex and one sink vertex, connected with all-to-all blocking edges. The parallelism of both vertices is 8000, and the execution mode is batch. The results of the benchmarks are shown below:
> Table 1: Results of the benchmarks on the bottlenecks in the JobMaster
> |*Procedure*|*Time spent*|
> |Build topology|19970.44 ms|
> |Init scheduling strategy|38167.351 ms|
> |Deploy tasks|15102.850 ms|
> |Calculate failover region to restart|12080.271 ms|
> We'd like to propose these benchmarks for procedures related to the scheduler. There are three main benefits:
>  # They help us to understand the current status of task deployment performance and locate where the bottleneck is.
>  # We can use the benchmarks to evaluate the optimization in the future.
>  # As we run the benchmarks daily, they will help us trace how the performance changes and locate the commit that introduced a performance regression, if there is one.
> In the first version of the benchmarks, we mainly focus on the procedures we mentioned above. The methods corresponding to the procedures are:
>  # Building topology: {{ExecutionGraph#attachJobGraph}}
>  # Initializing scheduling strategies: {{PipelinedRegionSchedulingStrategy#init}}
>  # Deploying tasks: {{Execution#deploy}}
>  # Calculating failover regions: {{RestartPipelinedRegionFailoverStrategy#getTasksNeedingRestart}}
> In the benchmarks, the topology consists of two vertices, source -> sink, connected with all-to-all edges. The result partition types ({{PIPELINED}} and {{BLOCKING}}) should be benchmarked separately; a rough sketch of how this could be parameterized follows.
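> A minimal sketch, assuming JMH's {{@Param}} support for enums, of how one parameterized benchmark could cover both partition types; the class and field names are illustrative:
> {code:java}
> import org.apache.flink.runtime.io.network.partition.ResultPartitionType;
>
> import org.openjdk.jmh.annotations.Param;
> import org.openjdk.jmh.annotations.Scope;
> import org.openjdk.jmh.annotations.State;
>
> @State(Scope.Thread)
> public class PartitionTypeState {
>     // JMH runs the benchmark once per listed value, so the PIPELINED and
>     // BLOCKING results are measured and reported separately.
>     @Param({"PIPELINED", "BLOCKING"})
>     public ResultPartitionType partitionType;
> }
> {code}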


