
Posted to dev@flink.apache.org by "Prabhu Joseph (Jira)" <ji...@apache.org> on 2023/06/29 11:13:00 UTC

[jira] [Created] (FLINK-32484) AdaptiveScheduler combined restart during scaling out

Prabhu Joseph created FLINK-32484:
-------------------------------------

Summary: AdaptiveScheduler combined restart during scaling out
Key: FLINK-32484
URL: https://issues.apache.org/jira/browse/FLINK-32484
Project: Flink
Issue Type: Improvement
Components: API / Core
Affects Versions: 1.17.0
Reporter: Prabhu Joseph

During a scale-out, when nodes are added at different times, AdaptiveScheduler restarts the job several times within a short period. In one of our Flink jobs, AdaptiveScheduler restarted the ExecutionGraph every time it was notified of new resources, resulting in five restarts within about 3 minutes (see the log below).

AdaptiveScheduler could offer a user-configurable restart window. During this interval it would accumulate the notified resources and restart only once, either when the available resources are sufficient for the desired parallelism or when the window times out. This would apply only while the execution graph is in the executing state, not while it is in the waiting-for-resources state.

{code:java}
[root@ip-172-31-40-185 container_1688034805200_0002_01_000001]# grep -i scale *
jobmanager.log:2023-06-29 10:46:58,061 INFO org.apache.flink.runtime.scheduler.adaptive.AdaptiveScheduler [] - New resources are available. Restarting job to scale up.
jobmanager.log:2023-06-29 10:47:57,317 INFO org.apache.flink.runtime.scheduler.adaptive.AdaptiveScheduler [] - New resources are available. Restarting job to scale up.
jobmanager.log:2023-06-29 10:48:53,314 INFO org.apache.flink.runtime.scheduler.adaptive.AdaptiveScheduler [] - New resources are available. Restarting job to scale up.
jobmanager.log:2023-06-29 10:49:27,821 INFO org.apache.flink.runtime.scheduler.adaptive.AdaptiveScheduler [] - New resources are available. Restarting job to scale up.
jobmanager.log:2023-06-29 10:50:15,672 INFO org.apache.flink.runtime.scheduler.adaptive.AdaptiveScheduler [] - New resources are available. Restarting job to scale up.
[root@ip-172-31-40-185 container_1688034805200_0002_01_000001]#
{code}
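
The proposed restart window could be sketched roughly as follows. This is only an illustration of the idea, not existing Flink code: the class RestartWindow, its methods, and the slot counts in the demo are all hypothetical, and the sketch leaves out the check that the job is in the executing state. Time is passed in explicitly so the logic can be exercised without waiting.

```java
import java.time.Duration;

/** Hypothetical helper (not a Flink class): batches resource notifications into one restart. */
final class RestartWindow {
    private final long windowMillis;
    private final int desiredSlots;
    private int availableSlots;
    private long windowStartMillis = -1L; // -1 means no window is open

    RestartWindow(Duration window, int desiredSlots, int initialSlots) {
        this.windowMillis = window.toMillis();
        this.desiredSlots = desiredSlots;
        this.availableSlots = initialSlots;
    }

    /** Records newly notified slots; returns true if the job should restart now. */
    boolean onNewResources(int newSlots, long nowMillis) {
        if (windowStartMillis < 0) {
            windowStartMillis = nowMillis; // first notification opens the window
        }
        availableSlots += newSlots;
        return restartIfDue(nowMillis);
    }

    /** Periodic check so that an open window also times out without further notifications. */
    boolean onTimer(long nowMillis) {
        return windowStartMillis >= 0 && restartIfDue(nowMillis);
    }

    private boolean restartIfDue(long nowMillis) {
        boolean enough = availableSlots >= desiredSlots;
        boolean timedOut = nowMillis - windowStartMillis >= windowMillis;
        if (enough || timedOut) {
            windowStartMillis = -1L; // close the window; the caller restarts once
            return true;
        }
        return false;
    }
}

final class RestartWindowDemo {
    public static void main(String[] args) {
        // Assumed numbers: the job runs on 1 slot, wants 6, and each notification adds 1 slot.
        RestartWindow window = new RestartWindow(Duration.ofMinutes(5), 6, 1);
        // Offsets in seconds of the five log lines above, relative to 10:46:58.
        long[] arrivalsSeconds = {0, 59, 115, 149, 197};
        int restarts = 0;
        for (long s : arrivalsSeconds) {
            if (window.onNewResources(1, s * 1000L)) {
                restarts++;
            }
        }
        System.out.println("restarts = " + restarts); // prints "restarts = 1"
    }
}
```

Under these assumed numbers, the five notifications from the log would lead to a single restart instead of five. If the desired parallelism were never reached, the onTimer check would trigger one restart when the window expires, using whatever resources had arrived by then.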

--
This message was sent by Atlassian Jira
(v8.20.10#820010)