Posted to dev@flink.apache.org by "Niklas Semmler (Jira)" <ji...@apache.org> on 2021/12/13 14:49:00 UTC
[jira] [Created] (FLINK-25277) Introduce explicit shutdown signalling between TaskManager and JobManager
Niklas Semmler created FLINK-25277:
--------------------------------------
Summary: Introduce explicit shutdown signalling between TaskManager and JobManager
Key: FLINK-25277
URL: https://issues.apache.org/jira/browse/FLINK-25277
Project: Flink
Issue Type: Improvement
Components: Runtime / Coordination
Affects Versions: 1.14.0, 1.13.0
Reporter: Niklas Semmler
Fix For: 1.15.0
We need to introduce explicit shutdown signalling between the TaskManager and the JobManager to enable fast and graceful shutdown in reactive scheduler mode.
In Flink 1.14 and earlier versions, the JobManager tracks the availability of a TaskManager using a heartbeat. By default, this heartbeat is configured with an interval of 10 seconds and a timeout of 50 seconds [1]. Hence, the shutdown of a TaskManager is recognized only after about 50-60 seconds. This works fine for the static scheduling mode, where a TaskManager only disappears as part of a cluster shutdown or a job failure. However, in the reactive scheduler mode (FLINK-10407), TaskManagers are regularly added to and removed from a running job. Here, the heartbeat mechanism incurs additional delays.
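To illustrate why a purely timeout-based check reacts slowly, here is a minimal sketch of such a liveness monitor using the default values quoted above. It is a simplified model, not Flink's actual heartbeat implementation; the class and method names (HeartbeatTimeoutSketch, onHeartbeat, isConsideredDead) are hypothetical and chosen only for this example.

```java
import java.time.Duration;

/** Simplified timeout-based liveness check. Hypothetical; not Flink's heartbeat classes. */
final class HeartbeatTimeoutSketch {
    // Default values quoted in the issue text: 10 s interval, 50 s timeout.
    static final Duration INTERVAL = Duration.ofSeconds(10);
    static final Duration TIMEOUT = Duration.ofSeconds(50);

    private long lastHeartbeatMillis;

    HeartbeatTimeoutSketch(long nowMillis) {
        this.lastHeartbeatMillis = nowMillis;
    }

    /** Records a heartbeat received from the TaskManager. */
    void onHeartbeat(long nowMillis) {
        lastHeartbeatMillis = nowMillis;
    }

    /** The peer is declared dead only once a full timeout has passed without a heartbeat. */
    boolean isConsideredDead(long nowMillis) {
        return nowMillis - lastHeartbeatMillis >= TIMEOUT.toMillis();
    }

    public static void main(String[] args) {
        HeartbeatTimeoutSketch monitor = new HeartbeatTimeoutSketch(0);
        // Heartbeats arrive once per interval, then the TaskManager goes away silently.
        monitor.onHeartbeat(INTERVAL.toMillis());
        monitor.onHeartbeat(2 * INTERVAL.toMillis());

        // 40 s after the last heartbeat the TaskManager still counts as alive.
        System.out.println(monitor.isConsideredDead(60_000)); // false
        // Only after the full 50 s timeout is its disappearance noticed.
        System.out.println(monitor.isConsideredDead(70_000)); // true
    }
}
```

In this model nothing distinguishes a TaskManager that was shut down on purpose from one that is merely slow, so the JobManager has no choice but to wait out the timeout.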
To remove these delays, we add an explicit shutdown signal from the TaskManager to the JobManager. Additionally, to avoid data loss in a running job, the TaskManager will wait for a shutdown confirmation from the JobManager before shutting down.
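The proposed handshake could be sketched roughly as follows. This is only an illustration of the idea under stated assumptions, not the design that was or will be implemented in Flink: JobManagerStub, TaskManagerStub, notifyShutdown and shutdownGracefully are hypothetical names, and the behaviour when no confirmation arrives in time is not specified in this issue, so the sketch simply propagates the failure.

```java
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of an explicit shutdown handshake; not Flink's RPC interfaces. */
final class ShutdownHandshakeSketch {

    /** Stand-in for the JobManager side: tracks which TaskManagers are registered. */
    static final class JobManagerStub {
        private final Set<String> registered = ConcurrentHashMap.newKeySet();

        void register(String taskManagerId) {
            registered.add(taskManagerId);
        }

        boolean isRegistered(String taskManagerId) {
            return registered.contains(taskManagerId);
        }

        /** Handles the shutdown signal; the returned future completing is the confirmation. */
        CompletableFuture<Void> notifyShutdown(String taskManagerId) {
            return CompletableFuture.runAsync(() -> {
                // A real JobManager would first move work off this TaskManager.
                registered.remove(taskManagerId);
            });
        }
    }

    /** Stand-in for the TaskManager side. */
    static final class TaskManagerStub {
        private final String id;
        private final JobManagerStub jobManager;
        private volatile boolean stopped;

        TaskManagerStub(String id, JobManagerStub jobManager) {
            this.id = id;
            this.jobManager = jobManager;
        }

        boolean isStopped() {
            return stopped;
        }

        /** Signals the shutdown and stops only after the JobManager has confirmed it. */
        void shutdownGracefully(long timeout, TimeUnit unit) throws Exception {
            // Assumption: a missing confirmation surfaces as an exception and the stop is skipped.
            jobManager.notifyShutdown(id).get(timeout, unit);
            stopped = true;
        }
    }

    public static void main(String[] args) throws Exception {
        JobManagerStub jobManager = new JobManagerStub();
        jobManager.register("tm-1");

        TaskManagerStub taskManager = new TaskManagerStub("tm-1", jobManager);
        taskManager.shutdownGracefully(5, TimeUnit.SECONDS);

        System.out.println(jobManager.isRegistered("tm-1")); // false
        System.out.println(taskManager.isStopped());         // true
    }
}
```

The point of the sketch is the ordering: the JobManager learns about the departure immediately instead of waiting for a heartbeat timeout, and the TaskManager does not stop until the confirmation has arrived.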
[1] https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#heartbeat-timeout