Posted to issues@flink.apache.org by "Zhu Zhu (Jira)" <ji...@apache.org> on 2022/03/02 09:10:00 UTC

[jira] [Comment Edited] (FLINK-26400) Release Testing: Explicit shutdown signalling from TaskManager to JobManager

    [ https://issues.apache.org/jira/browse/FLINK-26400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17499990#comment-17499990 ] 

Zhu Zhu edited comment on FLINK-26400 at 3/2/22, 9:09 AM:
----------------------------------------------------------

I have done the test and the result looks good. The stopped TaskManager is quickly identified by the JobManager, which triggers a restart. I also tried killing a TaskManager directly with `kill` (SIGTERM), and that works well too.

Thanks for this improvement [~nsemmler]! I think it is very useful even outside reactive mode, because taking a problematic machine out of service is very common in production and, in our experience, a common cause of Flink streaming job failures. This improvement speeds up error detection and hence job recovery.

I noticed 2 problems which are out of the scope of this test though:
1. When there are no resources and the job is in CREATED state, opening the job page shows the error "Job failed during initialization of JobManager", and the job topology and other job information cannot be seen.
2. The parallelisms of graph nodes shown in the web UI are not updated when job vertex parallelisms change. I guess the root cause is that the JSON plan is not updated on job vertex parallelism changes, because we encountered the same issue when developing the AdaptiveBatchScheduler. As a reference, our solution can be found [here|https://github.com/apache/flink/blob/152ad4fc14920372076c0004793c179141ae10c7/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptivebatch/AdaptiveBatchScheduler.java#L224].



> Release Testing: Explicit shutdown signalling from TaskManager to JobManager
> ----------------------------------------------------------------------------
>
>                 Key: FLINK-26400
>                 URL: https://issues.apache.org/jira/browse/FLINK-26400
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.15.0
>            Reporter: Niklas Semmler
>            Assignee: Zhu Zhu
>            Priority: Blocker
>              Labels: release-testing
>             Fix For: 1.15.0
>
>
> FLINK-25277 introduces explicit signalling between a TaskManager and the JobManager when the TaskManager shuts down. This reduces the time it takes for a reactive cluster to down-scale & restart.
>  
> *Setup*
>  # Add the following lines to your Flink configuration to enable reactive mode:
> {code}
> taskmanager.host: localhost # a workaround
> scheduler-mode: reactive
> restart-strategy: fixeddelay
> restart-strategy.fixed-delay.attempts: 100
> {code}
>  # Create a “usrlib” folder and place the TopSpeedWindowing jar into it
> {code:bash}
> $ mkdir usrlib
> $ cp examples/streaming/TopSpeedWindowing.jar usrlib/
> {code}
>  # Start the job 
> {code:bash}
> $ bin/standalone-job.sh start  --main-class org.apache.flink.streaming.examples.windowing.TopSpeedWindowing
> {code}
>  # Start three task managers
> {code:bash}
> $ bin/taskmanager.sh start
> $ bin/taskmanager.sh start
> $ bin/taskmanager.sh start
> {code}
>  # Wait for the job to stabilize. The log file should show that three tasks start for every operator.
> {code}
>  GlobalWindows -> Sink: Print to Std. Out (3/3) (d10339d5755d07f3d9864ed1b2147af2) switched from INITIALIZING to RUNNING.
> {code}
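> A quick way to count these transitions from the shell (the JobManager log file name pattern below is an assumption; adjust it to your installation):
> {code:bash}
> # Count RUNNING transitions in the JobManager log; with three TaskManagers
> # each operator should show 3 subtasks reaching RUNNING.
> # The log file pattern is an assumption; adjust to your installation.
> running=$(grep -h "switched from INITIALIZING to RUNNING" \
>     log/flink-*-standalonejob-*.log 2>/dev/null | wc -l)
> echo "RUNNING transitions seen: $running"
> {code}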
> *Test*
> Stop one taskmanager
> {code:bash}
> $ bin/taskmanager.sh stop
> {code}
> Success condition: You should see that the job cancels and re-runs after a few seconds. In the logs you should find a line with the text “The TaskExecutor is shutting down”.
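> That log line can be checked with grep; the log location below is an assumption, adjust it to your installation:
> {code:bash}
> # Look for the explicit shutdown signal in the Flink logs.
> # The log directory and file pattern are assumptions; adjust to your setup.
> if grep -q "The TaskExecutor is shutting down" log/flink-*.log 2>/dev/null; then
>     echo "shutdown signal found"
> else
>     echo "shutdown signal not found yet (check log location)"
> fi
> {code}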
> *Teardown*
> Stop all taskmanagers and the jobmanager:
> {code:bash}
> $ bin/standalone-job.sh stop
> $ bin/taskmanager.sh stop-all
> {code}
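> To verify the teardown, jps can be used to check that no Flink JVMs remain. The main-class names below match what taskmanager.sh and standalone-job.sh launch, but are listed here as an assumption; verify with a plain `jps` on your machine:
> {code:bash}
> # List any leftover Flink JVMs after teardown. TaskManagerRunner and
> # StandaloneApplicationClusterEntryPoint are the expected main classes
> # for taskmanager.sh and standalone-job.sh (assumption; verify with jps).
> if jps 2>/dev/null | grep -Eq 'TaskManagerRunner|StandaloneApplicationClusterEntryPoint'; then
>     echo "Flink processes still running:"
>     jps | grep -E 'TaskManagerRunner|StandaloneApplicationClusterEntryPoint'
> else
>     echo "all Flink processes stopped"
> fi
> {code}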



--
This message was sent by Atlassian Jira
(v8.20.1#820001)