Posted to issues@flink.apache.org by "Zhu Zhu (JIRA)" <ji...@apache.org> on 2019/04/10 04:36:00 UTC
[jira] [Commented] (FLINK-11813) Standby per job mode Dispatchers don't know job's JobSchedulingStatus

    [ https://issues.apache.org/jira/browse/FLINK-11813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16814030#comment-16814030 ] 

Zhu Zhu commented on FLINK-11813:
---------------------------------

I think PENDING and RUNNING, together with a null (or NONE) status, are enough.

If the job is PENDING or RUNNING, the *JobManagerRunner* can start the *JobMaster* once it is granted leadership.

If the job is not found, the *JobManagerRunner* should shut down, so that a terminated job will not be restarted.
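
To make the two cases above concrete, here is a minimal sketch of the decision a runner could take when it is granted leadership. All names in it (LeadershipDecisionSketch, Status, Action, onGrantLeadership) are hypothetical and are not part of the existing Flink API; an empty Optional stands in for the null/NONE status.

```java
import java.util.Optional;

public class LeadershipDecisionSketch {

    /** Hypothetical status values; "no status" (NONE) is modelled as an empty Optional. */
    enum Status { PENDING, RUNNING }

    /** Hypothetical outcomes of the leadership callback. */
    enum Action { START_JOB_MASTER, SHUT_DOWN }

    /** Decides what the runner should do once it has been granted leadership. */
    static Action onGrantLeadership(Optional<Status> status) {
        // PENDING and RUNNING are handled identically: the job still needs a JobMaster.
        // A missing status means the job is not found, so the runner shuts down
        // instead of restarting a job that has already terminated.
        return status.isPresent() ? Action.START_JOB_MASTER : Action.SHUT_DOWN;
    }

    public static void main(String[] args) {
        System.out.println(onGrantLeadership(Optional.of(Status.PENDING)));
        System.out.println(onGrantLeadership(Optional.of(Status.RUNNING)));
        System.out.println(onGrantLeadership(Optional.empty()));
    }
}
```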

 

Currently, the job status in *RunningJobsRegistry* is PENDING by default, even if the job no longer exists. I think it should be NONE in this case.

The JobSchedulingStatus should change to PENDING when the job is submitted to the Dispatcher. A *registerJob()* method might be needed in *RunningJobsRegistry* to add the job as PENDING at that point.

When the job is globally terminated (FINISHED, CANCELED or FAILED), we can call an *unregisterJob()* method in *RunningJobsRegistry* to remove the job status file, rather than changing it to DONE.
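
As a rough illustration of the proposed lifecycle, the sketch below keeps the statuses in an in-memory map. It is only an assumption about how such a registry could behave: the class JobRegistrySketch, the String job IDs and the getStatus() method are made up for this example, and a real implementation would persist the status (for example in ZooKeeper) rather than in a map.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

public class JobRegistrySketch {

    /** Hypothetical status values; a job without an entry is treated as NONE. */
    enum Status { PENDING, RUNNING }

    // In-memory stand-in for the per-job status files.
    private final Map<String, Status> statuses = new ConcurrentHashMap<>();

    /** Called on job submission: records the job as PENDING unless it is already known. */
    void registerJob(String jobId) {
        statuses.putIfAbsent(jobId, Status.PENDING);
    }

    /** Marks a registered job as RUNNING; a job that was never registered stays absent. */
    void setJobRunning(String jobId) {
        statuses.computeIfPresent(jobId, (id, current) -> Status.RUNNING);
    }

    /** Called on global termination: removes the entry instead of writing DONE. */
    void unregisterJob(String jobId) {
        statuses.remove(jobId);
    }

    /** Returns the recorded status, or an empty Optional (NONE) if the job is unknown. */
    Optional<Status> getStatus(String jobId) {
        return Optional.ofNullable(statuses.get(jobId));
    }

    public static void main(String[] args) {
        JobRegistrySketch registry = new JobRegistrySketch();
        registry.registerJob("job-1");
        System.out.println(registry.getStatus("job-1"));
        registry.setJobRunning("job-1");
        System.out.println(registry.getStatus("job-1"));
        registry.unregisterJob("job-1");
        System.out.println(registry.getStatus("job-1"));
    }
}
```

With this shape, a terminated job and a job that was never submitted look the same to a standby (no entry), which is what lets it decide not to start the job again.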

 

By the way, it seems Flink does not currently distinguish between the PENDING and RUNNING statuses when making decisions; they are handled in the same way. In the future, we could use them to identify whether a JM failover has happened.

> Standby per job mode Dispatchers don't know job's JobSchedulingStatus
> ---------------------------------------------------------------------
>
>                 Key: FLINK-11813
>                 URL: https://issues.apache.org/jira/browse/FLINK-11813
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.6.4, 1.7.2, 1.8.0
>            Reporter: Till Rohrmann
>            Priority: Major
>
> At the moment, it can happen that standby {{Dispatchers}} in per job mode will restart a terminated job after they gained leadership. The problem is that we currently clear the {{RunningJobsRegistry}} once a job has reached a globally terminal state. After the leading {{Dispatcher}} terminates, a standby {{Dispatcher}} will gain leadership. Without having the information from the {{RunningJobsRegistry}} it cannot tell whether the job has been executed or whether the {{Dispatcher}} needs to re-execute the job. At the moment, the {{Dispatcher}} will assume that there was a fault and hence re-execute the job. This can lead to duplicate results.
> I think we need some way to tell standby {{Dispatchers}} that a certain job has been successfully executed. One trivial solution could be to not clean up the {{RunningJobsRegistry}} but then we will clutter ZooKeeper.


