Posted to issues@flink.apache.org by "Stephan Ewen (JIRA)" <ji...@apache.org> on 2017/01/17 16:38:26 UTC

[jira] [Commented] (FLINK-5501) Determine whether the job starts from last JobManager failure

    [ https://issues.apache.org/jira/browse/FLINK-5501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15826357#comment-15826357 ] 

Stephan Ewen commented on FLINK-5501:
-------------------------------------

I think the approach you outlined is good.

For future reference, [~till.rohrmann] and I also thought through the following alternatives, which we rejected in the end:

  1. Extend the leader election service such that it carries a number that increments whenever the leader changes. If the leader is elected with {{0}}, it simply starts the job; if it is elected with something {{!= 0}}, it starts with reconciling. That approach, however, is not very suitable for cluster sessions, and does not have a good separation of concerns.

  2. JobManager always starts the job, and if a TaskManager registers as "reconciling", it cancels the job and goes to "reconciling".
    - Advantage: No special state, plus eager acquisition of resources in case no reconciliation happens
    - Disadvantage: Reconciliation is the more common case (assuming very long running streaming jobs) and this runs off "in the wrong direction" for the common case, triggering unnecessary resource allocation. It is also probably more complicated to implement.
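
To make alternative 1 concrete, here is a minimal sketch of the decision it implies. This is not Flink code: {{CountingLeaderElection}}, {{StartAction}} and {{onLeadershipGranted}} are hypothetical names made up for illustration, and the counter lives in memory here, whereas a real election service would have to persist it.

```java
import java.util.concurrent.atomic.AtomicLong;

public class LeaderEpochSketch {

    /** What a newly elected leader does first (hypothetical names). */
    enum StartAction { START_JOB, RECONCILE }

    /** Hypothetical stand-in for a leader election service that counts leadership grants. */
    static final class CountingLeaderElection {
        private final AtomicLong epoch = new AtomicLong(0);

        /** Returns the number handed to the newly elected leader, then increments it. */
        long grantLeadership() {
            return epoch.getAndIncrement();
        }
    }

    /** Elected with 0: start the job. Elected with anything else: reconcile. */
    static StartAction onLeadershipGranted(long epoch) {
        return epoch == 0 ? StartAction.START_JOB : StartAction.RECONCILE;
    }

    public static void main(String[] args) {
        CountingLeaderElection election = new CountingLeaderElection();
        // The first leader sees 0 and starts the job.
        System.out.println(onLeadershipGranted(election.grantLeadership()));
        // Any later leader sees a non-zero number and reconciles.
        System.out.println(onLeadershipGranted(election.grantLeadership()));
    }
}
```

In this sketch the job lifecycle decision hangs off a value owned by the election service, which is the separation-of-concerns problem mentioned above.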


> Determine whether the job starts from last JobManager failure
> -------------------------------------------------------------
>
>                 Key: FLINK-5501
>                 URL: https://issues.apache.org/jira/browse/FLINK-5501
>             Project: Flink
>          Issue Type: Sub-task
>          Components: JobManager
>            Reporter: Zhijiang Wang
>            Assignee: Zhijiang Wang
>
> When the {{JobManagerRunner}} is granted leadership, it should check whether the current job is already running. If the job is running, the {{JobManager}} should reconcile itself (enter the RECONCILING state) and wait for the {{TaskManager}} to report task status. Otherwise, the {{JobManager}} can schedule the {{ExecutionGraph}} in the usual way.
> The {{RunningJobsRegistry}} can provide a way to check the job's running status, but we should extend the current interface and adjust the related process to support this.
> 1. {{RunningJobsRegistry}} sets the RUNNING status when the {{JobManagerRunner}} is granted leadership for the first time.
> 2. If the job finishes, the {{RunningJobsRegistry}} sets the job status to FINISHED, and the status is deleted before exit.
> 3. If the mini cluster starts multiple {{JobManagerRunner}} instances and the leading {{JobManagerRunner}} has already finished the job and set the job status to FINISHED, the other {{JobManagerRunner}} instances will exit when they are granted leadership.
> 4. If the {{JobManager}} fails, the job status remains RUNNING. So if a {{JobManagerRunner}} (the previous one or a new one) is granted leadership again, it checks the job status and enters the {{RECONCILING}} state.
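
The four steps in the description above could be sketched roughly as follows. This is an illustrative model only, not the actual {{RunningJobsRegistry}} interface: {{InMemoryRegistry}}, {{JobStatus}}, {{Action}} and {{onLeadershipGranted}} are hypothetical names, the {{PENDING}} value for "no status recorded yet" is an assumption, and a real registry would be backed by highly available storage rather than an in-memory map.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class RegistrySketch {

    /** PENDING is an assumed name for "no status recorded yet". */
    enum JobStatus { PENDING, RUNNING, FINISHED }

    /** What the runner does once it is granted leadership (hypothetical names). */
    enum Action { SCHEDULE, RECONCILE, EXIT }

    /** Hypothetical in-memory stand-in for the registry described in the issue. */
    static final class InMemoryRegistry {
        private final Map<String, JobStatus> statuses = new ConcurrentHashMap<>();

        JobStatus get(String jobId) {
            return statuses.getOrDefault(jobId, JobStatus.PENDING);
        }

        void set(String jobId, JobStatus status) {
            statuses.put(jobId, status);
        }

        /** Step 2: the status is deleted before exit. */
        void clear(String jobId) {
            statuses.remove(jobId);
        }
    }

    /** Decides what to do on a leadership grant, based on the recorded status. */
    static Action onLeadershipGranted(InMemoryRegistry registry, String jobId) {
        switch (registry.get(jobId)) {
            case RUNNING:
                // Step 4: a previous JobManager failed while the job was running.
                return Action.RECONCILE;
            case FINISHED:
                // Step 3: another runner already finished the job.
                return Action.EXIT;
            default:
                // Step 1: first grant, so mark the job RUNNING and schedule it.
                registry.set(jobId, JobStatus.RUNNING);
                return Action.SCHEDULE;
        }
    }

    public static void main(String[] args) {
        InMemoryRegistry registry = new InMemoryRegistry();
        System.out.println(onLeadershipGranted(registry, "job-1")); // first grant
        System.out.println(onLeadershipGranted(registry, "job-1")); // grant after a failure
        registry.set("job-1", JobStatus.FINISHED);                  // job finished
        System.out.println(onLeadershipGranted(registry, "job-1")); // late grant to another runner
    }
}
```

Note that the check and the write in the first-grant branch are not atomic in this sketch; it relies on only one runner holding leadership at a time.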



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)