You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@flink.apache.org by "Till Rohrmann (JIRA)" <ji...@apache.org> on 2015/06/29 10:58:04 UTC

[jira] [Created] (FLINK-2289) Make JobManager highly available

Till Rohrmann created FLINK-2289:
------------------------------------

             Summary: Make JobManager highly available
                 Key: FLINK-2289
                 URL: https://issues.apache.org/jira/browse/FLINK-2289
             Project: Flink
          Issue Type: Improvement
            Reporter: Till Rohrmann
            Assignee: Till Rohrmann


Currently, the {{JobManager}} is the single point of failure in the Flink system. If it fails, then your job cannot be recovered and the Flink cluster is no longer able to receive new jobs.

Therefore, it is crucial to make the {{JobManager}} fault tolerant so that the Flink cluster can recover from failed {{JobManager}}. As a first step towards this goal, I propose to make the {{JobManager}} highly available by starting multiple instances and using Apache ZooKeeper to elect a leader. The leader is responsible for the execution of the Flink job. 

In case that the {{JobManager}} dies, one of the other running {{JobManager}} will be elected as the leader and take over the role of the leader. The {{Client}} and the {{TaskManager}} will automatically detect the new {{JobManager}} by querying the ZooKeeper cluster.

Note that this does not achieve full fault tolerance for the {{JobManager}} but it allows the cluster to recover from failed {{JobManager}}. The design of high-availability for the {{JobManager}} is tracked in the wiki here [1].

Resources:
[1] [https://cwiki.apache.org/confluence/display/FLINK/JobManager+High+Availability]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)