Posted to issues@flink.apache.org by "Ufuk Celebi (JIRA)" <ji...@apache.org> on 2015/10/20 14:56:27 UTC

[jira] [Resolved] (FLINK-2287) Implement JobManager high availability

     [ https://issues.apache.org/jira/browse/FLINK-2287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ufuk Celebi resolved FLINK-2287.
--------------------------------
    Resolution: Fixed

Fixed in sub-tasks.

> Implement JobManager high availability
> --------------------------------------
>
>                 Key: FLINK-2287
>                 URL: https://issues.apache.org/jira/browse/FLINK-2287
>             Project: Flink
>          Issue Type: Improvement
>          Components: JobManager, TaskManager
>            Reporter: Ufuk Celebi
>             Fix For: 0.10
>
>
> The problem: The JobManager (JM) is a single point of failure. When it crashes, TaskManagers (TM) fail all running jobs and try to reconnect to the same JM. A failed JM loses all state and cannot resume the running jobs, even if it recovers and the TMs reconnect.
> Solution: Implement JM fault tolerance/high availability by running multiple JM instances, with one as leader and the other(s) on standby. The exact coordination and state update protocol between JM, TM, and clients is covered in sub-tasks/issues.
> Related Wiki: https://cwiki.apache.org/confluence/display/FLINK/JobManager+High+Availability



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)