You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by "TisonKun (JIRA)" <ji...@apache.org> on 2018/12/27 10:26:00 UTC

[jira] [Issue Comment Deleted] (FLINK-10333) Rethink ZooKeeper based stores (SubmittedJobGraph, MesosWorker, CompletedCheckpoints)

     [ https://issues.apache.org/jira/browse/FLINK-10333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

TisonKun updated FLINK-10333:
-----------------------------
    Comment: was deleted

(was: Hi [~StephanEwen] and [~till.rohrmann]

With an offline discuss with [~xiaogang.shi] we see ZK has a transactional mechanism so that we can ensure only the leader writes ZK.

Given this knowledge and the inconsistencies [~till.rohrmann] noticed, before go into reimplementation, I did a survey of the usage of ZK based stores in flink. Ideally there is exact one role who writes a specific znode.

There are four types of znodes that flink writes. Besides {{SubmittedJobGraphStore}} written by {{Dispatcher}}, {{CompletedCheckpointStore}} written by {{JobMaster}} and {{MesosWorkerStore}} written by {{MesosResourceManager}}, there is the {{RunningJobsRegistry}} that also has a ZK based implementation.

All of the first three write ZK with a heavy “store lock”, but as [~xiaogang.shi]  pointing out, it still lacks of atomicity. And with the solution based on ZK transaction — for example, a current {{Dispatcher}} leader {{setData}} with {{check}} for {{election-node-path}} — we can ensure the atomicity while getting rid of the lock.

For the last one, {{RunningJobsRegistry}}, situation becomes a bit more complex. {{JobManagerRunner}} is responsible for {{setJobRunning}} and {{setJobFinished}} and {{Dispatcher}} is responsible for {{clearJob}}. This is against the ideal that one role for one znode. Also I notice that the gap between the semantic of {{DONE}} and that of clear is ambiguous. 

We might try to prevent the same job be processed by an approach other than check an ephemeral {{DONE}} status. What if we replace {{setJobFinished}} with clearing {{RunningJobsRegistry}}?)

> Rethink ZooKeeper based stores (SubmittedJobGraph, MesosWorker, CompletedCheckpoints)
> -------------------------------------------------------------------------------------
>
>                 Key: FLINK-10333
>                 URL: https://issues.apache.org/jira/browse/FLINK-10333
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination
>    Affects Versions: 1.5.3, 1.6.0, 1.7.0
>            Reporter: Till Rohrmann
>            Priority: Major
>             Fix For: 1.8.0
>
>
> While going over the ZooKeeper based stores ({{ZooKeeperSubmittedJobGraphStore}}, {{ZooKeeperMesosWorkerStore}}, {{ZooKeeperCompletedCheckpointStore}}) and the underlying {{ZooKeeperStateHandleStore}} I noticed several inconsistencies which were introduced with past incremental changes.
> * Depending whether {{ZooKeeperStateHandleStore#getAllSortedByNameAndLock}} or {{ZooKeeperStateHandleStore#getAllAndLock}} is called, deserialization problems will either lead to removing the Znode or not
> * {{ZooKeeperStateHandleStore}} leaves inconsistent state in case of exceptions (e.g. {{#getAllAndLock}} won't release the acquired locks in case of a failure)
> * {{ZooKeeperStateHandleStore}} has too many responsibilities. It would be better to move {{RetrievableStateStorageHelper}} out of it for a better separation of concerns
> * {{ZooKeeperSubmittedJobGraphStore}} overwrites a stored {{JobGraph}} even if it is locked. This should not happen since it could leave another system in an inconsistent state (imagine a changed {{JobGraph}} which restores from an old checkpoint)
> * Redundant but also somewhat inconsistent put logic in the different stores
> * Shadowing of ZooKeeper specific exceptions in {{ZooKeeperStateHandleStore}} which were expected to be caught in {{ZooKeeperSubmittedJobGraphStore}}
> * Getting rid of the {{SubmittedJobGraphListener}} would be helpful
> These problems made me think how reliable these components actually work. Since these components are very important, I propose to refactor them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)