Posted to issues@flink.apache.org by "TisonKun (Jira)" <ji...@apache.org> on 2019/09/05 08:18:01 UTC

[jira] [Comment Edited] (FLINK-10333) Rethink ZooKeeper based stores (SubmittedJobGraph, MesosWorker, CompletedCheckpoints)

    [ https://issues.apache.org/jira/browse/FLINK-10333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16923170#comment-16923170 ] 

TisonKun edited comment on FLINK-10333 at 9/5/19 8:17 AM:
----------------------------------------------------------

{{LeaderServer}} is regarded as a prerequisite for the new high-availability services. Otherwise we would have to implement an embedded one, which the design should not require (see also [here|https://lists.apache.org/x/thread.html/0da7ff1f985125f5f0f16b15cd1b6617f68d15cf11c421245071a485@%3Cdev.flink.apache.org%3E]), and live with inconsistent views/APIs between the different implementations (see the concerns above about retrieving the JobMaster address).

We can start a separate thread to handle its details and implementation if we reach a consensus here. It could be integrated into the current codebase cleanly and independently.


was (Author: tison):
The details and implementation of {{LeaderServer}} are regarded as a prerequisite for the new high-availability services. Otherwise we would have to implement an embedded one, which the design should not require (see also [here|https://lists.apache.org/x/thread.html/0da7ff1f985125f5f0f16b15cd1b6617f68d15cf11c421245071a485@%3Cdev.flink.apache.org%3E]), and live with inconsistent views/APIs between the different implementations (see the concerns above about retrieving the JobMaster address).

We can start a separate thread to handle it if we reach a consensus here. It could be integrated into the current codebase cleanly and independently.

> Rethink ZooKeeper based stores (SubmittedJobGraph, MesosWorker, CompletedCheckpoints)
> -------------------------------------------------------------------------------------
>
>                 Key: FLINK-10333
>                 URL: https://issues.apache.org/jira/browse/FLINK-10333
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.5.3, 1.6.0, 1.7.0
>            Reporter: Till Rohrmann
>            Priority: Major
>         Attachments: screenshot-1.png
>
>
> While going over the ZooKeeper based stores ({{ZooKeeperSubmittedJobGraphStore}}, {{ZooKeeperMesosWorkerStore}}, {{ZooKeeperCompletedCheckpointStore}}) and the underlying {{ZooKeeperStateHandleStore}} I noticed several inconsistencies which were introduced with past incremental changes.
> * Depending on whether {{ZooKeeperStateHandleStore#getAllSortedByNameAndLock}} or {{ZooKeeperStateHandleStore#getAllAndLock}} is called, deserialization problems will either lead to removing the Znode or not
> * {{ZooKeeperStateHandleStore}} leaves inconsistent state in case of exceptions (e.g. {{#getAllAndLock}} won't release the acquired locks in case of a failure)
> * {{ZooKeeperStateHandleStore}} has too many responsibilities. It would be better to move {{RetrievableStateStorageHelper}} out of it for a better separation of concerns
> * {{ZooKeeperSubmittedJobGraphStore}} overwrites a stored {{JobGraph}} even if it is locked. This should not happen since it could leave another system in an inconsistent state (imagine a changed {{JobGraph}} which restores from an old checkpoint)
> * Redundant but also somewhat inconsistent put logic in the different stores
> * Shadowing of ZooKeeper-specific exceptions in {{ZooKeeperStateHandleStore}} which were expected to be caught in {{ZooKeeperSubmittedJobGraphStore}}
> * Getting rid of the {{SubmittedJobGraphListener}} would be helpful
> These problems made me wonder how reliably these components actually work. Since these components are very important, I propose to refactor them.
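
To illustrate the second point in the list above (locks left behind when {{#getAllAndLock}} fails), here is a minimal sketch of rolling back acquired locks on failure. It is only an illustration under stated assumptions: {{InMemoryHandleStore}} and all of its methods are hypothetical in-memory stand-ins, not Flink's {{ZooKeeperStateHandleStore}} API, and the "corrupt" prefix merely simulates a deserialization problem.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory stand-in for a lock-protected handle store.
// It does not talk to ZooKeeper and is not part of Flink.
final class InMemoryHandleStore {
    private final Map<String, String> entries = new LinkedHashMap<>();
    private final Set<String> locked = new HashSet<>();

    void put(String path, String serialized) {
        entries.put(path, serialized);
    }

    boolean isLocked(String path) {
        return locked.contains(path);
    }

    int lockCount() {
        return locked.size();
    }

    // Simulated deserialization: payloads starting with "corrupt" fail.
    private static String deserialize(String serialized) {
        if (serialized.startsWith("corrupt")) {
            throw new IllegalStateException("cannot deserialize: " + serialized);
        }
        return "state:" + serialized;
    }

    // Locks every entry and returns the deserialized values.
    // If anything fails, the locks acquired by this call are released
    // again before the exception propagates, so no lock leaks.
    List<String> getAllAndLock() {
        List<String> acquiredByThisCall = new ArrayList<>();
        List<String> result = new ArrayList<>();
        try {
            for (Map.Entry<String, String> entry : entries.entrySet()) {
                // Track only locks that this call newly acquired.
                if (locked.add(entry.getKey())) {
                    acquiredByThisCall.add(entry.getKey());
                }
                result.add(deserialize(entry.getValue()));
            }
            return result;
        } catch (RuntimeException e) {
            for (String path : acquiredByThisCall) {
                locked.remove(path);
            }
            throw e;
        }
    }
}
```

With this shape, a caller either receives all values with all locks held, or receives an exception with the lock state unchanged. Whether a failed deserialization should additionally remove the entry (the first point in the list) is a separate decision that this sketch deliberately leaves out.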


