Posted to issues@flink.apache.org by "Yang Wang (JIRA)" <ji...@apache.org> on 2019/07/08 08:08:00 UTC
[jira] [Commented] (FLINK-12751) Create file based HA support
[ https://issues.apache.org/jira/browse/FLINK-12751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16880105#comment-16880105 ]
Yang Wang commented on FLINK-12751:
-----------------------------------
Hi [~Xeli], I am not sure about the backend implementation of PVC ReadWriteOnce. If Kubernetes guarantees that the volume is mounted by only one job manager at a time, then we will not need leader election.
Hi [~borisl], if job manager leader election could be supported, then I think this is a good feature, especially since we want to deploy Flink clusters on Kubernetes.
> Create file based HA support
> ----------------------------
>
> Key: FLINK-12751
> URL: https://issues.apache.org/jira/browse/FLINK-12751
> Project: Flink
> Issue Type: Improvement
> Components: FileSystems
> Affects Versions: 1.8.0, 1.9.0, 2.0.0
> Environment: Flink on k8 and Mini cluster
> Reporter: Boris Lublinsky
> Priority: Major
> Labels: features, pull-request-available
> Original Estimate: 168h
> Time Spent: 10m
> Remaining Estimate: 167h 50m
>
> In the current Flink implementation, HA support can be implemented using either ZooKeeper or a custom factory class.
> Add an HA implementation based on PVC. The idea behind this implementation is as follows:
> * Because the implementation assumes a single instance of the Job manager (Job manager selection and restarts are done by a K8 Deployment of 1), URL management is done using the StandaloneHaServices implementation (in the case of a cluster) and the EmbeddedHaServices implementation (in the case of a mini cluster)
> * For management of the submitted Job Graphs, checkpoint counters and completed checkpoints, the implementation uses the following file system layout:
> {code}
> ha                                       -----> root of the HA data
>     checkpointcounter                    -----> checkpoint counter folder
>         <job ID>                         -----> job id folder
>             <counter file>               -----> counter file
>         <another job ID>                 -----> another job id folder
>         ...........
>     completedCheckpoint                  -----> completed checkpoint folder
>         <job ID>                         -----> job id folder
>             <checkpoint file>            -----> checkpoint file
>             <another checkpoint file>    -----> checkpoint file
>             ...........
>         <another job ID>                 -----> another job id folder
>         ...........
>     submittedJobGraph                    -----> submitted graph folder
>         <job ID>                         -----> job id folder
>             <graph file>                 -----> graph file
>         <another job ID>                 -----> another job id folder
>         ...........
> {code}
> The implementation overwrites 2 of the Flink files:
> * HighAvailabilityServicesUtils - added `FILESYSTEM` option for picking HA service
> * HighAvailabilityMode - added `FILESYSTEM` to available HA options.
> The actual implementation adds the following classes:
> * `FileSystemHAServices` - an implementation of a `HighAvailabilityServices` for file system
> * `FileSystemUtils` - support class for creation of runtime components.
> * `FileSystemStorageHelper` - file system operations implementation for filesystem based HA
> * `FileSystemCheckpointRecoveryFactory` - an implementation of a `CheckpointRecoveryFactory` for file system
> * `FileSystemCheckpointIDCounter` - an implementation of a `CheckpointIDCounter` for file system
> * `FileSystemCompletedCheckpointStore` - an implementation of a `CompletedCheckpointStore` for file system
> * `FileSystemSubmittedJobGraphStore` - an implementation of a `SubmittedJobGraphStore` for file system
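
To make the quoted layout concrete, here is a minimal sketch of how the three folders and a per-job counter file could be addressed with plain java.nio. It is not the code from the pull request: the class name FileHaLayoutSketch, the file name "counter" and the starting value 1 are all assumptions made for illustration, and the sketch relies on the single-writer assumption stated in the description (one job manager at a time), so it does no locking across processes.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch of the quoted layout; not the classes listed in the issue.
final class FileHaLayoutSketch {
    private final Path root; // the "ha" folder, e.g. a directory on the mounted PVC

    FileHaLayoutSketch(Path root) {
        this.root = root;
    }

    // ha/checkpointcounter/<job ID>
    Path counterDir(String jobId) {
        return root.resolve("checkpointcounter").resolve(jobId);
    }

    // ha/completedCheckpoint/<job ID>
    Path completedCheckpointDir(String jobId) {
        return root.resolve("completedCheckpoint").resolve(jobId);
    }

    // ha/submittedJobGraph/<job ID>
    Path jobGraphDir(String jobId) {
        return root.resolve("submittedJobGraph").resolve(jobId);
    }

    // Returns the current counter value for the job and stores value + 1.
    // Only safe with a single writer, which is what the issue assumes.
    synchronized long getAndIncrement(String jobId) throws IOException {
        Path dir = counterDir(jobId);
        Files.createDirectories(dir);
        Path file = dir.resolve("counter"); // file name is an assumption
        long current = 1L;                  // starting value is an assumption
        if (Files.exists(file)) {
            String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            current = Long.parseLong(text.trim());
        }
        // Write to a temp file first, then move it into place, so a crash
        // during the write leaves the previous counter file untouched.
        Path tmp = dir.resolve("counter.tmp");
        Files.write(tmp, Long.toString(current + 1).getBytes(StandardCharsets.UTF_8));
        Files.move(tmp, file, StandardCopyOption.REPLACE_EXISTING);
        return current;
    }
}
```

Because the counter lives in a file under the HA root, a restarted job manager that builds a new FileHaLayoutSketch over the same root continues from the stored value rather than starting over.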
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)