You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by "Yang Wang (JIRA)" <ji...@apache.org> on 2019/07/08 08:08:00 UTC

[jira] [Commented] (FLINK-12751) Create file based HA support

    [ https://issues.apache.org/jira/browse/FLINK-12751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16880105#comment-16880105 ] 

Yang Wang commented on FLINK-12751:
-----------------------------------

Hi [~Xeli], i am not sure the backend implementation of PVC ReadWriteOnce. If the kubernetes is guaranteed to mount only once by job managers, then we will not need leader election.

 

Hi [~borisl], if the job manager election could be supported, then i think this is a good feature. Especially we want to deploy flink cluster on kubernetes.

> Create file based HA support
> ----------------------------
>
>                 Key: FLINK-12751
>                 URL: https://issues.apache.org/jira/browse/FLINK-12751
>             Project: Flink
>          Issue Type: Improvement
>          Components: FileSystems
>    Affects Versions: 1.8.0, 1.9.0, 2.0.0
>         Environment: Flink on k8 and Mini cluster
>            Reporter: Boris Lublinsky
>            Priority: Major
>              Labels: features, pull-request-available
>   Original Estimate: 168h
>          Time Spent: 10m
>  Remaining Estimate: 167h 50m
>
> In the current Flink implementation, HA support can be implemented either using Zookeeper or Custom Factory class.
> Add HA implementation based on PVC. The idea behind this implementation
> is as follows:
> * Because implementation assumes a single instance of Job manager (Job manager selection and restarts are done by K8 Deployment of 1)
> URL management is done using StandaloneHaServices implementation (in the case of cluster) and EmbeddedHaServices implementation (in the case of mini cluster)
> * For management of the submitted Job Graphs, checkpoint counter and completed checkpoint an implementation is leveraging the following file system layout
> {code}
>  ha -----> root of the HA data
>  checkpointcounter -----> checkpoint counter folder
>  <job ID> -----> job id folder
>  <counter file> -----> counter file
>  <another job ID> -----> another job id folder
>  ...........
>  completedCheckpoint -----> completed checkpoint folder
>  <job ID> -----> job id folder
>  <checkpoint file> -----> checkpoint file
>  <another checkpoint file> -----> checkpoint file
>  ...........
>  <another job ID> -----> another job id folder
>  ...........
>  submittedJobGraph -----> submitted graph folder
>  <job ID> -----> job id folder
>  <graph file> -----> graph file
>  <another job ID> -----> another job id folder
>  ...........
> {code}
> An implementation should overwrites 2 of the Flink files:
> * HighAvailabilityServicesUtils - added `FILESYSTEM` option for picking HA service
> * HighAvailabilityMode - added `FILESYSTEM` to available HA options.
> The actual implementation adds the following classes:
> * `FileSystemHAServices` - an implementation of a `HighAvailabilityServices` for file system
> * `FileSystemUtils` - support class for creation of runtime components.
> * `FileSystemStorageHelper` - file system operations implementation for filesystem based HA
> * `FileSystemCheckpointRecoveryFactory` - an implementation of a `CheckpointRecoveryFactory`for file system
> * `FileSystemCheckpointIDCounter` - an implementation of a `CheckpointIDCounter` for file system
> * `FileSystemCompletedCheckpointStore` - an implementation of a `CompletedCheckpointStore` for file system
> * `FileSystemSubmittedJobGraphStore` - an implementation of a `SubmittedJobGraphStore` for file system



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)