Posted to issues@flink.apache.org by "Yang Wang (JIRA)" <ji...@apache.org> on 2019/07/08 08:08:00 UTC

[jira] [Comment Edited] (FLINK-12751) Create file based HA support

    [ https://issues.apache.org/jira/browse/FLINK-12751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16873830#comment-16873830 ] 

Yang Wang edited comment on FLINK-12751 at 7/8/19 8:07 AM:
-----------------------------------------------------------

[~borisl] Sorry for the late reply.

As you say, job manager selection is handled by a k8s Deployment with a replica count of 1. However, if the kubelet crashes, the pod running on that node may not be terminated, so two job managers may end up running at the same time. We need to use etcd/ZooKeeper to do the job manager election.
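
To illustrate the concern, here is a minimal sketch of a lease-based election. It is not Flink code: the class and method names are hypothetical, and an in-memory `AtomicReference` stands in for the atomic compare-and-set that etcd/ZooKeeper would provide. The point is only that a stale job manager cannot become leader while another one holds a live lease.

```java
import java.util.concurrent.atomic.AtomicReference;

public class LeaseElectionSketch {
    /** Immutable lease record: who holds it and when it expires. */
    static final class Lease {
        final String holder;
        final long expiresAtMillis;

        Lease(String holder, long expiresAtMillis) {
            this.holder = holder;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    // Assumption: stands in for a key in etcd/ZooKeeper; only the atomic
    // compare-and-set semantics matter for this sketch.
    private final AtomicReference<Lease> store = new AtomicReference<>();

    /** Tries to acquire or renew the lease; true means the candidate is leader. */
    public boolean tryAcquire(String candidate, long nowMillis, long ttlMillis) {
        Lease current = store.get();
        boolean free = current == null || current.expiresAtMillis <= nowMillis;
        boolean mine = current != null && current.holder.equals(candidate);
        if (!free && !mine) {
            return false; // another job manager holds a live lease
        }
        // Fails if someone else changed the lease between the read and the write.
        return store.compareAndSet(current, new Lease(candidate, nowMillis + ttlMillis));
    }

    public static void main(String[] args) {
        LeaseElectionSketch election = new LeaseElectionSketch();
        System.out.println(election.tryAcquire("jm-old", 0L, 10_000L));      // true
        System.out.println(election.tryAcquire("jm-new", 5_000L, 10_000L));  // false
        System.out.println(election.tryAcquire("jm-new", 10_000L, 10_000L)); // true
        System.out.println(election.tryAcquire("jm-old", 12_000L, 10_000L)); // false
    }
}
```

In the last call, the old job manager (for example one left behind on a node whose kubelet crashed) is refused leadership because the new one holds the lease.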


was (Author: fly_in_gis):
[~borisl] Sorry for the late reply.

As you say, job manager selection is handled by a k8s Deployment with a replica count of 1. However, if the kubelet crashes, the pod running on that node may not be terminated, so two job managers may end up running at the same time. We need to use etcd/ZooKeeper to do the job manager selection.

> Create file based HA support
> ----------------------------
>
>                 Key: FLINK-12751
>                 URL: https://issues.apache.org/jira/browse/FLINK-12751
>             Project: Flink
>          Issue Type: Improvement
>          Components: FileSystems
>    Affects Versions: 1.8.0, 1.9.0, 2.0.0
>         Environment: Flink on k8 and Mini cluster
>            Reporter: Boris Lublinsky
>            Priority: Major
>              Labels: features, pull-request-available
>   Original Estimate: 168h
>          Time Spent: 10m
>  Remaining Estimate: 167h 50m
>
> In the current Flink implementation, HA support can be implemented using either ZooKeeper or a custom factory class.
> This issue adds an HA implementation based on a PVC (persistent volume claim). The idea behind this implementation
> is as follows:
> * Because the implementation assumes a single instance of the Job Manager (Job Manager selection and restarts are done by a K8s Deployment of 1), URL management is done using the StandaloneHaServices implementation (in the case of a cluster) and the EmbeddedHaServices implementation (in the case of a mini cluster).
> * For management of the submitted job graphs, the checkpoint counter and the completed checkpoints, the implementation uses the following file system layout:
> {code}
>  ha                                        -----> root of the HA data
>      checkpointcounter                     -----> checkpoint counter folder
>          <job ID>                          -----> job id folder
>              <counter file>                -----> counter file
>          <another job ID>                  -----> another job id folder
>          ...........
>      completedCheckpoint                   -----> completed checkpoint folder
>          <job ID>                          -----> job id folder
>              <checkpoint file>             -----> checkpoint file
>              <another checkpoint file>     -----> checkpoint file
>              ...........
>          <another job ID>                  -----> another job id folder
>          ...........
>      submittedJobGraph                     -----> submitted graph folder
>          <job ID>                          -----> job id folder
>              <graph file>                  -----> graph file
>          <another job ID>                  -----> another job id folder
>          ...........
> {code}
> The implementation overwrites 2 of the Flink files:
> * `HighAvailabilityServicesUtils` - adds a `FILESYSTEM` option for picking the HA service
> * `HighAvailabilityMode` - adds `FILESYSTEM` to the available HA options.
> The actual implementation adds the following classes:
> * `FileSystemHAServices` - an implementation of a `HighAvailabilityServices` for file system
> * `FileSystemUtils` - support class for creation of runtime components.
> * `FileSystemStorageHelper` - file system operations implementation for filesystem based HA
> * `FileSystemCheckpointRecoveryFactory` - an implementation of a `CheckpointRecoveryFactory` for file system
> * `FileSystemCheckpointIDCounter` - an implementation of a `CheckpointIDCounter` for file system
> * `FileSystemCompletedCheckpointStore` - an implementation of a `CompletedCheckpointStore` for file system
> * `FileSystemSubmittedJobGraphStore` - an implementation of a `SubmittedJobGraphStore` for file system
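
As a rough illustration of how a file-backed checkpoint counter could sit on the `checkpointcounter/<job ID>/<counter file>` part of the layout above, here is a minimal sketch. It is not the code from the pull request: the class name `FileCounterSketch`, the file name `counter`, and the starting value of 1 are assumptions made for the example, and it relies on there being a single job manager writing to the directory.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class FileCounterSketch {
    private final Path counterFile;

    public FileCounterSketch(Path haRoot, String jobId) {
        // <ha root>/checkpointcounter/<job ID>/<counter file>; "counter" is a hypothetical file name.
        this.counterFile = haRoot.resolve("checkpointcounter").resolve(jobId).resolve("counter");
    }

    /** Returns the current ID and persists the next one, so the value survives a restart. */
    public synchronized long getAndIncrement() {
        try {
            // The file holds the next ID to hand out; assume counting starts at 1.
            long current = Files.exists(counterFile)
                    ? Long.parseLong(new String(Files.readAllBytes(counterFile), StandardCharsets.UTF_8).trim())
                    : 1L;
            Files.createDirectories(counterFile.getParent());
            // Write to a temporary file, then move it over the old one, so a crash
            // in the middle of the write does not leave a truncated counter file.
            Path tmp = counterFile.resolveSibling("counter.tmp");
            Files.write(tmp, Long.toString(current + 1).getBytes(StandardCharsets.UTF_8));
            Files.move(tmp, counterFile, StandardCopyOption.REPLACE_EXISTING);
            return current;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path haRoot = Paths.get(System.getProperty("java.io.tmpdir"), "ha-sketch-" + System.nanoTime());
        FileCounterSketch counter = new FileCounterSketch(haRoot, "job-1");
        System.out.println(counter.getAndIncrement()); // 1
        System.out.println(counter.getAndIncrement()); // 2
        // A fresh instance on the same root simulates a restarted job manager.
        System.out.println(new FileCounterSketch(haRoot, "job-1").getAndIncrement()); // 3
    }
}
```

The `synchronized` keyword only protects against concurrent calls within one JVM. If two job managers wrote to the same volume at once, they could hand out duplicate IDs.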


