Posted to issues@flink.apache.org by "Edmond (JIRA)" <ji...@apache.org> on 2018/12/11 17:02:00 UTC
[jira] [Updated] (FLINK-11132) Restore From Savepoint on HA Setup

     [ https://issues.apache.org/jira/browse/FLINK-11132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edmond updated FLINK-11132:
---------------------------
    Description: 
In our current setup we have one job-manager (standalone-job.sh) and one task-manager (taskmanager.sh) deployed as a job cluster in HA mode (ZooKeeper).

We tried to run a simple stateful Flink app that periodically writes checkpoints and savepoints to shared storage, so that we could later re-run it from a specific savepoint. However, in HA mode it seems to ignore the savepoint restore flag (--fromSavepoint) and recovers from the last checkpoint instead. When we removed the HA configuration, savepoint restoration was successful.

 

flink-conf.yaml:

high-availability: zookeeper
high-availability.zookeeper.quorum: zookeeper-host:2181
high-availability.zookeeper.path.root: /flink
high-availability.cluster-id: our_cluster_id
high-availability.storageDir: gs://app_bucket/flink_ns/ha
high-availability.jobmanager.port: 6123
state.backend.fs.memory-threshold: 0
state.checkpoints.dir: gs://app_bucket/flink_ns/checkpoints
state.savepoints.dir: gs://app_bucket/flink_ns/savepoints

When we tried to run it in non-HA mode we just removed the high-availability.* parameters.
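
For anyone reproducing the non-HA run, the removal step can be sketched roughly as below. This is only an illustration: the helper name strip_ha_keys and the file names in the usage comment are hypothetical, and it assumes every HA setting sits on its own line starting with "high-availability".

```shell
# Hypothetical helper: read a flink-conf.yaml on stdin and print it without
# the bare "high-availability:" key or any "high-availability.*" key.
# Leading whitespace before the key is tolerated.
strip_ha_keys() {
  grep -v -E '^[[:space:]]*high-availability([.:]|$)'
}

# Usage (placeholder file names, adjust to your layout):
#   strip_ha_keys < conf/flink-conf.yaml > conf/flink-conf.non-ha.yaml
```

The state.* settings pass through unchanged, so both runs keep the same checkpoint and savepoint directories.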

Job Manager command before restore:

./standalone-job.sh start-foreground --job-classname  com.TestApp -Djobmanager.rpc.address=127.0.0.1 -Dparallelism.default=1 -Dblob.server.port=6124 -Dquery.server.ports=6125

Job Manager command when trying to restore:

./standalone-job.sh start-foreground --job-classname  com.TestApp -Djobmanager.rpc.address=127.0.0.1 -Dparallelism.default=1 -Dblob.server.port=6124 -Dquery.server.ports=6125 --fromSavepoint gs://app_bucket/flink_ns/savepoints/savepoint_1/savepoint-000000-e7f1f0f63c41
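
To make it clear that the two Job Manager invocations differ only in the --fromSavepoint flag, here is a small sketch of a wrapper that prints the command line. The function name job_manager_cmd is hypothetical; the options are copied from the commands above, and the function only prints the command rather than running it.

```shell
# Hypothetical wrapper: print the standalone-job.sh command line used above.
# With no argument it prints the "before restore" command; with a savepoint
# path as the first argument it appends --fromSavepoint <path>.
job_manager_cmd() {
  cmd="./standalone-job.sh start-foreground --job-classname com.TestApp"
  cmd="$cmd -Djobmanager.rpc.address=127.0.0.1 -Dparallelism.default=1"
  cmd="$cmd -Dblob.server.port=6124 -Dquery.server.ports=6125"
  if [ -n "${1:-}" ]; then
    cmd="$cmd --fromSavepoint $1"
  fi
  printf '%s\n' "$cmd"
}

# Usage:
#   job_manager_cmd
#   job_manager_cmd gs://app_bucket/flink_ns/savepoints/savepoint_1/savepoint-000000-e7f1f0f63c41
```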

 

Task Manager command:

./taskmanager.sh start-foreground -Djobmanager.rpc.address=127.0.0.1

  was:
In our current setup we have one job-manager (standalone-job.sh) and one task-manager (taskmanager.sh) deployed as a job cluster in HA mode (ZooKeeper).

We tried to run a simple stateful Flink app that periodically writes checkpoints and savepoints to shared storage, so that we could later re-run it from a specific savepoint. However, in HA mode it seems to ignore the savepoint restore flag (--fromSavepoint) and recovers from the last checkpoint instead. When we removed the HA configuration, savepoint restoration was successful.


> Restore From Savepoint on HA Setup
> ----------------------------------
>
>                 Key: FLINK-11132
>                 URL: https://issues.apache.org/jira/browse/FLINK-11132
>             Project: Flink
>          Issue Type: Bug
>          Components: State Backends, Checkpointing
>    Affects Versions: 1.6.2, 1.7.0
>         Environment: flink-conf.yaml:
> high-availability: zookeeper
> high-availability.zookeeper.quorum: zookeeper-host:2181
> high-availability.zookeeper.path.root: /flink
> high-availability.cluster-id: our_cluster_id
> high-availability.storageDir: gs://app_bucket/flink_ns/ha
> high-availability.jobmanager.port: 6123
> state.backend.fs.memory-threshold: 0
> state.checkpoints.dir: gs://app_bucket/flink_ns/checkpoints
> state.savepoints.dir: gs://app_bucket/flink_ns/savepoints
> When we tried to run it in non-HA mode we just removed the high-availability.* parameters.
> Job Manager command before restore:
> ./standalone-job.sh start-foreground --job-classname  com.TestApp -Djobmanager.rpc.address=127.0.0.1 -Dparallelism.default=1 -Dblob.server.port=6124 -Dquery.server.ports=6125
> Job Manager command when trying to restore:
> ./standalone-job.sh start-foreground --job-classname  com.TestApp -Djobmanager.rpc.address=127.0.0.1 -Dparallelism.default=1 -Dblob.server.port=6124 -Dquery.server.ports=6125 --fromSavepoint gs://app_bucket/flink_ns/savepoints/savepoint_1/savepoint-000000-e7f1f0f63c41
>  
> Task Manager command:
> ./taskmanager.sh start-foreground -Djobmanager.rpc.address=127.0.0.1
>  
>            Reporter: Edmond
>            Priority: Major


