Posted to user@flink.apache.org by vishalovercome <vi...@moengage.com> on 2020/12/16 20:47:03 UTC

Will job manager restarts lead to repeated savepoint restoration?

My Flink job runs in Kubernetes. This is the setup:

1. One job running as a job cluster with one job manager
2. HA powered by ZooKeeper (works fine)
3. Job/Deployment manifests stored in GitHub and deployed to Kubernetes by
Argo
4. State persisted to S3

If I were to stop the job (drain and take a savepoint) and later resume it,
I'd have to update the job manager manifest with the savepoint location,
commit it to GitHub and redeploy. After deployment, I'd presumably have to
modify the manifest again to remove the savepoint location so as to avoid
restarting the application from the same savepoint. This raises some
questions (a rough manifest sketch follows the questions below):

1. If the job manager were to crash before the manifest is updated again,
won't Kubernetes restart the job manager from the savepoint rather than from
the latest checkpoint?
2. Is there a way to ensure that restoration from a savepoint doesn't happen
more than once, or at least not after the first successful checkpoint?
3. If even one checkpoint has been finalized, the job should prefer that
checkpoint over the savepoint. Will that happen automatically given
ZooKeeper?
4. Is it possible to avoid having to remove the savepoint path from the
Kubernetes manifest and simply rely on newer checkpoints/savepoints? It
feels rather clumsy to have to add it and then remove it manually. We could
use a cron job to remove it, but that's still clumsy.
5. Is there a way of asking Flink to use the latest savepoint rather than
specifying the location of a particular savepoint? If I were to manually
rename the S3 savepoint location to something fixed
(s3://fixed_savepoint_path_always), would there be any problem restoring the
job?
6. Is there any open source tool that solves this problem?
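
For context, the job manager part of our manifest looks roughly like the
sketch below. The image name, job class and S3 paths are placeholders for
our actual values; the --fromSavepoint argument is the piece we currently
add and remove by hand on every stop/resume.

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: flink-jobmanager
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: jobmanager
              image: our-registry/our-flink-job:latest            # placeholder image
              args:
                - "standalone-job"
                - "--job-classname"
                - "com.example.OurStreamingJob"                    # placeholder job class
                - "--fromSavepoint"
                - "s3://our-bucket/savepoints/savepoint-123abc"    # edited on every stop/resume
                - "--allowNonRestoredState"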





Re: Will job manager restarts lead to repeated savepoint restoration?

Posted by Till Rohrmann <tr...@apache.org>.
Hi Vishal,

thanks for the detailed description of the problems.

1. This is currently the intended behaviour of Flink. The reason is that if
the system is no longer connected to ZooKeeper, then we cannot rule out that
another process has taken over the leadership. FLINK-10052 aims to make this
behaviour configurable, and we intend to include it in the next major
release.
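
Once that lands, the idea is that you can opt into tolerating suspended
ZooKeeper connections via a flink-conf.yaml switch roughly along these lines
(the key name below is only a sketch and may change before the release):

    high-availability.zookeeper.client.tolerate-suspended-connections: true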

2. This is indeed a bug in the newly introduced application mode. It should
be fixed in Flink 1.11.3 and 1.12.0. Hence, I would recommend upgrading your
Flink cluster.

3. Hard to tell what the problem is here. From Flink's perspective, if it
cannot establish a connection to ZooKeeper, then it cannot be sure who the
leader is and whether it should start executing jobs. Maybe there is a
problem with the connection to the ZooKeeper cluster from the nodes on which
Flink runs. Decreasing the session timeouts usually makes the connection
even less stable if the underlying cause is a network issue.
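
For reference, the ZooKeeper client timeouts are controlled by the following
flink-conf.yaml settings (the values shown should be the defaults in recent
Flink versions, in milliseconds):

    high-availability.zookeeper.client.session-timeout: 60000
    high-availability.zookeeper.client.connection-timeout: 15000
    high-availability.zookeeper.client.retry-wait: 5000
    high-availability.zookeeper.client.max-retry-attempts: 3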

Cheers,
Till

On Mon, Dec 21, 2020 at 3:53 PM vishalovercome <vi...@moengage.com> wrote:

> I don't know how to reproduce it, but what I've observed are three kinds
> of termination when connectivity with ZooKeeper is somehow disrupted. I
> don't think it's an issue with ZooKeeper itself, as it has been supporting
> a much bigger Kafka cluster for a few years.
>
> 1. The first kind is exactly this -
> https://github.com/apache/flink/pull/11338. Basically, temporary loss of
> connectivity or a rolling upgrade of ZooKeeper will cause the job to
> terminate. It will eventually restart from where it left off.
> 2. The second kind is when the job terminates and restarts for the same
> reason but is unable to recover from a checkpoint. I think it's similar to
> this - https://issues.apache.org/jira/browse/FLINK-19154. If upgrading to
> 1.12.0 (from 1.11.2) will fix the second issue then I'll upgrade.
> 3. The third kind is where the job repeatedly restarts as it's unable to
> establish a session with ZooKeeper. I don't know if reducing the session
> timeout will help here, but in this case I'm forced to disable ZooKeeper
> HA entirely, as the job cannot even restart.
>
> I could create a JIRA ticket for discussing ZooKeeper itself if you
> suggest, but the ZooKeeper and savepoint issues are related, as I'm not
> sure what will happen in each of the above cases.
>
>
>
>

Re: Will job manager restarts lead to repeated savepoint restoration?

Posted by vishalovercome <vi...@moengage.com>.
I don't know how to reproduce it, but what I've observed are three kinds of
termination when connectivity with ZooKeeper is somehow disrupted. I don't
think it's an issue with ZooKeeper itself, as it has been supporting a much
bigger Kafka cluster for a few years.

1. The first kind is exactly this -
https://github.com/apache/flink/pull/11338. Basically, temporary loss of
connectivity or a rolling upgrade of ZooKeeper will cause the job to
terminate. It will eventually restart from where it left off.
2. The second kind is when the job terminates and restarts for the same
reason but is unable to recover from a checkpoint. I think it's similar to
this - https://issues.apache.org/jira/browse/FLINK-19154. If upgrading to
1.12.0 (from 1.11.2) will fix the second issue then I'll upgrade.
3. The third kind is where the job repeatedly restarts as it's unable to
establish a session with ZooKeeper. I don't know if reducing the session
timeout will help here, but in this case I'm forced to disable ZooKeeper HA
entirely, as the job cannot even restart.

I could create a JIRA ticket for discussing ZooKeeper itself if you suggest,
but the ZooKeeper and savepoint issues are related, as I'm not sure what
will happen in each of the above cases.




Re: Will job manager restarts lead to repeated savepoint restoration?

Posted by Till Rohrmann <tr...@apache.org>.
What exactly are the problems when the checkpoint recovery does not work?
Even if the ZooKeeper connection is temporarily lost, which leads to the
JobMaster losing leadership and the job being suspended, the next leader
should continue from where the first job stopped because of the lost
ZooKeeper connection.

What happens under the hood when restoring from a savepoint is that it is
inserted into the CompletedCheckpointStore, where the other checkpoints are
also stored. If a failure then happens, Flink will first try to recover from
a checkpoint/savepoint in the CompletedCheckpointStore, and only if this
store does not contain any checkpoints/savepoints will it use the savepoint
with which the job was started. The CompletedCheckpointStore persists the
checkpoint/savepoint information by writing pointers to ZooKeeper.
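
To make that concrete: with a typical ZooKeeper HA configuration, roughly
like the sketch below (quorum hosts, cluster id and S3 paths are
placeholders), the pointers to completed checkpoints are kept under a znode
such as /flink/<cluster-id>/checkpoints, while the actual state stays in the
configured S3 directories.

    high-availability: zookeeper
    high-availability.zookeeper.quorum: zk-1:2181,zk-2:2181,zk-3:2181
    high-availability.zookeeper.path.root: /flink
    high-availability.cluster-id: /my-job-cluster
    high-availability.storageDir: s3://our-bucket/flink/ha
    state.checkpoints.dir: s3://our-bucket/flink/checkpoints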

Cheers,
Till

On Mon, Dec 21, 2020 at 11:38 AM vishalovercome <vi...@moengage.com> wrote:

> Thanks for your reply!
>
> What I have seen is that the job terminates when there's an intermittent
> loss of connectivity with ZooKeeper. This is in fact the most common
> reason why our jobs are terminating at this point. Worse, the job is
> unable to restore from a checkpoint during some (not all) of these
> terminations. Under these scenarios, won't the job try to recover from a
> savepoint?
>
> I've gone through the various tickets reporting stability issues due to
> ZooKeeper which you've mentioned you intend to resolve soon. But until the
> ZooKeeper-based HA is stable, should we assume that the job will
> repeatedly restore from savepoints? I would rather rely on Kafka offsets
> to resume from where the job left off than on savepoints.
>
>
>
>

Re: Will job manager restarts lead to repeated savepoint restoration?

Posted by vishalovercome <vi...@moengage.com>.
Thanks for your reply!

What I have seen is that the job terminates when there's an intermittent
loss of connectivity with ZooKeeper. This is in fact the most common reason
why our jobs are terminating at this point. Worse, the job is unable to
restore from a checkpoint during some (not all) of these terminations. Under
these scenarios, won't the job try to recover from a savepoint?

I've gone through the various tickets reporting stability issues due to
ZooKeeper which you've mentioned you intend to resolve soon. But until the
ZooKeeper-based HA is stable, should we assume that the job will repeatedly
restore from savepoints? I would rather rely on Kafka offsets to resume from
where the job left off than on savepoints.




Re: Will job manager restarts lead to repeated savepoint restoration?

Posted by Till Rohrmann <tr...@apache.org>.
Hi,

if you start a Flink job from a savepoint and the job needs to recover, then
it will only reuse the savepoint if no later checkpoint has been created.
Flink will always restore from the latest checkpoint/savepoint taken. For
example, if the job was started from a savepoint and has since completed a
checkpoint, a recovery will restore from that checkpoint and not from the
initial savepoint.

Cheers,
Till

On Wed, Dec 16, 2020 at 9:47 PM vishalovercome <vi...@moengage.com> wrote:

> My Flink job runs in Kubernetes. This is the setup:
>
> 1. One job running as a job cluster with one job manager
> 2. HA powered by ZooKeeper (works fine)
> 3. Job/Deployment manifests stored in GitHub and deployed to Kubernetes by
> Argo
> 4. State persisted to S3
>
> If I were to stop the job (drain and take a savepoint) and later resume
> it, I'd have to update the job manager manifest with the savepoint
> location, commit it to GitHub and redeploy. After deployment, I'd
> presumably have to modify the manifest again to remove the savepoint
> location so as to avoid restarting the application from the same
> savepoint. This raises some questions:
>
> 1. If the job manager were to crash before the manifest is updated again,
> won't Kubernetes restart the job manager from the savepoint rather than
> from the latest checkpoint?
> 2. Is there a way to ensure that restoration from a savepoint doesn't
> happen more than once, or at least not after the first successful
> checkpoint?
> 3. If even one checkpoint has been finalized, the job should prefer that
> checkpoint over the savepoint. Will that happen automatically given
> ZooKeeper?
> 4. Is it possible to avoid having to remove the savepoint path from the
> Kubernetes manifest and simply rely on newer checkpoints/savepoints? It
> feels rather clumsy to have to add it and then remove it manually. We
> could use a cron job to remove it, but that's still clumsy.
> 5. Is there a way of asking Flink to use the latest savepoint rather than
> specifying the location of a particular savepoint? If I were to manually
> rename the S3 savepoint location to something fixed
> (s3://fixed_savepoint_path_always), would there be any problem restoring
> the job?
> 6. Is there any open source tool that solves this problem?
>
>
>
>
>