Posted to dev@flink.apache.org by Kevin Lam <ke...@shopify.com.INVALID> on 2023/02/07 21:42:57 UTC

Flink Operator - Supporting Recovery from Snapshot

Hello,

I was reading the Flink Kubernetes Operator documentation and noticed that
if you want to redeploy a Flink job from a specific snapshot, you must
follow the manual recovery steps described there. Are there plans to
streamline this process? Deploying from a specific snapshot is a relatively
common operation, and it would be nice not to have to delete the
FlinkDeployment.

I wonder if the Flink Operator could treat initialSavepointPath similarly
to the restartNonce and savepointTriggerNonce parameters: if
initialSavepointPath changes, the deployed job is restored from the
specified savepoint. Any thoughts?
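
To make the idea concrete, here is a rough sketch of what I have in mind for
the FlinkDeployment spec (field names and placement are based on my reading of
the operator CRD; the name and savepoint path are just examples, and other
required fields like image and flinkVersion are omitted):

  apiVersion: flink.apache.org/v1beta1
  kind: FlinkDeployment
  metadata:
    name: my-job
  spec:
    restartNonce: 5                     # bump to force a plain restart
    job:
      jarURI: local:///opt/flink/usrlib/my-job.jar
      upgradeMode: savepoint
      savepointTriggerNonce: 3          # bump to trigger a new savepoint
      # Today this only applies on the first deployment; the proposal is that
      # changing it would restore the running job from the new path:
      initialSavepointPath: s3://my-bucket/savepoints/savepoint-abc123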

Thanks!

Re: Flink Operator - Supporting Recovery from Snapshot

Posted by Gyula Fóra <gy...@gmail.com>.
Hi All!

Based on continuous feedback and experience, we feel that it may be a good
time to introduce this functionality in a way that doesn't unexpectedly
affect existing users.

Please see: https://issues.apache.org/jira/browse/FLINK-33763 for details
and review.
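
As a rough sketch of how this could look in the spec once implemented (the
savepointRedeployNonce field name here is only for illustration; please refer
to the JIRA above for the actual design):

  spec:
    job:
      upgradeMode: savepoint
      # Example snapshot path to redeploy from:
      initialSavepointPath: s3://my-bucket/savepoints/savepoint-def456
      # Bumped together with the path above to trigger the redeploy:
      savepointRedeployNonce: 1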

Cheers,
Gyula


Re: Flink Operator - Supporting Recovery from Snapshot

Posted by Kevin Lam <ke...@shopify.com.INVALID>.
Hey Yaroslav!

Awesome, good to know that approach works well for you. I think our plan as
of now is to do the same: delete the current FlinkDeployment when deploying
from a specific snapshot. It'll be a separate workflow from normal
deployments, so we can still take advantage of the operator otherwise.

Thanks!


Re: Flink Operator - Supporting Recovery from Snapshot

Posted by Yaroslav Tkachenko <ya...@goldsky.com.INVALID>.
Hi Kevin!

In my case, I automated this workflow by first deleting the current Flink
deployment and then creating a new one. So, if the initialSavepointPath is
different, it'll be used for recovery.

This approach is indeed irreversible, but so far it's been working well.
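
Roughly, the workflow looks like this (a simplified sketch; resource names and
paths are placeholders, and only the relevant spec fields are shown):

  # 1) Delete the existing resource; the operator tears down the job and its
  #    HA metadata along with it:
  #      kubectl delete flinkdeployment my-job
  # 2) Re-apply the manifest, now pointing at the desired snapshot:
  #      kubectl apply -f my-job.yaml
  apiVersion: flink.apache.org/v1beta1
  kind: FlinkDeployment
  metadata:
    name: my-job
  spec:
    job:
      upgradeMode: savepoint
      initialSavepointPath: s3://my-bucket/savepoints/savepoint-xyz789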


Re: Flink Operator - Supporting Recovery from Snapshot

Posted by Kevin Lam <ke...@shopify.com.INVALID>.
Thanks for the response, Gyula! Those caveats make sense, and I see there's
a bit of complexity to consider if the feature is implemented. I do think
it would be useful, so I would also love to hear what others think!


Re: Flink Operator - Supporting Recovery from Snapshot

Posted by Gyula Fóra <gy...@gmail.com>.
Hi Kevin!

Thanks for starting this discussion.

At a high level, what you are proposing is quite simple: if the initial
savepoint path changes, we use it for the upgrade.

I see a few caveats here that may be important:

 1. To use a new savepoint/checkpoint path for recovery we have to stop the
job and delete all HA metadata. This means that this operation may not be
"reversible" in some cases, because we lose the checkpoint info along with
the HA metadata (unless we force a savepoint on shutdown).
 2. This will break the current upgrade/checkpoint ownership model, in which
the operator controls the checkpoints and ensures that you always get the
latest (or an error). It will also make the reconciliation logic more
complex.
 3. This could be a breaking change for current users (if for some reason
they rely on the current behaviour, which is unusual but still possible).
 4. The name initialSavepointPath becomes a bit misleading.

I agree that it would be nice to make this easier for the user, but the
question is whether what we gain by this is worth the extra complexity.
I think that under normal circumstances the user does not really want to
suddenly redeploy the job starting from a new state. If that happens, I
think it makes sense to create a new deployment resource, and that is not a
very big overhead.
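
For example, something along these lines (names and paths are placeholders)
keeps the current ownership model intact by deploying a new resource that
points at the snapshot instead of mutating the existing one:

  apiVersion: flink.apache.org/v1beta1
  kind: FlinkDeployment
  metadata:
    name: my-job-v2        # a new resource instead of changing my-job in place
  spec:
    job:
      upgradeMode: savepoint
      initialSavepointPath: s3://my-bucket/savepoints/savepoint-abc123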

Currently, the cases where "manual" recovery is needed are those where the
operator loses track of the latest checkpoint, mostly due to "incorrect"
error handling on the Flink side that also deletes the HA metadata. I think
we should strive to improve on and eliminate most of these cases (as we
have already done for many of these problems).

Would be great to hear what others think about this topic!

Cheers,
Gyula
