Posted to user@flink.apache.org by jonas eyob <jo...@gmail.com> on 2022/07/05 14:53:57 UTC

Best practice for creating/restoring savepoint in standalone k8 setup

Hi!

We are running a Standalone job on Kubernetes using application deployment
mode, with HA enabled.

We have attempted to automate how we create and restore savepoints by
running a script for generating a savepoint (using a k8s preStop hook) and
another one for restoring from a savepoint (located in an S3 bucket).

Restoring from a savepoint is typically not a problem once we have a
savepoint generated and accessible in our S3 bucket. The problem is
generating the savepoint, which hasn't been very reliable so far. The logs
are not particularly helpful either, so we wanted to rethink how we go
about taking savepoints.

Are there any best practices for doing this in a CI/CD manner given our
setup?

--

Re: Best practice for creating/restoring savepoint in standalone k8 setup

Posted by Gyula Fóra <gy...@gmail.com>.
Hi Jonas!

Generally, managed platforms have provided the functionality you are
after; otherwise it's mostly home-grown CI/CD integrations :)

The Kubernetes Operator is perhaps the first initiative to bring proper
application lifecycle management directly to the ecosystem.

Cheers,
Gyula

On Tue, Jul 5, 2022 at 6:45 PM jonas eyob <jo...@gmail.com> wrote:

Re: Best practice for creating/restoring savepoint in standalone k8 setup

Posted by jonas eyob <jo...@gmail.com>.
Thanks Weihua and Gyula,

@Weihua
> If you restart the Flink cluster by deleting and recreating the
deployment directly, it will automatically restore from the latest
checkpoint [1], so maybe just enabling checkpointing is enough.
Not sure I follow; we might have changes to the job that require us to
restore from a savepoint, where restoring from a checkpoint wouldn't be
possible due to significant changes to the JobGraph.

> But if you want to use savepoints, you need to check whether the latest
savepoint succeeded (checking for a _metadata file in the savepoint
directory works in most scenarios, but in some cases the _metadata may be
incomplete).

Yes, that is basically what our savepoint restore script does: it checks
S3 to see if we have any savepoints generated and passes the latest one
to the "--fromSavepoint" argument.

@Gyula

> Did you check the https://github.com/apache/flink-kubernetes-operator
by any chance?
Interesting, no, I had missed this! I will have a look, but it would also
be interesting to see how this was solved before the introduction of the
Flink operator.

On Tue, 5 Jul 2022 at 16:37, Gyula Fóra <gy...@gmail.com> wrote:

-- 
*Best regards*
*Jonas Eyob*

Re: Best practice for creating/restoring savepoint in standalone k8 setup

Posted by Gyula Fóra <gy...@gmail.com>.
Hi!

Did you check the https://github.com/apache/flink-kubernetes-operator by
any chance?

It provides many of the application lifecycle features that you are
probably after straight out of the box. Manual and periodic savepoint
triggering are also included in the latest upcoming version :)
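
As a rough illustration (image, jar path, and bucket are placeholders,
and exact fields may vary between operator versions), a FlinkDeployment
that manages savepoints on upgrade looks something like:

```yaml
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: my-job                                 # placeholder
spec:
  image: my-registry/my-flink-job:latest       # placeholder
  flinkVersion: v1_15
  flinkConfiguration:
    state.savepoints.dir: s3://my-bucket/savepoints
    state.checkpoints.dir: s3://my-bucket/checkpoints
  serviceAccount: flink
  jobManager:
    resource: {memory: "2048m", cpu: 1}
  taskManager:
    resource: {memory: "2048m", cpu: 1}
  job:
    jarURI: local:///opt/flink/usrlib/my-job.jar   # placeholder
    parallelism: 2
    upgradeMode: savepoint   # savepoint on spec change, restore from it
```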

Cheers,
Gyula

On Tue, Jul 5, 2022 at 5:34 PM Weihua Hu <hu...@gmail.com> wrote:

Re: Best practice for creating/restoring savepoint in standalone k8 setup

Posted by Weihua Hu <hu...@gmail.com>.
Hi, jonas

If you restart the Flink cluster by deleting and recreating the
deployment directly, it will automatically restore from the latest
checkpoint [1], so maybe just enabling checkpointing is enough.
But if you want to use savepoints, you need to check whether the latest
savepoint succeeded (checking for a _metadata file in the savepoint
directory works in most scenarios, but in some cases the _metadata may be
incomplete).
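
That check can be sketched as follows (the bucket layout and the aws CLI
listing are illustrative assumptions):

```shell
# Succeeds (exit 0) only if the directory listing on stdin contains a
# _metadata object, i.e. the savepoint was finalized; incomplete
# savepoints lack it.
savepoint_is_complete() {
  grep -q '_metadata$'
}

# Illustrative usage (bucket and savepoint id are placeholders):
#   aws s3 ls s3://my-bucket/savepoints/savepoint-abcdef-123456/ \
#     | savepoint_is_complete && echo "complete"
```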

[1]
https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/ha/kubernetes_ha/
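
For reference, that checkpoint-based restore assumes checkpointing and
Kubernetes HA are configured, roughly like this in flink-conf.yaml
(paths, interval, and cluster id are illustrative):

```yaml
execution.checkpointing.interval: 1min
state.checkpoints.dir: s3://my-bucket/checkpoints
state.savepoints.dir: s3://my-bucket/savepoints
high-availability: kubernetes
high-availability.storageDir: s3://my-bucket/flink-ha
kubernetes.cluster-id: my-flink-cluster
```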

Best,
Weihua


On Tue, Jul 5, 2022 at 10:54 PM jonas eyob <jo...@gmail.com> wrote:
