Posted to user@flink.apache.org by Peter Westermann <no...@genesys.com> on 2022/06/16 13:53:48 UTC

Sporadic issues with savepoint status lookup in Flink 1.15

We recently upgraded one of our Flink clusters to version 1.15.0 and are now seeing sporadic issues when stopping a job with a savepoint via the REST API. This happens for /jobs/:jobid/savepoints and /jobs/:jobid/stop:
The job finishes with a savepoint but the triggerId returned from the REST API seems to be invalid. Any lookups via /jobs/:jobid/savepoints/:triggerid fail with a 404 and the following error:

org.apache.flink.runtime.rest.handler.RestHandlerException: There is no savepoint operation with triggerId=cee5054245598efb42245b3046a6ae75 for job 0995a9461f0178294ea71c9accbe750c


Peter Westermann
Analytics Software Architect
peter.westermann@genesys.com
http://www.genesys.com/

Re: Sporadic issues with savepoint status lookup in Flink 1.15

Posted by Chesnay Schepler <ch...@apache.org>.

Are there any log messages from the CompletedOperationCache in the logs?

On 16/06/2022 16:54, Chesnay Schepler wrote:
> There is an expected case where this might happen:
> if too much time has elapsed since the savepoint was completed 
> (default 5 minutes; controlled by rest.async.store-duration)
>
> Did this happen earlier than that?

Re: Sporadic issues with savepoint status lookup in Flink 1.15

Posted by Chesnay Schepler <ch...@apache.org>.

We made several changes to the savepoint REST API backend, and 
something may have snuck in.
The odd thing is that you only see the issue for stop-with-savepoint 
operations, which are handled internally the same way as savepoints.


Re: Sporadic issues with savepoint status lookup in Flink 1.15

Posted by Peter Westermann <no...@genesys.com>.

We run a standalone Flink cluster in session mode (but we usually only run one job per cluster; session mode just fits better with our deployment workflow than application mode).
We trigger hourly savepoints and also use savepoints to stop a job and then restart with a new version of the jar.
I haven’t seen any issue with the hourly savepoints (without stopping the job). For these, I can see messages such as "Evicted result with trigger id 30f9457373eba7b9de1bdeaf591a6956 because its TTL of 300s has expired." about 5 minutes after savepoint completion.

When the stop-with-savepoint status lookup fails with "Exception occurred in REST handler: There is no savepoint operation with triggerId=cee5054245598efb42245b3046a6ae75", I still see "Evicted result with trigger id cee5054245598efb42245b3046a6ae75 because its TTL of 300s has expired." about 5 minutes after savepoint completion.
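
To cross-check which trigger IDs the cache actually knew about, the eviction messages can be pulled out of the JobManager log. This is only a rough sketch: the pattern is built from the log line quoted above and may not match other Flink versions, and the function name is made up for illustration.

```python
import re

# Pattern derived from the eviction message quoted in this thread;
# other Flink versions may word the message differently.
EVICTION = re.compile(
    r"Evicted result with trigger id (?P<trigger_id>[0-9a-f]{32}) "
    r"because its TTL of (?P<ttl>\d+)s has expired"
)


def evicted_trigger_ids(log_lines):
    """Map each evicted trigger id to the TTL (in seconds) reported in the log."""
    found = {}
    for line in log_lines:
        match = EVICTION.search(line)
        if match:
            found[match.group("trigger_id")] = int(match.group("ttl"))
    return found
```

A trigger ID that shows up here but still returned a 404 on lookup was stored in the cache at some point, which is the odd case described above.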

The documentation <https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/ops/rest_api/#api> for Flink 1.15 mentions a new feature:
"For (stop-with-)savepoint operations you can control this triggerId by setting it in the body of the request that triggers the operation. This allow you to safely* retry such operations without triggering multiple savepoints."

Could this have anything to do with the error I am seeing?
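
For reference, a request body that sets the triggerId explicitly might be built roughly as below. This is a sketch under assumptions, not something we run today: the field names targetDirectory, drain and triggerId are my reading of the 1.15 REST docs for /jobs/:jobid/stop, the 32-character hex format mirrors the IDs seen in our logs, and the helper name is made up.

```python
import json
import uuid


def stop_with_savepoint_body(target_directory, trigger_id=None, drain=False):
    """Build a JSON body for POST /jobs/:jobid/stop with a caller-chosen triggerId."""
    # Assumption: a 32-character hex string (like the IDs in the logs) is accepted.
    if trigger_id is None:
        trigger_id = uuid.uuid4().hex
    # Assumption: these field names match the 1.15 REST API request schema.
    body = {
        "targetDirectory": target_directory,
        "drain": drain,
        "triggerId": trigger_id,
    }
    return trigger_id, json.dumps(body)
```

Retrying the POST with the same body would then reuse the same triggerId instead of starting a second savepoint, if the feature works as the documentation describes.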




Re: Sporadic issues with savepoint status lookup in Flink 1.15

Posted by Chesnay Schepler <ch...@apache.org>.

OK, that shouldn't happen. I couldn't find anything wrong in the code so 
far; I will continue trying to reproduce it.

If this happens, does it persist indefinitely for a particular 
triggerId, or does it reappear later on again?
Are you only ever triggering a single savepoint for a given job?

Are you using session or application clusters?


Re: Sporadic issues with savepoint status lookup in Flink 1.15

Posted by Peter Westermann <no...@genesys.com>.

If it happens, it happens immediately. Once we receive the triggerId from /jobs/:jobid/stop or /jobs/:jobid/savepoints, we poll /jobs/:jobid/savepoints/:triggerid every second until the status is no longer IN_PROGRESS.
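
The polling described above looks roughly like the following sketch. It is not our production code: the function names are made up, the base URL in the usage note is a placeholder, and the response shape ({"status": {"id": ...}}) is assumed from the REST documentation.

```python
import json
import time
import urllib.error
import urllib.request


def fetch_status(base_url, job_id, trigger_id):
    """GET /jobs/:jobid/savepoints/:triggerid; return (http_status, json_or_None)."""
    url = f"{base_url}/jobs/{job_id}/savepoints/{trigger_id}"
    try:
        with urllib.request.urlopen(url) as resp:
            return resp.status, json.load(resp)
    except urllib.error.HTTPError as err:
        return err.code, None


def wait_for_savepoint(fetch, interval_s=1.0, max_polls=600, sleep=time.sleep):
    """Call fetch() until the status is no longer IN_PROGRESS.

    Raises LookupError on a 404, which is the failure reported in this thread.
    """
    for _ in range(max_polls):
        code, payload = fetch()
        if code == 404:
            raise LookupError("REST endpoint does not know this triggerId")
        if code != 200 or payload is None:
            raise RuntimeError(f"unexpected HTTP status {code}")
        # Assumed response shape: {"status": {"id": "IN_PROGRESS" | "COMPLETED"}, ...}
        if payload["status"]["id"] != "IN_PROGRESS":
            return payload
        sleep(interval_s)
    raise TimeoutError("savepoint still IN_PROGRESS after max_polls attempts")
```

Usage would be something like wait_for_savepoint(lambda: fetch_status("http://localhost:8081", job_id, trigger_id)), with the base URL replaced by the JobManager's REST address. When the issue occurs, the very first poll already returns the 404.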


Re: Sporadic issues with savepoint status lookup in Flink 1.15

Posted by Chesnay Schepler <ch...@apache.org>.

There is an expected case where this might happen:
if too much time has elapsed since the savepoint was completed (default 
5 minutes; controlled by rest.async.store-duration)

Did this happen earlier than that?
