Posted to dev@flink.apache.org by Fabian Paul <fa...@data-artisans.com> on 2020/08/07 13:57:53 UTC

[DISCUSS] Retrieve savepoint location after suspension of jobclusters

Hi all,

Due to recent changes in the shutdown mechanism of Flink [1], it is no longer
conveniently possible to suspend a job running on a jobcluster
with a savepoint and retrieve the savepoint location via the Flink API
programmatically.

With these changes, the REST endpoint shuts down immediately
and rejects new requests, which makes the information inaccessible.

Before the changes it was possible to stop the job and query the savepoint 
info endpoint until the location was shown.
Admittedly, this was never a safe solution because it relied on the
REST endpoint staying alive long enough.
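
For illustration, the polling approach described above could look roughly like
the following Python sketch. It is not taken from this thread: the function
name, the retry parameters, and the injected `fetch_status` callable are all
hypothetical, and the response shape (`status.id`, `operation.location`,
`operation.failure-cause`) is assumed to match the savepoint status response
of the Flink REST API.

```python
import time
from typing import Callable, Optional


def poll_savepoint_location(
    fetch_status: Callable[[], dict],
    attempts: int = 30,
    delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Poll the savepoint status until it is COMPLETED and return the location.

    Returns None if the endpoint becomes unreachable or the attempts run out.
    `fetch_status` is a hypothetical callable that performs the GET request
    and returns the decoded JSON body.
    """
    for _ in range(attempts):
        try:
            body = fetch_status()
        except OSError:
            # The REST endpoint is already gone (connection refused, etc.),
            # so the location can no longer be retrieved this way.
            return None
        if body.get("status", {}).get("id") == "COMPLETED":
            operation = body.get("operation", {})
            if "failure-cause" in operation:
                raise RuntimeError(f"savepoint failed: {operation['failure-cause']}")
            return operation.get("location")
        sleep(delay_s)
    return None
```

The sketch makes the weakness visible: as soon as the endpoint stops accepting
connections, the caller gets nothing back, even if the savepoint itself
succeeded.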

I would like to see what the community thinks about this and whether it is
worth implementing a different solution to retrieve this information.

Best,
Fabian
[1] https://issues.apache.org/jira/browse/FLINK-18663

Re: [DISCUSS] Retrieve savepoint location after suspension of jobclusters

Posted by Till Rohrmann <tr...@apache.org>.

Thanks for the logs Fabian. It is indeed a problem we introduced recently.
I've created a JIRA issue to fix the problem [1]. This fix will also be
included in the Flink 1.10.2 release.

[1] https://issues.apache.org/jira/browse/FLINK-18902

Cheers,
Till

On Wed, Aug 12, 2020 at 2:30 PM Fabian Paul <fa...@data-artisans.com>
wrote:

> I attached the last log lines [1] of the jobmanager after triggering the
> savepoint. I just saw that the release for 1.10.2 has started, so it would
> probably be great if we determined whether this is a bug and postponed the
> release if necessary.
> What do you think?
>
> Best,
> Fabian
>
> [1] https://pastebin.com/eWXN5fzS
>

Re: [DISCUSS] Retrieve savepoint location after suspension of jobclusters

Posted by Fabian Paul <fa...@data-artisans.com>.

I attached the last log lines [1] of the jobmanager after triggering the savepoint. I just
saw that the release for 1.10.2 has started, so it would probably be great if we determined
whether this is a bug and postponed the release if necessary.
What do you think?

Best,
Fabian

[1] https://pastebin.com/eWXN5fzS

Re: [DISCUSS] Retrieve savepoint location after suspension of jobclusters

Posted by Till Rohrmann <tr...@apache.org>.

This sounds like a bug in Flink. Could you share the logs of the cluster
(ideally with TRACE log level) with us?

Cheers,
Till

On Tue, Aug 11, 2020 at 9:49 AM Fabian Paul <fa...@data-artisans.com>
wrote:

> Hi Till,
>
> The problem is reproducible with a basic shell script doing the following
> operations.
>
> 1. POST request to /jobs/${JOB_ID}/savepoints with the payload
>          {"cancel-job": true, "target-directory": "${LOCATION}"}
>         and store the trigger ID
>
> 2. Sleep 10 seconds
>
> 3. GET /jobs/${JOB_ID}/savepoints/${TRIGGER_ID}
>         results in a connect exception because the REST endpoint is shut down.
>
> Sorry if I misunderstood your previous answer, but I would expect that
> stopping the job with a savepoint is an asynchronous operation and should
> block the shutdown until the result is served.
> I can also confirm that the cluster is not shut down but the REST endpoint
> is, which makes it impossible to serve the asynchronous result.
>
> Best,
> Fabian
>
>

Re: [DISCUSS] Retrieve savepoint location after suspension of jobclusters

Posted by Fabian Paul <fa...@data-artisans.com>.

Hi Till,

The problem is reproducible with a basic shell script doing the following operations.

1. POST request to /jobs/${JOB_ID}/savepoints with the payload
	 {"cancel-job": true, "target-directory": "${LOCATION}"}
	and store the trigger ID

2. Sleep 10 seconds

3. GET /jobs/${JOB_ID}/savepoints/${TRIGGER_ID}
	results in a connect exception because the REST endpoint is shut down.
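
The three steps above could be scripted roughly as follows. This is a Python
sketch rather than the shell script mentioned in this message, which was not
posted. The base URL, the function names, and the assumption that the trigger
response carries the ID under the `request-id` key are illustrative and should
be checked against the REST API of the Flink version in use.

```python
import json
import time
import urllib.request


def savepoint_trigger_request(base_url: str, job_id: str, target_directory: str):
    """Build the POST request for step 1 (trigger a savepoint and cancel the job)."""
    payload = {"cancel-job": True, "target-directory": target_directory}
    return urllib.request.Request(
        f"{base_url}/jobs/{job_id}/savepoints",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def savepoint_status_url(base_url: str, job_id: str, trigger_id: str) -> str:
    """Build the URL queried in step 3."""
    return f"{base_url}/jobs/{job_id}/savepoints/{trigger_id}"


def reproduce(base_url: str, job_id: str, target_directory: str, wait_s: float = 10.0) -> dict:
    """Run steps 1 to 3 against a live cluster and return the status body."""
    # Step 1: trigger the savepoint with cancel-job and store the trigger ID.
    request = savepoint_trigger_request(base_url, job_id, target_directory)
    with urllib.request.urlopen(request) as response:
        trigger_id = json.load(response)["request-id"]  # assumed response key

    # Step 2: give the savepoint some time to complete.
    time.sleep(wait_s)

    # Step 3: ask for the result. With the behaviour described in this thread,
    # this call raises urllib.error.URLError because the endpoint is gone.
    with urllib.request.urlopen(savepoint_status_url(base_url, job_id, trigger_id)) as response:
        return json.load(response)
```

A call such as `reproduce("http://localhost:8081", job_id, "/tmp/savepoints")`
would exercise the sequence, with the host, port, and directory being
placeholders.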

Sorry if I misunderstood your previous answer, but I would expect that stopping the job
with a savepoint is an asynchronous operation and should block the shutdown until
the result is served.
I can also confirm that the cluster is not shut down but the REST endpoint is, which makes
it impossible to serve the asynchronous result.

Best,
Fabian

Re: [DISCUSS] Retrieve savepoint location after suspension of jobclusters

Posted by Till Rohrmann <tr...@apache.org>.

Hi Fabian,

could you explain a bit how you are cancelling a job with a savepoint and then
trying to retrieve the savepoint path?

When running Flink in per-job mode, the system should not shut down if you
have an asynchronous operation running whose result you have not yet
queried. I believe that this feature was introduced with FLINK-10309 [1].
The semantics is that Flink waits 5 minutes or until the result has been
queried (by any client) [2]. If this is not working, then this is clearly a
bug.
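
To make the intended semantics concrete, here is a toy Python model of the
"wait 5 minutes or until the result has been queried" rule. It is a simplified
illustration only and not the code of CompletedOperationCache; the class name,
method names, and the injectable clock are hypothetical.

```python
import time
from typing import Callable, Dict


class CompletedResultTracker:
    """Toy model: shutdown may proceed once every completed result has either
    been queried by a client or has lingered for `linger_s` seconds."""

    def __init__(self, linger_s: float = 300.0,
                 clock: Callable[[], float] = time.monotonic):
        self._linger_s = linger_s
        self._clock = clock
        # Maps an operation key to the time at which its result completed.
        self._unqueried: Dict[str, float] = {}

    def complete(self, key: str) -> None:
        """Record that an asynchronous operation has produced its result."""
        self._unqueried[key] = self._clock()

    def mark_queried(self, key: str) -> None:
        """Record that some client has fetched the result."""
        self._unqueried.pop(key, None)

    def may_shut_down(self) -> bool:
        """True if no unqueried result is still inside its linger window."""
        now = self._clock()
        return all(now - done_at >= self._linger_s
                   for done_at in self._unqueried.values())
```

Under this model, a savepoint result that nobody has fetched keeps
`may_shut_down()` false for up to five minutes, which is the behaviour a
client polling for the savepoint location would depend on.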

FLINK-18663 [3] solved a bug where the cluster would hang while trying to
shut it down. This was also a bug obviously.

[1] https://issues.apache.org/jira/browse/FLINK-10309
[2] https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/rest/handler/async/CompletedOperationCache.java#L141
[3] https://issues.apache.org/jira/browse/FLINK-18663

Cheers,
Till

On Fri, Aug 7, 2020 at 5:58 PM Eleanore Jin <el...@gmail.com> wrote:

> +1 Thank you Fabian!

Re: [DISCUSS] Retrieve savepoint location after suspension of jobclusters

Posted by Eleanore Jin <el...@gmail.com>.

+1 Thank you Fabian!

On Fri, Aug 7, 2020 at 6:58 AM Fabian Paul <fa...@data-artisans.com>
wrote:

> Hi all,
>
> Due to recent changes in the shutdown mechanism of Flink [1], it is no longer
> conveniently possible to suspend a job running on a jobcluster
> with a savepoint and retrieve the savepoint location via the Flink API
> programmatically.
>
> With these changes, the REST endpoint shuts down immediately
> and rejects new requests, which makes the information inaccessible.
>
> Before the changes it was possible to stop the job and query the savepoint
> info endpoint until the location was shown.
> Admittedly, this was never a safe solution because it relied on the
> REST endpoint staying alive long enough.
>
> I would like to see what the community thinks about this and whether it is
> worth implementing a different solution to retrieve this information.
>
> Best,
> Fabian
> [1] https://issues.apache.org/jira/browse/FLINK-18663
>