You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@flink.apache.org by Ivan Yang <iv...@gmail.com> on 2020/07/24 02:02:29 UTC

Flink 1.11 job stop with save point timeout error

Hello everyone,

We recently upgrade FLINK from 1.9.1 to 1.11.0. Found one strange behavior when we stop a job to a save point got following time out error.
I checked Flink web console, the save point is created in s3 in 1 second.The job is fairly simple, so 1 second for savepoint generation is expected. We use kubernetes deployment. I clocked it, it’s about 60 seconds when it returns this error. So afterwards, the job is hanging (it still says running, but actually not doing anything). I need run another command to cancel it. Anyone has idea what’s going on here? BTW, “flink stop works” in 1.19.1 for us before



flink@flink-jobmanager-fcf5d84c5-sz4wk:~$ flink stop 88d9b46f59d131428e2a18c9c7b3aa3f
Suspending job "88d9b46f59d131428e2a18c9c7b3aa3f" with a savepoint.

------------------------------------------------------------
 The program finished with the following exception:

org.apache.flink.util.FlinkException: Could not stop with a savepoint job "88d9b46f59d131428e2a18c9c7b3aa3f".
	at org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:495)
	at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:864)
	at org.apache.flink.client.cli.CliFrontend.stop(CliFrontend.java:487)
	at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:931)
	at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:992)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1682)
	at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
	at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:992)
Caused by: java.util.concurrent.TimeoutException
	at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
	at org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:493)
	... 9 more
flink@flink-jobmanager-fcf5d84c5-sz4wk:~$ 


Thanks in advance,
Ivan

Re: Flink 1.11 job stop with save point timeout error

Posted by Congxian Qiu <qc...@gmail.com>.

Hi Ivan
   From the JM log, the savepoint complete with 1 second, and the timeout
exception said that the stop-with-savepoint can not be completed in
60s(this was calculated by 20 -- RestOptions#RETRAY_MAX_ATTEMPTS * 3s --
RestOptions#RETRY_DELAY. you can check the logic here[1]). I'm not sure
what the root cause is currently, could you please share the complete job
JM log. thanks.

[1]
https://github.com/apache/flink/blob/abd58adb7aad54d242b67219498c211e9e18168b/flink-clients/src/main/java/org/apache/flink/client/program/rest/RestClusterClient.java#L382
Best,
Congxian


Ivan Yang <iv...@gmail.com> 于2020年7月25日周六 上午3:48写道：

> Hi Robert,
> Below is the job manager log after issuing the “flink stop” command
>
> ====================
> 2020-07-24 19:24:12,388 INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] -
> Triggering checkpoint 1 (type=CHECKPOINT) @ 1595618652138 for job
> 853c59916ac33dfbf46503b33289929e.
> 2020-07-24 19:24:13,914 INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Completed
> checkpoint 1 for job 853c59916ac33dfbf46503b33289929e (7146 bytes in 1774
> ms).
> 2020-07-24 19:27:59,299 INFO  org.apache.flink.runtime.jobmaster.JobMaster
>                 [] - Triggering stop-with-savepoint for job
> 853c59916ac33dfbf46503b33289929e.
> 2020-07-24 19:27:59,655 INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] -
> Triggering checkpoint 2 (type=SYNC_SAVEPOINT) @ 1595618879302 for job
> 853c59916ac33dfbf46503b33289929e.
> 2020-07-24 19:28:00,962 INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Completed
> checkpoint 2 for job 853c59916ac33dfbf46503b33289929e (7147 bytes in 1240
> ms).
> ======================
>
> It looks normal to me.
>
> In the kubernetes deployment cluster, we set up a metric reporter, it has
> these keys in the flink-config.yaml
>
> # Metrics Reporter Configuration
> metrics.reporters: wavefront
> metrics.reporter.wavefront.interval: 60 SECONDS
> metrics.reporter.wavefront.env: prod
> metrics.reporter.wavefront.class:
> com.xxxxxxxxx.flink.monitor.WavefrontReporter
> metrics.reporter.wavefront.host: xxxxxx
> metrics.reporter.wavefront.token: xxxxxxxxxx
> metrics.scope.tm: flink.taskmanager
>
> Could this reporter interval interfere the job manager? I test the same
> job in a standalone
> Flink 1.11.0 without the reporter, Flink stop worked, and no hanging nor
> timeout. Also the same reporter is used in 1.9.1 version where we didn’t
> have issue on “flink stop”.
>
> Thanks
> Ivan
>
>
> On Jul 24, 2020, at 5:15 AM, Robert Metzger <rm...@apache.org> wrote:
>
> Hi Ivan,
> thanks a lot for your message. Can you post the JobManager log here as
> well? It might contain additional information on the reason for the timeout.
>
> On Fri, Jul 24, 2020 at 4:03 AM Ivan Yang <iv...@gmail.com> wrote:
>
>> Hello everyone,
>>
>> We recently upgrade FLINK from 1.9.1 to 1.11.0. Found one strange
>> behavior when we stop a job to a save point got following time out error.
>> I checked Flink web console, the save point is created in s3 in 1
>> second.The job is fairly simple, so 1 second for savepoint generation is
>> expected. We use kubernetes deployment. I clocked it, it’s about 60 seconds
>> when it returns this error. So afterwards, the job is hanging (it still
>> says running, but actually not doing anything). I need run another command
>> to cancel it. Anyone has idea what’s going on here? BTW, “flink stop works”
>> in 1.19.1 for us before
>>
>>
>>
>> flink@flink-jobmanager-fcf5d84c5-sz4wk:~$ flink stop
>> 88d9b46f59d131428e2a18c9c7b3aa3f
>> Suspending job "88d9b46f59d131428e2a18c9c7b3aa3f" with a savepoint.
>>
>> ------------------------------------------------------------
>>  The program finished with the following exception:
>>
>> org.apache.flink.util.FlinkException: Could not stop with a savepoint job
>> "88d9b46f59d131428e2a18c9c7b3aa3f".
>> at
>> org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:495)
>> at
>> org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:864)
>> at org.apache.flink.client.cli.CliFrontend.stop(CliFrontend.java:487)
>> at
>> org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:931)
>> at
>> org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:992)
>> at java.security.AccessController.doPrivileged(Native Method)
>> at javax.security.auth.Subject.doAs(Subject.java:422)
>> at
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1682)
>> at
>> org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>> at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:992)
>> Caused by: java.util.concurrent.TimeoutException
>> at
>> java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
>> at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
>> at
>> org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:493)
>> ... 9 more
>> flink@flink-jobmanager-fcf5d84c5-sz4wk:~$
>>
>>
>> Thanks in advance,
>> Ivan
>>
>
>

Re: Flink 1.11 job stop with save point timeout error

Posted by Ivan Yang <iv...@gmail.com>.

Hi Robert,
Below is the job manager log after issuing the “flink stop” command

====================
2020-07-24 19:24:12,388 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering checkpoint 1 (type=CHECKPOINT) @ 1595618652138 for job 853c59916ac33dfbf46503b33289929e.
2020-07-24 19:24:13,914 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Completed checkpoint 1 for job 853c59916ac33dfbf46503b33289929e (7146 bytes in 1774 ms).
2020-07-24 19:27:59,299 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Triggering stop-with-savepoint for job 853c59916ac33dfbf46503b33289929e.
2020-07-24 19:27:59,655 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering checkpoint 2 (type=SYNC_SAVEPOINT) @ 1595618879302 for job 853c59916ac33dfbf46503b33289929e.
2020-07-24 19:28:00,962 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Completed checkpoint 2 for job 853c59916ac33dfbf46503b33289929e (7147 bytes in 1240 ms).
======================

It looks normal to me. 

In the kubernetes deployment cluster, we set up a metric reporter, it has these keys in the flink-config.yaml

# Metrics Reporter Configuration
metrics.reporters: wavefront
metrics.reporter.wavefront.interval: 60 SECONDS
metrics.reporter.wavefront.env: prod
metrics.reporter.wavefront.class: com.xxxxxxxxx.flink.monitor.WavefrontReporter
metrics.reporter.wavefront.host: xxxxxx
metrics.reporter.wavefront.token: xxxxxxxxxx
metrics.scope.tm: flink.taskmanager

Could this reporter interval interfere the job manager? I test the same job in a standalone 
Flink 1.11.0 without the reporter, Flink stop worked, and no hanging nor timeout. Also the same reporter is used in 1.9.1 version where we didn’t have issue on “flink stop”.

Thanks 
Ivan


> On Jul 24, 2020, at 5:15 AM, Robert Metzger <rm...@apache.org> wrote:
> 
> Hi Ivan,
> thanks a lot for your message. Can you post the JobManager log here as well? It might contain additional information on the reason for the timeout.
> 
> On Fri, Jul 24, 2020 at 4:03 AM Ivan Yang <ivanygyang@gmail.com <ma...@gmail.com>> wrote:
> Hello everyone,
> 
> We recently upgrade FLINK from 1.9.1 to 1.11.0. Found one strange behavior when we stop a job to a save point got following time out error.
> I checked Flink web console, the save point is created in s3 in 1 second.The job is fairly simple, so 1 second for savepoint generation is expected. We use kubernetes deployment. I clocked it, it’s about 60 seconds when it returns this error. So afterwards, the job is hanging (it still says running, but actually not doing anything). I need run another command to cancel it. Anyone has idea what’s going on here? BTW, “flink stop works” in 1.19.1 for us before
> 
> 
> 
> flink@flink-jobmanager-fcf5d84c5-sz4wk:~$ flink stop 88d9b46f59d131428e2a18c9c7b3aa3f
> Suspending job "88d9b46f59d131428e2a18c9c7b3aa3f" with a savepoint.
> 
> ------------------------------------------------------------
>  The program finished with the following exception:
> 
> org.apache.flink.util.FlinkException: Could not stop with a savepoint job "88d9b46f59d131428e2a18c9c7b3aa3f".
> 	at org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:495)
> 	at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:864)
> 	at org.apache.flink.client.cli.CliFrontend.stop(CliFrontend.java:487)
> 	at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:931)
> 	at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:992)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:422)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1682)
> 	at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
> 	at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:992)
> Caused by: java.util.concurrent.TimeoutException
> 	at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
> 	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
> 	at org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:493)
> 	... 9 more
> flink@flink-jobmanager-fcf5d84c5-sz4wk:~$ 
> 
> 
> Thanks in advance,
> Ivan

Re: Flink 1.11 job stop with save point timeout error

Posted by Robert Metzger <rm...@apache.org>.

Hi Ivan,
thanks a lot for your message. Can you post the JobManager log here as
well? It might contain additional information on the reason for the timeout.

On Fri, Jul 24, 2020 at 4:03 AM Ivan Yang <iv...@gmail.com> wrote:

> Hello everyone,
>
> We recently upgrade FLINK from 1.9.1 to 1.11.0. Found one strange behavior
> when we stop a job to a save point got following time out error.
> I checked Flink web console, the save point is created in s3 in 1
> second.The job is fairly simple, so 1 second for savepoint generation is
> expected. We use kubernetes deployment. I clocked it, it’s about 60 seconds
> when it returns this error. So afterwards, the job is hanging (it still
> says running, but actually not doing anything). I need run another command
> to cancel it. Anyone has idea what’s going on here? BTW, “flink stop works”
> in 1.19.1 for us before
>
>
>
> flink@flink-jobmanager-fcf5d84c5-sz4wk:~$ flink stop
> 88d9b46f59d131428e2a18c9c7b3aa3f
> Suspending job "88d9b46f59d131428e2a18c9c7b3aa3f" with a savepoint.
>
> ------------------------------------------------------------
>  The program finished with the following exception:
>
> org.apache.flink.util.FlinkException: Could not stop with a savepoint job
> "88d9b46f59d131428e2a18c9c7b3aa3f".
> at
> org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:495)
> at
> org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:864)
> at org.apache.flink.client.cli.CliFrontend.stop(CliFrontend.java:487)
> at
> org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:931)
> at
> org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:992)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1682)
> at
> org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
> at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:992)
> Caused by: java.util.concurrent.TimeoutException
> at
> java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
> at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
> at
> org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:493)
> ... 9 more
> flink@flink-jobmanager-fcf5d84c5-sz4wk:~$
>
>
> Thanks in advance,
> Ivan
>