Posted to user@flink.apache.org by Diwakar Jha <di...@gmail.com> on 2022/03/25 04:04:30 UTC

Flink STOP with savepoint

Hello Everyone,

I'm running Flink 1.11 on EMR 6.1 as a YARN application. I'm trying to use the
STOP command to take a savepoint and restart the job from that savepoint
during redeployment.

flink stop -p $JOB_RUNNING -yid $YARN_APP_ID


Problem:
The job completes the savepoint according to the Flink UI, but the CLI throws
the following error, so I'm not able to capture the savepoint and redeploy
the application.

SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Suspending job "72a4746f2b6f1d58e4c61c1e9214a7e3" with a savepoint.
2022-03-24 21:30:10,850 INFO org.apache.hadoop.yarn.client.RMProxy [] - Connecting to ResourceManager at ip-10-0-36-99.ec2.internal/10.0.36.99:8032
2022-03-24 21:30:11,032 INFO org.apache.hadoop.yarn.client.AHSProxy [] - Connecting to Application History server at ip-10-0-36-99.ec2.internal/10.0.36.99:10200
2022-03-24 21:30:11,044 INFO org.apache.flink.yarn.YarnClusterDescriptor [] - No path for the flink jar passed. Using the location of class org.apache.flink.yarn.YarnClusterDescriptor to locate the jar
2022-03-24 21:30:11,142 INFO org.apache.flink.yarn.YarnClusterDescriptor [] - Found Web Interface ip-10-0-32-110.ec2.internal:46821 of application 'application_1647995456636_0001'.


The program finished with the following exception:

org.apache.flink.util.FlinkException: Could not stop with a savepoint job "72a4746f2b6f1d58e4c61c1e9214a7e3".
at org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:495)
at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:864)
at org.apache.flink.client.cli.CliFrontend.stop(CliFrontend.java:487)
at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:931)
at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:992)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:992)
Caused by: java.util.concurrent.TimeoutException
at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
at org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:493)
... 9 more

If I use CANCEL instead of STOP, it always works, but since CANCEL doesn't
give a graceful shutdown, I'm trying to use STOP.

Could someone please suggest how to fix this error?

Re: Flink STOP with savepoint

Posted by Zhanghao Chen <zh...@outlook.com>.
Hi Diwakar,

The client log doesn't contain much useful info except that the operation timed out. You could try:

  1.  Check the JM log to see if there is any relevant info.
  2.  Increase the client timeout to see if that helps.
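
Both suggestions can be sketched in shell. This is a rough sketch under assumptions, not a verified fix: it assumes that in Flink 1.11 the time the CLI waits for the stop-with-savepoint result is governed by the akka.client.timeout setting (60 s by default), and that the client picks up the flink-conf.yaml you edit. The helper names set_client_timeout and filter_savepoint_lines and the value "600 s" are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and example values are illustrative, not from the thread.

# Set (or replace) akka.client.timeout in the given flink-conf.yaml.
# Assumption: in Flink 1.11 this key bounds how long `flink stop` waits
# on the client side for the savepoint to complete.
set_client_timeout() {
  local conf_file="$1" timeout="$2"
  local tmp
  tmp="$(mktemp)"
  # Drop any existing entry for the key, then append the new value.
  grep -v '^akka\.client\.timeout:' "$conf_file" > "$tmp" || true
  printf 'akka.client.timeout: %s\n' "$timeout" >> "$tmp"
  cat "$tmp" > "$conf_file"
  rm -f "$tmp"
}

# Keep only log lines that mention savepoints or checkpoints (reads stdin).
filter_savepoint_lines() {
  grep -iE 'savepoint|checkpoint' || true
}

# Example usage (commented out so that sourcing this file has no side effects):
#   set_client_timeout "$FLINK_CONF_DIR/flink-conf.yaml" "600 s"
#   yarn logs -applicationId "$YARN_APP_ID" | filter_savepoint_lines
```

If a longer timeout makes the CLI return normally, the savepoint was most likely just taking longer than the client was willing to wait; if not, the filtered JM log lines are the next place to look.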

Best,
Zhanghao Chen