Posted to user@flink.apache.org by Diwakar Jha <di...@gmail.com> on 2022/03/25 04:04:30 UTC
Flink STOP with savepoint
Hello Everyone,
I'm running Flink 1.11 on EMR 6.1 as a YARN application. I'm trying to use the
STOP command to take a savepoint and then restart the job from that savepoint
during redeployment.
flink stop -p $JOB_RUNNING -yid $YARN_APP_ID
Problem:
The savepoint completes according to the Flink UI, but the CLI throws the
following error, so I'm not able to capture the savepoint and redeploy the
application.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Suspending job "72a4746f2b6f1d58e4c61c1e9214a7e3" with a savepoint.
2022-03-24 21:30:10,850 INFO org.apache.hadoop.yarn.client.RMProxy [] - Connecting to ResourceManager at ip-10-0-36-99.ec2.internal/10.0.36.99:8032
2022-03-24 21:30:11,032 INFO org.apache.hadoop.yarn.client.AHSProxy [] - Connecting to Application History server at ip-10-0-36-99.ec2.internal/10.0.36.99:10200
2022-03-24 21:30:11,044 INFO org.apache.flink.yarn.YarnClusterDescriptor [] - No path for the flink jar passed. Using the location of class org.apache.flink.yarn.YarnClusterDescriptor to locate the jar
2022-03-24 21:30:11,142 INFO org.apache.flink.yarn.YarnClusterDescriptor [] - Found Web Interface ip-10-0-32-110.ec2.internal:46821 of application 'application_1647995456636_0001'.
The program finished with the following exception:
org.apache.flink.util.FlinkException: Could not stop with a savepoint job "72a4746f2b6f1d58e4c61c1e9214a7e3".
at org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:495)
at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:864)
at org.apache.flink.client.cli.CliFrontend.stop(CliFrontend.java:487)
at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:931)
at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:992)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:992)
Caused by: java.util.concurrent.TimeoutException
at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
at org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:493)
... 9 more
If I use CANCEL instead of STOP, it always works, but CANCEL doesn't shut the
job down gracefully, so I'm trying to use STOP.
Could someone please suggest how to fix this error?
Re: Flink STOP with savepoint
Posted by Zhanghao Chen <zh...@outlook.com>.
Hi Diwakar,
The client log doesn't contain much useful info except that the operation timed out. You could try:
1. Check the JobManager (JM) log to see if there is any relevant info.
2. Increase the client timeout to see if that helps.
Best,
Zhanghao Chen
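For context on why the UI and the CLI can disagree: the "Caused by" frames in the trace show a timed CompletableFuture.get(), which gives up waiting without cancelling the work behind the future. Below is a minimal Java sketch of that behavior. It is not Flink code; the class and method names (StopTimeoutSketch, startSavepoint, clientTimesOut), the result string, and the durations are made up for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class StopTimeoutSketch {

    // Hypothetical stand-in for the cluster-side savepoint:
    // the returned future completes after workMillis.
    public static CompletableFuture<String> startSavepoint(long workMillis) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                Thread.sleep(workMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
            return "savepoint-done";
        });
    }

    // Waits with a bounded get(), as the frames in the stack trace do.
    // Returns true if the wait timed out, false if the result arrived in time.
    public static boolean clientTimesOut(CompletableFuture<String> future, long timeoutMillis)
            throws InterruptedException, ExecutionException {
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return false;
        } catch (TimeoutException e) {
            // Only the waiting side gives up; the future itself is not cancelled.
            return true;
        }
    }

    public static void main(String[] args) throws Exception {
        CompletableFuture<String> savepoint = startSavepoint(500);
        boolean timedOut = clientTimesOut(savepoint, 100);
        // An unbounded get() shows that the work still runs to completion.
        String result = savepoint.get();
        System.out.println("client timed out: " + timedOut);
        System.out.println("work still finished with: " + result);
    }
}
```

If that is what happened here, the savepoint may well have been written even though the CLI reported a failure, which would match what the Flink UI shows. As far as I know, in Flink 1.11 the CLI's wait is bounded by akka.client.timeout (60 s by default), while later releases expose it as client.timeout; please check the configuration reference for your version before relying on either key.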