Posted to user@flink.apache.org by Vinicius Peracini <vi...@zenvia.com> on 2022/03/07 13:56:22 UTC

Could not stop job with a savepoint

Hello everyone,

I have a Flink job (version 1.14.0 running on EMR) and I'm having this
issue while trying to stop a job with a savepoint on S3:

org.apache.flink.util.FlinkException: Could not stop with a savepoint job "df3a3c590fabac737a17f1160c21094c".
    at org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:581)
    at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:1002)
    at org.apache.flink.client.cli.CliFrontend.stop(CliFrontend.java:569)
    at org.apache.flink.client.cli.CliFrontend.parseAndRun(CliFrontend.java:1069)
    at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1132)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
    at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
    at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1132)
Caused by: java.util.concurrent.ExecutionException: java.util.concurrent.CompletionException: org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint Coordinator is suspending.
    at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
    at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
    at org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:579)
    ... 9 more

I'm using incremental and unaligned checkpoints (the aligned checkpoint
timeout is 30 seconds). I also tried to create the savepoint without
stopping the job (using the flink savepoint command) and got the same
error. Any idea what is happening here?

Thanks in advance,

-- 
Disclaimer: The information contained in this document may be privileged 
and confidential and protected from disclosure. If the reader of this 
document is not the intended recipient, or an employee agent responsible 
for delivering this document to the intended recipient, you are hereby 
notified that any dissemination, distribution or copying of this 
communication is strictly prohibited.


Re: Could not stop job with a savepoint

Posted by Vinicius Peracini <vi...@zenvia.com>.
Hi Schwalbe!

Yes, I'm using RocksDBStateBackend. I guess your suspicion was right: after I
changed the memory allocator to jemalloc, the issue seems to be gone.

Here is what I did to change the memory allocator on EMR:

1. Installed the jemalloc package by using an EMR bootstrap action script:

sudo amazon-linux-extras install -y epel
sudo yum install -y jemalloc-devel

2. Configured Flink to use jemalloc:

"containerized.master.env.LD_PRELOAD": "/usr/lib64/libjemalloc.so"
"containerized.taskmanager.env.LD_PRELOAD": "/usr/lib64/libjemalloc.so"

Thank you so much!
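
One way to confirm that the preload configured above took effect is to look for libjemalloc in the memory map of a running TaskManager process. The following is only a minimal sketch: the helper name maps_have_jemalloc is made up for this example, and the pgrep pattern in the comment assumes the TaskManager JVM on YARN was started through org.apache.flink.yarn.YarnTaskExecutorRunner.

```shell
# Succeeds if a process-maps listing (the format of /proc/<pid>/maps),
# read from stdin, contains a mapping for libjemalloc.
maps_have_jemalloc() {
  grep -q 'libjemalloc\.so'
}

# Hypothetical usage on a core node (run as root or as the user that owns
# the container process, otherwise /proc/<pid>/maps is not readable):
#   pid=$(pgrep -f org.apache.flink.yarn.YarnTaskExecutorRunner | head -n 1)
#   maps_have_jemalloc < "/proc/${pid}/maps" && echo "jemalloc is preloaded"
```

The pattern matches versioned file names such as libjemalloc.so.1 as well, since /usr/lib64/libjemalloc.so is normally a symlink to a versioned library.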

Best,

On Thu, Mar 10, 2022 at 4:43 AM Schwalbe Matthias <
Matthias.Schwalbe@viseca.ch> wrote:

> Hi Vinicius,
>
> Your case, where the taskmanager is actively killed by YARN, is the other
> way I have seen this happen.
>
> You are using RocksDBStateBackend, right?
>
> I am not certain, but I strongly suspect that this has to do with the
> glibc bug that is apparently being worked on.
>
> There is some documentation here [1] and a solution that has been
> implemented for k8s containers [2], which replaces the glibc allocator with
> libjemalloc.so.
>
> However, we have not yet fully resolved our own encounter with the same
> problem.
>
> Our interim solution is to reserve some extra unused memory, which delays
> the problem but does not completely prevent it (we restart our jobs daily
> by taking a savepoint):
>
> flink-conf.yaml:
> …
> taskmanager.memory.managed.fraction: 0.2
> # reserve 2GB extra unused space (out of 8GB per TM) in order to mitigate
> # the glibc memory leakage problem
> taskmanager.memory.task.off-heap.size: 2048mb
> …
>
> I'm not sure whether, and starting with which Flink version, libjemalloc.so
> is integrated into the Flink runtime by default.
>
> … Flink team to the rescue 😊!
>
> Hope this helps
>
> Thias
>
>
>
> [1]
> https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/memory/mem_trouble/#container-memory-exceeded
>
> [2] https://issues.apache.org/jira/browse/FLINK-19125
>
>
>
> *From:* Vinicius Peracini <vi...@zenvia.com>
> *Sent:* Wednesday, March 9, 2022 17:56
> *To:* Schwalbe Matthias <Ma...@viseca.ch>
> *Cc:* Dawid Wysakowicz <dw...@apache.org>; user@flink.apache.org
> *Subject:* Re: Could not stop job with a savepoint
>
>
>
> So apparently the YARN container for the Task Manager is running out of
> memory during the savepoint execution. I never had any problems with
> checkpoints, though. Task Manager configuration:
>
>
>
> "taskmanager.memory.process.size": "10240m",
> "taskmanager.memory.managed.fraction": "0.6",
> "taskmanager.memory.jvm-overhead.fraction": "0.07",
> "taskmanager.memory.jvm-metaspace.size": "192mb",
> "taskmanager.network.memory.buffer-debloat.enabled": "true",
>
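
For reference, the settings above can be turned into a rough per-component breakdown by following Flink's documented TaskManager memory model. This is a sketch under assumptions: every option not listed in the configuration above (JVM overhead min/max, network fraction and min/max, framework heap and off-heap, task off-heap) is taken to be at its Flink 1.14 default, and the function name tm_memory_breakdown is hypothetical. All figures are in MB.

```shell
# Rough TaskManager memory breakdown, in MB.
# Arguments: process size, managed fraction, JVM overhead fraction, metaspace.
# Assumed Flink 1.14 defaults for anything not set in the thread:
#   jvm-overhead min/max 192/1024, network fraction 0.1 with min/max 64/1024,
#   framework heap 128, framework off-heap 128, task off-heap 0.
tm_memory_breakdown() {
  awk -v process="$1" -v managed_frac="$2" -v overhead_frac="$3" -v metaspace="$4" '
    function clamp(v, lo, hi) { return v < lo ? lo : (v > hi ? hi : v) }
    BEGIN {
      overhead  = clamp(process * overhead_frac, 192, 1024)
      flink     = process - metaspace - overhead      # total Flink memory
      managed   = flink * managed_frac                # native memory for RocksDB
      network   = clamp(flink * 0.1, 64, 1024)
      task_heap = flink - managed - network - 128 - 128
      printf "jvm_overhead=%.1f total_flink=%.1f managed=%.1f network=%.1f task_heap=%.1f\n",
             overhead, flink, managed, network, task_heap
    }'
}

# Values from the configuration quoted above.
tm_memory_breakdown 10240 0.6 0.07 192
```

With these inputs the sketch prints jvm_overhead=716.8 total_flink=9331.2 managed=5598.7 network=933.1 task_heap=2543.4. That leaves about 0.7 GB of JVM overhead as the only headroom for native allocations that Flink does not account for, assuming RocksDB stays within its managed memory budget.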
>
>
> On Wed, Mar 9, 2022 at 1:33 PM Vinicius Peracini <
> vinicius.peracini@zenvia.com> wrote:
>
> Good morning Schwalbe!
>
>
>
> Thanks for the reply.
>
>
>
> I'm using Flink 1.14.0. EMR is a managed cluster platform for running big
> data applications on AWS, so the Flink services run on YARN. I tried to
> create another savepoint today and was able to retrieve the Job Manager
> log:
>
>
>
> 2022-03-09 15:42:10,294 INFO  org.apache.flink.runtime.jobmaster.JobMaster
>                 [] - Triggering savepoint for job
> 6f9d71e57efba96dad7f5328ab9ac717.
> 2022-03-09 15:42:10,298 INFO
>  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] -
> Triggering checkpoint 1378 (type=SAVEPOINT) @ 1646840530294 for job
> 6f9d71e57efba96dad7f5328ab9ac717.
> 2022-03-09 15:45:19,636 WARN  akka.remote.transport.netty.NettyTransport
>                 [] - Remote connection to [/172.30.0.169:57520] failed
> with java.io.IOException: Connection reset by peer
> 2022-03-09 15:45:19,648 WARN  akka.remote.ReliableDeliverySupervisor
>                 [] - Association with remote system
> [akka.tcp://flink@ip-172-30-0-169.ec2.internal:46639] has failed, address
> is now gated for [50] ms. Reason: [Disassociated]
> 2022-03-09 15:45:19,652 WARN  akka.remote.ReliableDeliverySupervisor
>                 [] - Association with remote system
> [akka.tcp://flink-metrics@ip-172-30-0-169.ec2.internal:41533] has failed,
> address is now gated for [50] ms. Reason: [Disassociated]
> 2022-03-09 15:45:19,707 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
> LEFT_JOIN_MESSAGE_BULK -> Map (1/3) (866e32468227f9f0adac82e9b83b970a)
> switched from RUNNING to FAILED on container_1646341714746_0005_01_000004 @
> ip-172-30-0-165.ec2.internal (dataPort=40231).
> org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException:
> Connection unexpectedly closed by remote task manager
> 'ip-172-30-0-169.ec2.internal/172.30.0.169:34413'. This might indicate
> that the remote task manager was lost.
> at
> org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.channelInactive(CreditBasedPartitionRequestClientHandler.java:186)
> ~[flink-dist_2.12-1.14.0.jar:1.14.0]
> at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
> ~[flink-dist_2.12-1.14.0.jar:1.14.0]
> at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
> ~[flink-dist_2.12-1.14.0.jar:1.14.0]
> at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
> ~[flink-dist_2.12-1.14.0.jar:1.14.0]
> at
> org.apache.flink.shaded.netty4.io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)
> ~[flink-dist_2.12-1.14.0.jar:1.14.0]
> at
> org.apache.flink.runtime.io.network.netty.NettyMessageClientDecoderDelegate.channelInactive(NettyMessageClientDecoderDelegate.java:94)
> ~[flink-dist_2.12-1.14.0.jar:1.14.0]
> at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
> ~[flink-dist_2.12-1.14.0.jar:1.14.0]
> at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
> ~[flink-dist_2.12-1.14.0.jar:1.14.0]
> at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
> ~[flink-dist_2.12-1.14.0.jar:1.14.0]
> at
> org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1405)
> ~[flink-dist_2.12-1.14.0.jar:1.14.0]
> at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
> ~[flink-dist_2.12-1.14.0.jar:1.14.0]
> at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
> ~[flink-dist_2.12-1.14.0.jar:1.14.0]
> at
> org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:901)
> ~[flink-dist_2.12-1.14.0.jar:1.14.0]
> at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:831)
> ~[flink-dist_2.12-1.14.0.jar:1.14.0]
> at
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
> ~[flink-dist_2.12-1.14.0.jar:1.14.0]
> at
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
> ~[flink-dist_2.12-1.14.0.jar:1.14.0]
> at
> org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
> ~[flink-dist_2.12-1.14.0.jar:1.14.0]
> at
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
> ~[flink-dist_2.12-1.14.0.jar:1.14.0]
> at
> org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
> ~[flink-dist_2.12-1.14.0.jar:1.14.0]
> at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_322]
> 2022-03-09 15:45:19,720 INFO
>  org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy
> [] - Calculating tasks to restart to recover the failed task
> 5d739cfcb34ba91e39db0d6db0a4f1a2_0.
> 2022-03-09 15:45:19,721 INFO
>  org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy
> [] - 18 tasks should be restarted to recover the failed task
> 5d739cfcb34ba91e39db0d6db0a4f1a2_0.
> 2022-03-09 15:45:19,723 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job
> TOOL_MESSAGE_STREAM (6f9d71e57efba96dad7f5328ab9ac717) switched from state
> RUNNING to RESTARTING.
> 2022-03-09 15:45:19,728 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
> LEFT_JOIN_MESSAGE_BULK -> Map (3/3) (df8df89abf1761a726dd4593387cbd76)
> switched from RUNNING to CANCELING.
> 2022-03-09 15:45:19,732 INFO
>  org.apache.flink.runtime.source.coordinator.SourceCoordinator [] -
> Removing registered reader after failure for subtask 1 of source Source:
> BULK_SENDER_EVENT -> Filter -> Map.
> 2022-03-09 15:45:19,732 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
> BULK_SENDER_EVENT -> Filter -> Map (2/3) (2ba29adbeeceb86a69536082fbfb4931)
> switched from RUNNING to CANCELING.
> 2022-03-09 15:45:19,732 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
> LEFT_JOIN_MESSAGE_BULK -> Map (2/3) (3674e936354727457b06293592b65f20)
> switched from RUNNING to CANCELING.
> 2022-03-09 15:45:19,733 INFO
>  org.apache.flink.runtime.source.coordinator.SourceCoordinator [] -
> Removing registered reader after failure for subtask 0 of source Source:
> BULK_SENDER_EVENT -> Filter -> Map.
> 2022-03-09 15:45:19,733 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
> BULK_SENDER_EVENT -> Filter -> Map (1/3) (117be79102d086ddb65fb0b31c245f49)
> switched from RUNNING to CANCELING.
> 2022-03-09 15:45:19,733 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
> LEFT_JOIN_MESSAGE_FLOW -> Map (3/3) (3add641adeebe4e14dd0111bd647aa75)
> switched from RUNNING to CANCELING.
> 2022-03-09 15:45:19,733 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
> LEFT_JOIN_MESSAGE_FLOW -> Map (2/3) (f5f4d0d49c1d2c2c65685ccb6c35eab4)
> switched from RUNNING to CANCELING.
> 2022-03-09 15:45:19,733 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
> DEDUP_TOOL_MESSAGE_EVENT -> Sink SINK_TOOL_MESSAGE_EVENT (3/3)
> (07a1c10037596788aee6844603ab17a2) switched from RUNNING to CANCELING.
> 2022-03-09 15:45:19,733 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
> LEFT_JOIN_MESSAGE_FLOW -> Map (1/3) (3caaf3a8072b29567d94691d1765a294)
> switched from RUNNING to CANCELING.
> 2022-03-09 15:45:19,733 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
> DEDUP_TOOL_MESSAGE_EVENT -> Sink SINK_TOOL_MESSAGE_EVENT (2/3)
> (216e7b0a7d0f8c887a5c91a3e5267c73) switched from RUNNING to CANCELING.
> 2022-03-09 15:45:19,733 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
> DEDUP_TOOL_MESSAGE_EVENT -> Sink SINK_TOOL_MESSAGE_EVENT (1/3)
> (2e3973007d7c3d71f10456c3aca5a0b3) switched from RUNNING to CANCELING.
> 2022-03-09 15:45:19,733 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
> MESSAGE_EVENT -> Filter -> Map -> (Filter -> Map, Filter -> (Filter,
> Filter)) (3/3) (bcebf365a3e6da0281d80b3ab0e2cff8) switched from RUNNING to
> CANCELING.
> 2022-03-09 15:45:19,733 INFO
>  org.apache.flink.runtime.source.coordinator.SourceCoordinator [] -
> Removing registered reader after failure for subtask 2 of source Source:
> MESSAGE_EVENT -> Filter -> Map -> (Filter -> Map, Filter -> (Filter,
> Filter)).
> 2022-03-09 15:45:19,734 INFO
>  org.apache.flink.runtime.source.coordinator.SourceCoordinator [] -
> Removing registered reader after failure for subtask 1 of source Source:
> MESSAGE_EVENT -> Filter -> Map -> (Filter -> Map, Filter -> (Filter,
> Filter)).
> 2022-03-09 15:45:19,734 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
> MESSAGE_EVENT -> Filter -> Map -> (Filter -> Map, Filter -> (Filter,
> Filter)) (2/3) (8d912317f46bf11f866bd70be09377ef) switched from RUNNING to
> CANCELING.
> 2022-03-09 15:45:19,734 INFO
>  org.apache.flink.runtime.source.coordinator.SourceCoordinator [] -
> Removing registered reader after failure for subtask 0 of source Source:
> MESSAGE_EVENT -> Filter -> Map -> (Filter -> Map, Filter -> (Filter,
> Filter)).
> 2022-03-09 15:45:19,734 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
> MESSAGE_EVENT -> Filter -> Map -> (Filter -> Map, Filter -> (Filter,
> Filter)) (1/3) (c4a8b6f2d0b7ea837dd4e69a107e652e) switched from RUNNING to
> CANCELING.
> 2022-03-09 15:45:19,735 INFO
>  org.apache.flink.runtime.source.coordinator.SourceCoordinator [] -
> Removing registered reader after failure for subtask 2 of source Source:
> FLOW_EVENT -> Filter -> Map.
> 2022-03-09 15:45:19,735 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
> FLOW_EVENT -> Filter -> Map (3/3) (b811833b1ce35416fae61aba7cdbeb53)
> switched from RUNNING to CANCELING.
> 2022-03-09 15:45:19,735 INFO
>  org.apache.flink.runtime.source.coordinator.SourceCoordinator [] -
> Removing registered reader after failure for subtask 1 of source Source:
> FLOW_EVENT -> Filter -> Map.
> 2022-03-09 15:45:19,735 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
> FLOW_EVENT -> Filter -> Map (2/3) (c5048ba45046caa3b954257602b4f0a4)
> switched from RUNNING to CANCELING.
> 2022-03-09 15:45:19,735 INFO
>  org.apache.flink.runtime.source.coordinator.SourceCoordinator [] -
> Removing registered reader after failure for subtask 0 of source Source:
> FLOW_EVENT -> Filter -> Map.
> 2022-03-09 15:45:19,735 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
> FLOW_EVENT -> Filter -> Map (1/3) (4693404a84c87a07b1f35ed62f9edad7)
> switched from RUNNING to CANCELING.
> 2022-03-09 15:45:19,735 INFO
>  org.apache.flink.runtime.source.coordinator.SourceCoordinator [] -
> Removing registered reader after failure for subtask 2 of source Source:
> BULK_SENDER_EVENT -> Filter -> Map.
> 2022-03-09 15:45:19,735 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
> BULK_SENDER_EVENT -> Filter -> Map (3/3) (d4421c51bf8fb716c672e13fd249450f)
> switched from RUNNING to CANCELING.
> 2022-03-09 15:45:19,744 WARN  akka.remote.transport.netty.NettyTransport
>                 [] - Remote connection to [null] failed with
> java.net.ConnectException: Connection refused: ip-172-30-0-169.ec2.internal/
> 172.30.0.169:46639
> 2022-03-09 15:45:19,746 WARN  akka.remote.ReliableDeliverySupervisor
>                 [] - Association with remote system
> [akka.tcp://flink@ip-172-30-0-169.ec2.internal:46639] has failed, address
> is now gated for [50] ms. Reason: [Association failed with
> [akka.tcp://flink@ip-172-30-0-169.ec2.internal:46639]] Caused by:
> [java.net.ConnectException: Connection refused:
> ip-172-30-0-169.ec2.internal/172.30.0.169:46639]
> 2022-03-09 15:45:19,751 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
> LEFT_JOIN_MESSAGE_BULK -> Map (3/3) (df8df89abf1761a726dd4593387cbd76)
> switched from CANCELING to CANCELED.
> 2022-03-09 15:45:19,751 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
> Discarding the results produced by task execution
> df8df89abf1761a726dd4593387cbd76.
> 2022-03-09 15:45:19,753 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
> Discarding the results produced by task execution
> df8df89abf1761a726dd4593387cbd76.
> 2022-03-09 15:45:19,754 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
> LEFT_JOIN_MESSAGE_FLOW -> Map (3/3) (3add641adeebe4e14dd0111bd647aa75)
> switched from CANCELING to CANCELED.
> 2022-03-09 15:45:19,755 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
> Discarding the results produced by task execution
> 3add641adeebe4e14dd0111bd647aa75.
> 2022-03-09 15:45:19,755 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
> Discarding the results produced by task execution
> 3add641adeebe4e14dd0111bd647aa75.
> 2022-03-09 15:45:19,755 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
> DEDUP_TOOL_MESSAGE_EVENT -> Sink SINK_TOOL_MESSAGE_EVENT (3/3)
> (07a1c10037596788aee6844603ab17a2) switched from CANCELING to CANCELED.
> 2022-03-09 15:45:19,755 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
> Discarding the results produced by task execution
> 07a1c10037596788aee6844603ab17a2.
> 2022-03-09 15:45:19,755 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
> MESSAGE_EVENT -> Filter -> Map -> (Filter -> Map, Filter -> (Filter,
> Filter)) (3/3) (bcebf365a3e6da0281d80b3ab0e2cff8) switched from CANCELING
> to CANCELED.
> 2022-03-09 15:45:19,755 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
> Discarding the results produced by task execution
> bcebf365a3e6da0281d80b3ab0e2cff8.
> 2022-03-09 15:45:19,755 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
> Discarding the results produced by task execution
> bcebf365a3e6da0281d80b3ab0e2cff8.
> 2022-03-09 15:45:19,755 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
> FLOW_EVENT -> Filter -> Map (3/3) (b811833b1ce35416fae61aba7cdbeb53)
> switched from CANCELING to CANCELED.
> 2022-03-09 15:45:19,755 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
> Discarding the results produced by task execution
> b811833b1ce35416fae61aba7cdbeb53.
> 2022-03-09 15:45:19,755 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
> Discarding the results produced by task execution
> b811833b1ce35416fae61aba7cdbeb53.
> 2022-03-09 15:45:19,756 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
> BULK_SENDER_EVENT -> Filter -> Map (3/3) (d4421c51bf8fb716c672e13fd249450f)
> switched from CANCELING to CANCELED.
> 2022-03-09 15:45:19,757 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
> Discarding the results produced by task execution
> d4421c51bf8fb716c672e13fd249450f.
> 2022-03-09 15:45:19,757 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
> Discarding the results produced by task execution
> d4421c51bf8fb716c672e13fd249450f.
> 2022-03-09 15:45:19,757 INFO
>  org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager
> [] - Received resource requirements from job
> 6f9d71e57efba96dad7f5328ab9ac717:
> [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN},
> numberOfRequiredSlots=2}]
> 2022-03-09 15:45:19,790 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
> BULK_SENDER_EVENT -> Filter -> Map (1/3) (117be79102d086ddb65fb0b31c245f49)
> switched from CANCELING to CANCELED.
> 2022-03-09 15:45:19,797 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
> FLOW_EVENT -> Filter -> Map (1/3) (4693404a84c87a07b1f35ed62f9edad7)
> switched from CANCELING to CANCELED.
> 2022-03-09 15:45:19,801 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
> MESSAGE_EVENT -> Filter -> Map -> (Filter -> Map, Filter -> (Filter,
> Filter)) (1/3) (c4a8b6f2d0b7ea837dd4e69a107e652e) switched from CANCELING
> to CANCELED.
> 2022-03-09 15:45:20,026 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
> DEDUP_TOOL_MESSAGE_EVENT -> Sink SINK_TOOL_MESSAGE_EVENT (1/3)
> (2e3973007d7c3d71f10456c3aca5a0b3) switched from CANCELING to CANCELED.
> 2022-03-09 15:45:23,957 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Worker container_1646341714746_0005_01_000003 is terminated. Diagnostics: Container container_1646341714746_0005_01_000003 marked as failed.
>  Exit code:137.
>  Diagnostics:[2022-03-09 15:45:19.639]Container killed on request. Exit code is 137
> [2022-03-09 15:45:19.641]Container exited with a non-zero exit code 137.
> [2022-03-09 15:45:19.642]Killed by external signal
>
> 2022-03-09 15:45:23,957 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Closing TaskExecutor connection container_1646341714746_0005_01_000003 because: Container container_1646341714746_0005_01_000003 marked as failed.
>  Exit code:137.
>  Diagnostics:[2022-03-09 15:45:19.639]Container killed on request. Exit code is 137
> [2022-03-09 15:45:19.641]Container exited with a non-zero exit code 137.
> [2022-03-09 15:45:19.642]Killed by external signal
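
The exit code 137 in these diagnostics follows the usual shell convention of 128 plus the signal number, so it corresponds to signal 9 (SIGKILL). That is consistent with the container being killed from outside rather than failing inside the JVM. A small sketch of the decoding, where the function name signal_for_exit_status is made up for this example:

```shell
# Translate an exit status above 128 into the name of the signal that ended
# the process (status = 128 + signal number). Prints "none" otherwise.
signal_for_exit_status() {
  if [ "$1" -gt 128 ]; then
    kill -l "$(( $1 - 128 ))"
  else
    echo "none"
  fi
}

signal_for_exit_status 137   # 137 - 128 = 9, which kill -l names KILL
```

The signal alone does not say who sent it. Whether it was the YARN NodeManager enforcing the container's memory limit or the kernel OOM killer would have to be checked in the NodeManager and system logs.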
>
>
>
> Thanks,
>
>
>
> On Tue, Mar 8, 2022 at 4:57 AM Schwalbe Matthias <
> Matthias.Schwalbe@viseca.ch> wrote:
>
> Good morning Vinicius,
>
>
>
> Can you still find (and post) the exception stack trace from your
> JobManager log? The Flink client log does not reveal enough information.
>
> Your situation reminds me of something similar I had.
>
> In the log you might find something like this:
>
>
>
> 2022-03-07 02:15:41,347 INFO
> org.apache.flink.runtime.jobmaster.JobMaster                 [] -
> Triggering stop-with-savepoint for job e12f22653f79194863ab426312dd666a.
>
> 2022-03-07 02:15:41,380 INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering
> checkpoint 4983974 (type=SAVEPOINT_SUSPEND) @ 1646615741347 for job
> e12f22653f79194863ab426312dd666a.
>
> 2022-03-07 02:15:43,042 INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Decline
> checkpoint 4983974 by task 0e659ac720e3e0b3e4072dbc1cc85cd3 of job
> e12f22653f79194863ab426312dd666a at
> container_e1093_1646358077201_0002_01_000001 @ ulxxphaddtn02.adgr.net
> (dataPort=44767).
>
> org.apache.flink.util.SerializedThrowable: Asynchronous task checkpoint
> failed.
>
>             at
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.handleExecutionException(AsyncCheckpointRunnable.java:279)
> ~[flink-dist_2.11-1.13.0.jar:1.13.0]
>
>
>
> BTW, which Flink version are you running?
>
> What is EMR (which technology runs underneath)?
>
>
>
>
>
>
>
> *From:* Vinicius Peracini <vi...@zenvia.com>
> *Sent:* Monday, March 7, 2022 20:46
> *To:* Dawid Wysakowicz <dw...@apache.org>
> *Cc:* user@flink.apache.org
> *Subject:* Re: Could not stop job with a savepoint
>
>
>
> Hi Dawid, thanks for the reply.
>
>
>
> The job was still in progress and producing events. Unfortunately I was
> not able to stop the job with a savepoint or to just create a savepoint. I
> had to stop the job without the savepoint and restore the state using the
> last checkpoint. Still reviewing my configuration and trying to figure out
> why this is happening. Any help would be appreciated.
>
>
>
> Thanks!
>
>
>
>
>
> On Mon, Mar 7, 2022 at 11:56 AM Dawid Wysakowicz <dw...@apache.org>
> wrote:
>
> Hi,
>
> From the exception, it seems the job had already finished when you
> triggered the savepoint.
>
> Best,
>
> Dawid
>
> This message is intended only for the named recipient and may contain
> confidential or privileged information. As the confidentiality of email
> communication cannot be guaranteed, we do not accept any responsibility for
> the confidentiality and the intactness of this message. If you have
> received it in error, please advise the sender by return e-mail and delete
> this message and any attachments. Any unauthorised use or dissemination of
> this information is strictly prohibited.
>
>
>



RE: Could not stop job with a savepoint

Posted by Schwalbe Matthias <Ma...@viseca.ch>.
Hi Vinicius,

Your case, where the TaskManager is actively killed by YARN, is the other way I have seen this happen.

You are using RocksDBStateBackend, right?
I'm not certain, but I strongly suspect this has to do with the glibc bug that seems to be at work here.
There is some documentation here [1], and a fix implemented for k8s containers [2] that replaces the glibc allocator with libjemalloc.so.

However, we have not fully resolved the same problem on our side.
Our interim workaround is to reserve some extra, unused memory, which delays the problem but does not completely prevent it (we restart our jobs daily by taking a savepoint):

flink-conf.yaml:
…
taskmanager.memory.managed.fraction: 0.2
#reserve 2GB extra unused space (out of 8GB per TM) in order to mitigate the glibc memory leakage problem
taskmanager.memory.task.off-heap.size: 2048mb
…

I'm not sure whether, and starting with which Flink version, libjemalloc.so is integrated into the Flink runtime by default.
Flink team to the rescue 😊!

Hope this helps

Thias

[1] https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/memory/mem_trouble/#container-memory-exceeded
[2] https://issues.apache.org/jira/browse/FLINK-19125

From: Vinicius Peracini <vi...@zenvia.com>
Sent: Wednesday, 9 March 2022 17:56
To: Schwalbe Matthias <Ma...@viseca.ch>
Cc: Dawid Wysakowicz <dw...@apache.org>; user@flink.apache.org
Subject: Re: Could not stop job with a savepoint

So apparently the YARN container for Task Manager is running out of memory during the savepoint execution. Never had any problems with checkpoints though. Task Manager configuration:

"taskmanager.memory.process.size": "10240m",
"taskmanager.memory.managed.fraction": "0.6",
"taskmanager.memory.jvm-overhead.fraction": "0.07",
"taskmanager.memory.jvm-metaspace.size": "192mb",
"taskmanager.network.memory.buffer-debloat.enabled": "true",

On Wed, Mar 9, 2022 at 1:33 PM Vinicius Peracini <vi...@zenvia.com> wrote:
Bom dia Schwalbe!

Thanks for the reply.

I'm using Flink 1.14.0. EMR is a managed cluster platform to run big data applications on AWS. This way Flink services are running on YARN. I tried to create another savepoint today and was able to retrieve the Job Manager log:

2022-03-09 15:42:10,294 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Triggering savepoint for job 6f9d71e57efba96dad7f5328ab9ac717.
2022-03-09 15:42:10,298 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering checkpoint 1378 (type=SAVEPOINT) @ 1646840530294 for job 6f9d71e57efba96dad7f5328ab9ac717.
2022-03-09 15:45:19,636 WARN  akka.remote.transport.netty.NettyTransport                   [] - Remote connection to [/172.30.0.169:57520] failed with java.io.IOException: Connection reset by peer
2022-03-09 15:45:19,648 WARN  akka.remote.ReliableDeliverySupervisor                       [] - Association with remote system [akka.tcp://flink@ip-172-30-0-169.ec2.internal:46639] has failed, address is now gated for [50] ms. Reason: [Disassociated]
2022-03-09 15:45:19,652 WARN  akka.remote.ReliableDeliverySupervisor                       [] - Association with remote system [akka.tcp://flink-metrics@ip-172-30-0-169.ec2.internal:41533] has failed, address is now gated for [50] ms. Reason: [Disassociated]
2022-03-09 15:45:19,707 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - LEFT_JOIN_MESSAGE_BULK -> Map (1/3) (866e32468227f9f0adac82e9b83b970a) switched from RUNNING to FAILED on container_1646341714746_0005_01_000004 @ ip-172-30-0-165.ec2.internal (dataPort=40231).
org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connection unexpectedly closed by remote task manager 'ip-172-30-0-169.ec2.internal/172.30.0.169:34413'. This might indicate that the remote task manager was lost.
at org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.channelInactive(CreditBasedPartitionRequestClientHandler.java:186) ~[flink-dist_2.12-1.14.0.jar:1.14.0]
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262) ~[flink-dist_2.12-1.14.0.jar:1.14.0]
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248) ~[flink-dist_2.12-1.14.0.jar:1.14.0]
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241) ~[flink-dist_2.12-1.14.0.jar:1.14.0]
at org.apache.flink.shaded.netty4.io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81) ~[flink-dist_2.12-1.14.0.jar:1.14.0]
at org.apache.flink.runtime.io.network.netty.NettyMessageClientDecoderDelegate.channelInactive(NettyMessageClientDecoderDelegate.java:94) ~[flink-dist_2.12-1.14.0.jar:1.14.0]
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262) ~[flink-dist_2.12-1.14.0.jar:1.14.0]
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248) ~[flink-dist_2.12-1.14.0.jar:1.14.0]
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241) ~[flink-dist_2.12-1.14.0.jar:1.14.0]
at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1405) ~[flink-dist_2.12-1.14.0.jar:1.14.0]
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262) ~[flink-dist_2.12-1.14.0.jar:1.14.0]
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248) ~[flink-dist_2.12-1.14.0.jar:1.14.0]
at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:901) ~[flink-dist_2.12-1.14.0.jar:1.14.0]
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:831) ~[flink-dist_2.12-1.14.0.jar:1.14.0]
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) ~[flink-dist_2.12-1.14.0.jar:1.14.0]
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472) ~[flink-dist_2.12-1.14.0.jar:1.14.0]
at org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384) ~[flink-dist_2.12-1.14.0.jar:1.14.0]
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) ~[flink-dist_2.12-1.14.0.jar:1.14.0]
at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[flink-dist_2.12-1.14.0.jar:1.14.0]
at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_322]
2022-03-09 15:45:19,720 INFO  org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy [] - Calculating tasks to restart to recover the failed task 5d739cfcb34ba91e39db0d6db0a4f1a2_0.
2022-03-09 15:45:19,721 INFO  org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy [] - 18 tasks should be restarted to recover the failed task 5d739cfcb34ba91e39db0d6db0a4f1a2_0.
2022-03-09 15:45:19,723 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job TOOL_MESSAGE_STREAM (6f9d71e57efba96dad7f5328ab9ac717) switched from state RUNNING to RESTARTING.
2022-03-09 15:45:19,728 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - LEFT_JOIN_MESSAGE_BULK -> Map (3/3) (df8df89abf1761a726dd4593387cbd76) switched from RUNNING to CANCELING.
2022-03-09 15:45:19,732 INFO  org.apache.flink.runtime.source.coordinator.SourceCoordinator [] - Removing registered reader after failure for subtask 1 of source Source: BULK_SENDER_EVENT -> Filter -> Map.
2022-03-09 15:45:19,732 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source: BULK_SENDER_EVENT -> Filter -> Map (2/3) (2ba29adbeeceb86a69536082fbfb4931) switched from RUNNING to CANCELING.
2022-03-09 15:45:19,732 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - LEFT_JOIN_MESSAGE_BULK -> Map (2/3) (3674e936354727457b06293592b65f20) switched from RUNNING to CANCELING.
2022-03-09 15:45:19,733 INFO  org.apache.flink.runtime.source.coordinator.SourceCoordinator [] - Removing registered reader after failure for subtask 0 of source Source: BULK_SENDER_EVENT -> Filter -> Map.
2022-03-09 15:45:19,733 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source: BULK_SENDER_EVENT -> Filter -> Map (1/3) (117be79102d086ddb65fb0b31c245f49) switched from RUNNING to CANCELING.
2022-03-09 15:45:19,733 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - LEFT_JOIN_MESSAGE_FLOW -> Map (3/3) (3add641adeebe4e14dd0111bd647aa75) switched from RUNNING to CANCELING.
2022-03-09 15:45:19,733 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - LEFT_JOIN_MESSAGE_FLOW -> Map (2/3) (f5f4d0d49c1d2c2c65685ccb6c35eab4) switched from RUNNING to CANCELING.
2022-03-09 15:45:19,733 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - DEDUP_TOOL_MESSAGE_EVENT -> Sink SINK_TOOL_MESSAGE_EVENT (3/3) (07a1c10037596788aee6844603ab17a2) switched from RUNNING to CANCELING.
2022-03-09 15:45:19,733 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - LEFT_JOIN_MESSAGE_FLOW -> Map (1/3) (3caaf3a8072b29567d94691d1765a294) switched from RUNNING to CANCELING.
2022-03-09 15:45:19,733 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - DEDUP_TOOL_MESSAGE_EVENT -> Sink SINK_TOOL_MESSAGE_EVENT (2/3) (216e7b0a7d0f8c887a5c91a3e5267c73) switched from RUNNING to CANCELING.
2022-03-09 15:45:19,733 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - DEDUP_TOOL_MESSAGE_EVENT -> Sink SINK_TOOL_MESSAGE_EVENT (1/3) (2e3973007d7c3d71f10456c3aca5a0b3) switched from RUNNING to CANCELING.
2022-03-09 15:45:19,733 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source: MESSAGE_EVENT -> Filter -> Map -> (Filter -> Map, Filter -> (Filter, Filter)) (3/3) (bcebf365a3e6da0281d80b3ab0e2cff8) switched from RUNNING to CANCELING.
2022-03-09 15:45:19,733 INFO  org.apache.flink.runtime.source.coordinator.SourceCoordinator [] - Removing registered reader after failure for subtask 2 of source Source: MESSAGE_EVENT -> Filter -> Map -> (Filter -> Map, Filter -> (Filter, Filter)).
2022-03-09 15:45:19,734 INFO  org.apache.flink.runtime.source.coordinator.SourceCoordinator [] - Removing registered reader after failure for subtask 1 of source Source: MESSAGE_EVENT -> Filter -> Map -> (Filter -> Map, Filter -> (Filter, Filter)).
2022-03-09 15:45:19,734 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source: MESSAGE_EVENT -> Filter -> Map -> (Filter -> Map, Filter -> (Filter, Filter)) (2/3) (8d912317f46bf11f866bd70be09377ef) switched from RUNNING to CANCELING.
2022-03-09 15:45:19,734 INFO  org.apache.flink.runtime.source.coordinator.SourceCoordinator [] - Removing registered reader after failure for subtask 0 of source Source: MESSAGE_EVENT -> Filter -> Map -> (Filter -> Map, Filter -> (Filter, Filter)).
2022-03-09 15:45:19,734 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source: MESSAGE_EVENT -> Filter -> Map -> (Filter -> Map, Filter -> (Filter, Filter)) (1/3) (c4a8b6f2d0b7ea837dd4e69a107e652e) switched from RUNNING to CANCELING.
2022-03-09 15:45:19,735 INFO  org.apache.flink.runtime.source.coordinator.SourceCoordinator [] - Removing registered reader after failure for subtask 2 of source Source: FLOW_EVENT -> Filter -> Map.
2022-03-09 15:45:19,735 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source: FLOW_EVENT -> Filter -> Map (3/3) (b811833b1ce35416fae61aba7cdbeb53) switched from RUNNING to CANCELING.
2022-03-09 15:45:19,735 INFO  org.apache.flink.runtime.source.coordinator.SourceCoordinator [] - Removing registered reader after failure for subtask 1 of source Source: FLOW_EVENT -> Filter -> Map.
2022-03-09 15:45:19,735 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source: FLOW_EVENT -> Filter -> Map (2/3) (c5048ba45046caa3b954257602b4f0a4) switched from RUNNING to CANCELING.
2022-03-09 15:45:19,735 INFO  org.apache.flink.runtime.source.coordinator.SourceCoordinator [] - Removing registered reader after failure for subtask 0 of source Source: FLOW_EVENT -> Filter -> Map.
2022-03-09 15:45:19,735 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source: FLOW_EVENT -> Filter -> Map (1/3) (4693404a84c87a07b1f35ed62f9edad7) switched from RUNNING to CANCELING.
2022-03-09 15:45:19,735 INFO  org.apache.flink.runtime.source.coordinator.SourceCoordinator [] - Removing registered reader after failure for subtask 2 of source Source: BULK_SENDER_EVENT -> Filter -> Map.
2022-03-09 15:45:19,735 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source: BULK_SENDER_EVENT -> Filter -> Map (3/3) (d4421c51bf8fb716c672e13fd249450f) switched from RUNNING to CANCELING.
2022-03-09 15:45:19,744 WARN  akka.remote.transport.netty.NettyTransport                   [] - Remote connection to [null] failed with java.net.ConnectException: Connection refused: ip-172-30-0-169.ec2.internal/172.30.0.169:46639
2022-03-09 15:45:19,746 WARN  akka.remote.ReliableDeliverySupervisor                       [] - Association with remote system [akka.tcp://flink@ip-172-30-0-169.ec2.internal:46639] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@ip-172-30-0-169.ec2.internal:46639]] Caused by: [java.net.ConnectException: Connection refused: ip-172-30-0-169.ec2.internal/172.30.0.169:46639]
2022-03-09 15:45:19,751 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - LEFT_JOIN_MESSAGE_BULK -> Map (3/3) (df8df89abf1761a726dd4593387cbd76) switched from CANCELING to CANCELED.
2022-03-09 15:45:19,751 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Discarding the results produced by task execution df8df89abf1761a726dd4593387cbd76.
2022-03-09 15:45:19,753 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Discarding the results produced by task execution df8df89abf1761a726dd4593387cbd76.
2022-03-09 15:45:19,754 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - LEFT_JOIN_MESSAGE_FLOW -> Map (3/3) (3add641adeebe4e14dd0111bd647aa75) switched from CANCELING to CANCELED.
2022-03-09 15:45:19,755 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Discarding the results produced by task execution 3add641adeebe4e14dd0111bd647aa75.
2022-03-09 15:45:19,755 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Discarding the results produced by task execution 3add641adeebe4e14dd0111bd647aa75.
2022-03-09 15:45:19,755 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - DEDUP_TOOL_MESSAGE_EVENT -> Sink SINK_TOOL_MESSAGE_EVENT (3/3) (07a1c10037596788aee6844603ab17a2) switched from CANCELING to CANCELED.
2022-03-09 15:45:19,755 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Discarding the results produced by task execution 07a1c10037596788aee6844603ab17a2.
2022-03-09 15:45:19,755 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source: MESSAGE_EVENT -> Filter -> Map -> (Filter -> Map, Filter -> (Filter, Filter)) (3/3) (bcebf365a3e6da0281d80b3ab0e2cff8) switched from CANCELING to CANCELED.
2022-03-09 15:45:19,755 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Discarding the results produced by task execution bcebf365a3e6da0281d80b3ab0e2cff8.
2022-03-09 15:45:19,755 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Discarding the results produced by task execution bcebf365a3e6da0281d80b3ab0e2cff8.
2022-03-09 15:45:19,755 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source: FLOW_EVENT -> Filter -> Map (3/3) (b811833b1ce35416fae61aba7cdbeb53) switched from CANCELING to CANCELED.
2022-03-09 15:45:19,755 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Discarding the results produced by task execution b811833b1ce35416fae61aba7cdbeb53.
2022-03-09 15:45:19,755 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Discarding the results produced by task execution b811833b1ce35416fae61aba7cdbeb53.
2022-03-09 15:45:19,756 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source: BULK_SENDER_EVENT -> Filter -> Map (3/3) (d4421c51bf8fb716c672e13fd249450f) switched from CANCELING to CANCELED.
2022-03-09 15:45:19,757 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Discarding the results produced by task execution d4421c51bf8fb716c672e13fd249450f.
2022-03-09 15:45:19,757 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Discarding the results produced by task execution d4421c51bf8fb716c672e13fd249450f.
2022-03-09 15:45:19,757 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] - Received resource requirements from job 6f9d71e57efba96dad7f5328ab9ac717: [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, numberOfRequiredSlots=2}]
2022-03-09 15:45:19,790 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source: BULK_SENDER_EVENT -> Filter -> Map (1/3) (117be79102d086ddb65fb0b31c245f49) switched from CANCELING to CANCELED.
2022-03-09 15:45:19,797 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source: FLOW_EVENT -> Filter -> Map (1/3) (4693404a84c87a07b1f35ed62f9edad7) switched from CANCELING to CANCELED.
2022-03-09 15:45:19,801 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source: MESSAGE_EVENT -> Filter -> Map -> (Filter -> Map, Filter -> (Filter, Filter)) (1/3) (c4a8b6f2d0b7ea837dd4e69a107e652e) switched from CANCELING to CANCELED.
2022-03-09 15:45:20,026 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - DEDUP_TOOL_MESSAGE_EVENT -> Sink SINK_TOOL_MESSAGE_EVENT (1/3) (2e3973007d7c3d71f10456c3aca5a0b3) switched from CANCELING to CANCELED.
2022-03-09 15:45:23,957 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Worker container_1646341714746_0005_01_000003 is terminated. Diagnostics: Container container_1646341714746_0005_01_000003 marked as failed.
 Exit code:137.
 Diagnostics:[2022-03-09 15:45:19.639]Container killed on request. Exit code is 137
[2022-03-09 15:45:19.641]Container exited with a non-zero exit code 137.
[2022-03-09 15:45:19.642]Killed by external signal

2022-03-09 15:45:23,957 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Closing TaskExecutor connection container_1646341714746_0005_01_000003 because: Container container_1646341714746_0005_01_000003 marked as failed.
 Exit code:137.
 Diagnostics:[2022-03-09 15:45:19.639]Container killed on request. Exit code is 137
[2022-03-09 15:45:19.641]Container exited with a non-zero exit code 137.
[2022-03-09 15:45:19.642]Killed by external signal
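
Exit code 137 in the diagnostics above follows the common convention of 128 plus the signal number. A minimal Python sketch of that decoding (the helper name decode_exit_code is only illustrative, and signal names depend on the platform):

```python
import signal


def decode_exit_code(exit_code: int) -> str:
    """Describe a process exit code using the 128 + signal-number convention."""
    if exit_code > 128:
        signum = exit_code - 128
        try:
            # signal.Signals maps a signal number to its symbolic name.
            return f"killed by {signal.Signals(signum).name} (signal {signum})"
        except ValueError:
            return f"killed by unknown signal {signum}"
    return f"exited normally with status {exit_code}"


print(decode_exit_code(137))
```

On Linux, 137 decodes to SIGKILL (signal 9), which is consistent with the container being killed from outside the JVM. The log does not say who sent the signal; a container exceeding its memory limit is one common cause, but that is an inference, not something the diagnostics state.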

Thanks,

On Tue, Mar 8, 2022 at 4:57 AM Schwalbe Matthias <Ma...@viseca.ch> wrote:
Bom Dia Vinicius,

Can you still find (and post) the exception stack trace from your JobManager log? The Flink client log does not reveal enough information.
Your situation reminds me of something similar I had.
In the log you might find something like this or similar:

2022-03-07 02:15:41,347 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Triggering stop-with-savepoint for job e12f22653f79194863ab426312dd666a.
2022-03-07 02:15:41,380 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering checkpoint 4983974 (type=SAVEPOINT_SUSPEND) @ 1646615741347 for job e12f22653f79194863ab426312dd666a.
2022-03-07 02:15:43,042 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Decline checkpoint 4983974 by task 0e659ac720e3e0b3e4072dbc1cc85cd3 of job e12f22653f79194863ab426312dd666a at container_e1093_1646358077201_0002_01_000001 @ ulxxphaddtn02.adgr.net (dataPort=44767).
org.apache.flink.util.SerializedThrowable: Asynchronous task checkpoint failed.
            at org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.handleExecutionException(AsyncCheckpointRunnable.java:279) ~[flink-dist_2.11-1.13.0.jar:1.13.0]

BTW, what Flink version are you running?
What is EMR (what technology is underneath)?



From: Vinicius Peracini <vi...@zenvia.com>
Sent: Monday, 7 March 2022 20:46
To: Dawid Wysakowicz <dw...@apache.org>
Cc: user@flink.apache.org
Subject: Re: Could not stop job with a savepoint

Hi Dawid, thanks for the reply.

The job was still in progress and producing events. Unfortunately I was not able to stop the job with a savepoint or to just create a savepoint. I had to stop the job without the savepoint and restore the state using the last checkpoint. Still reviewing my configuration and trying to figure out why this is happening. Any help would be appreciated.

Thanks!


On Mon, Mar 7, 2022 at 11:56 AM Dawid Wysakowicz <dw...@apache.org> wrote:

Hi,

From the exception, it seems the job had already finished when you triggered the savepoint.

Best,

Dawid
On 07/03/2022 14:56, Vinicius Peracini wrote:
Hello everyone,

I have a Flink job (version 1.14.0 running on EMR) and I'm having this issue while trying to stop a job with a savepoint on S3:

org.apache.flink.util.FlinkException: Could not stop with a savepoint job "df3a3c590fabac737a17f1160c21094c".
at org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:581)
at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:1002)
at org.apache.flink.client.cli.CliFrontend.stop(CliFrontend.java:569)
at org.apache.flink.client.cli.CliFrontend.parseAndRun(CliFrontend.java:1069)
at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1132)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1132)
Caused by: java.util.concurrent.ExecutionException: java.util.concurrent.CompletionException: org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint Coordinator is suspending.
at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
at org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:579)
... 9 more

I'm using incremental and unaligned checkpoints (aligned checkpoint timeout is 30 seconds). I also tried to create the savepoint without stopping the job (using flink savepoint command) and got the same error. Any idea what is happening here?

Thanks in advance,

Re: Could not stop job with a savepoint

Posted by Vinicius Peracini <vi...@zenvia.com>.
So apparently the YARN container for the Task Manager is running out of memory
during the savepoint execution. I never had any problems with checkpoints,
though. Task Manager configuration:

"taskmanager.memory.process.size": "10240m",
"taskmanager.memory.managed.fraction": "0.6",
"taskmanager.memory.jvm-overhead.fraction": "0.07",
"taskmanager.memory.jvm-metaspace.size": "192mb",
"taskmanager.network.memory.buffer-debloat.enabled": "true",

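For reference, the configuration above can be turned into a rough memory budget. The Python sketch below is an approximation, not Flink's own calculation: the four values from the thread are passed in explicitly, and every other parameter is an assumed Flink 1.14 default that is not stated anywhere in this thread and should be checked against the documentation for your version.

```python
# Rough TaskManager memory budget for the configuration quoted above.
# All values are in MB. Parameters marked "assumed default" do not appear in
# the thread; they are assumptions based on the Flink 1.14 memory model.

def clamp(value, low, high):
    """Limit value to the inclusive range [low, high]."""
    return max(low, min(value, high))


def taskmanager_budget(
    process_mb,
    managed_fraction,
    overhead_fraction,
    metaspace_mb,
    overhead_min_mb=192,        # assumed default
    overhead_max_mb=1024,       # assumed default
    network_fraction=0.1,       # assumed default
    network_min_mb=64,          # assumed default
    network_max_mb=1024,        # assumed default
    framework_heap_mb=128,      # assumed default
    framework_off_heap_mb=128,  # assumed default
    task_off_heap_mb=0,         # assumed default
):
    # JVM overhead is a fraction of the total process size, within min/max.
    overhead = clamp(process_mb * overhead_fraction, overhead_min_mb, overhead_max_mb)
    # Total Flink memory is what remains after metaspace and JVM overhead.
    total_flink = process_mb - metaspace_mb - overhead
    # Managed and network memory are fractions of total Flink memory.
    managed = total_flink * managed_fraction
    network = clamp(total_flink * network_fraction, network_min_mb, network_max_mb)
    # Task heap is whatever is left after all other components.
    task_heap = (total_flink - managed - network
                 - framework_heap_mb - framework_off_heap_mb - task_off_heap_mb)
    return {
        "jvm_overhead": overhead,
        "metaspace": metaspace_mb,
        "total_flink": total_flink,
        "managed": managed,
        "network": network,
        "task_heap": task_heap,
    }


# Values taken from the configuration in this message.
budget = taskmanager_budget(10240, 0.6, 0.07, 192)
for name, mb in budget.items():
    print(f"{name:>12}: {mb:8.1f} MB")
```

Under these assumptions, managed memory comes to about 5599 MB and JVM overhead to about 717 MB. Native allocations that go beyond those budgets still count against the 10240m container limit, which may be relevant when the container is killed during a savepoint.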
On Wed, Mar 9, 2022 at 1:33 PM Vinicius Peracini <
vinicius.peracini@zenvia.com> wrote:

> Bom dia Schwalbe!
>
> Thanks for the reply.
>
> I'm using Flink 1.14.0. EMR is a managed cluster platform to run big data
> applications on AWS. This way Flink services are running on YARN. I tried
> to create another savepoint today and was able to retrieve the Job Manager
> log:
>
> 2022-03-09 15:42:10,294 INFO  org.apache.flink.runtime.jobmaster.JobMaster
>                 [] - Triggering savepoint for job
> 6f9d71e57efba96dad7f5328ab9ac717.
> 2022-03-09 15:42:10,298 INFO
>  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] -
> Triggering checkpoint 1378 (type=SAVEPOINT) @ 1646840530294 for job
> 6f9d71e57efba96dad7f5328ab9ac717.
> 2022-03-09 15:45:19,636 WARN  akka.remote.transport.netty.NettyTransport
>                 [] - Remote connection to [/172.30.0.169:57520] failed
> with java.io.IOException: Connection reset by peer
> 2022-03-09 15:45:19,648 WARN  akka.remote.ReliableDeliverySupervisor
>                 [] - Association with remote system
> [akka.tcp://flink@ip-172-30-0-169.ec2.internal:46639] has failed, address
> is now gated for [50] ms. Reason: [Disassociated]
> 2022-03-09 15:45:19,652 WARN  akka.remote.ReliableDeliverySupervisor
>                 [] - Association with remote system
> [akka.tcp://flink-metrics@ip-172-30-0-169.ec2.internal:41533] has failed,
> address is now gated for [50] ms. Reason: [Disassociated]
> 2022-03-09 15:45:19,707 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
> LEFT_JOIN_MESSAGE_BULK -> Map (1/3) (866e32468227f9f0adac82e9b83b970a)
> switched from RUNNING to FAILED on container_1646341714746_0005_01_000004 @
> ip-172-30-0-165.ec2.internal (dataPort=40231).
> org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException:
> Connection unexpectedly closed by remote task manager
> 'ip-172-30-0-169.ec2.internal/172.30.0.169:34413'. This might indicate
> that the remote task manager was lost.
> at
> org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.channelInactive(CreditBasedPartitionRequestClientHandler.java:186)
> ~[flink-dist_2.12-1.14.0.jar:1.14.0]
> at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
> ~[flink-dist_2.12-1.14.0.jar:1.14.0]
> at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
> ~[flink-dist_2.12-1.14.0.jar:1.14.0]
> at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
> ~[flink-dist_2.12-1.14.0.jar:1.14.0]
> at
> org.apache.flink.shaded.netty4.io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)
> ~[flink-dist_2.12-1.14.0.jar:1.14.0]
> at
> org.apache.flink.runtime.io.network.netty.NettyMessageClientDecoderDelegate.channelInactive(NettyMessageClientDecoderDelegate.java:94)
> ~[flink-dist_2.12-1.14.0.jar:1.14.0]
> at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
> ~[flink-dist_2.12-1.14.0.jar:1.14.0]
> at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
> ~[flink-dist_2.12-1.14.0.jar:1.14.0]
> at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
> ~[flink-dist_2.12-1.14.0.jar:1.14.0]
> at
> org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1405)
> ~[flink-dist_2.12-1.14.0.jar:1.14.0]
> at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
> ~[flink-dist_2.12-1.14.0.jar:1.14.0]
> at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
> ~[flink-dist_2.12-1.14.0.jar:1.14.0]
> at
> org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:901)
> ~[flink-dist_2.12-1.14.0.jar:1.14.0]
> at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:831)
> ~[flink-dist_2.12-1.14.0.jar:1.14.0]
> at
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
> ~[flink-dist_2.12-1.14.0.jar:1.14.0]
> at
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
> ~[flink-dist_2.12-1.14.0.jar:1.14.0]
> at
> org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
> ~[flink-dist_2.12-1.14.0.jar:1.14.0]
> at
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
> ~[flink-dist_2.12-1.14.0.jar:1.14.0]
> at
> org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
> ~[flink-dist_2.12-1.14.0.jar:1.14.0]
> at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_322]
> 2022-03-09 15:45:19,720 INFO
>  org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy
> [] - Calculating tasks to restart to recover the failed task
> 5d739cfcb34ba91e39db0d6db0a4f1a2_0.
> 2022-03-09 15:45:19,721 INFO
>  org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy
> [] - 18 tasks should be restarted to recover the failed task
> 5d739cfcb34ba91e39db0d6db0a4f1a2_0.
> 2022-03-09 15:45:19,723 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job
> TOOL_MESSAGE_STREAM (6f9d71e57efba96dad7f5328ab9ac717) switched from state
> RUNNING to RESTARTING.
> 2022-03-09 15:45:19,728 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
> LEFT_JOIN_MESSAGE_BULK -> Map (3/3) (df8df89abf1761a726dd4593387cbd76)
> switched from RUNNING to CANCELING.
> 2022-03-09 15:45:19,732 INFO
>  org.apache.flink.runtime.source.coordinator.SourceCoordinator [] -
> Removing registered reader after failure for subtask 1 of source Source:
> BULK_SENDER_EVENT -> Filter -> Map.
> 2022-03-09 15:45:19,732 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
> BULK_SENDER_EVENT -> Filter -> Map (2/3) (2ba29adbeeceb86a69536082fbfb4931)
> switched from RUNNING to CANCELING.
> 2022-03-09 15:45:19,732 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
> LEFT_JOIN_MESSAGE_BULK -> Map (2/3) (3674e936354727457b06293592b65f20)
> switched from RUNNING to CANCELING.
> 2022-03-09 15:45:19,733 INFO
>  org.apache.flink.runtime.source.coordinator.SourceCoordinator [] -
> Removing registered reader after failure for subtask 0 of source Source:
> BULK_SENDER_EVENT -> Filter -> Map.
> 2022-03-09 15:45:19,733 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
> BULK_SENDER_EVENT -> Filter -> Map (1/3) (117be79102d086ddb65fb0b31c245f49)
> switched from RUNNING to CANCELING.
> 2022-03-09 15:45:19,733 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
> LEFT_JOIN_MESSAGE_FLOW -> Map (3/3) (3add641adeebe4e14dd0111bd647aa75)
> switched from RUNNING to CANCELING.
> 2022-03-09 15:45:19,733 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
> LEFT_JOIN_MESSAGE_FLOW -> Map (2/3) (f5f4d0d49c1d2c2c65685ccb6c35eab4)
> switched from RUNNING to CANCELING.
> 2022-03-09 15:45:19,733 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
> DEDUP_TOOL_MESSAGE_EVENT -> Sink SINK_TOOL_MESSAGE_EVENT (3/3)
> (07a1c10037596788aee6844603ab17a2) switched from RUNNING to CANCELING.
> 2022-03-09 15:45:19,733 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
> LEFT_JOIN_MESSAGE_FLOW -> Map (1/3) (3caaf3a8072b29567d94691d1765a294)
> switched from RUNNING to CANCELING.
> 2022-03-09 15:45:19,733 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
> DEDUP_TOOL_MESSAGE_EVENT -> Sink SINK_TOOL_MESSAGE_EVENT (2/3)
> (216e7b0a7d0f8c887a5c91a3e5267c73) switched from RUNNING to CANCELING.
> 2022-03-09 15:45:19,733 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
> DEDUP_TOOL_MESSAGE_EVENT -> Sink SINK_TOOL_MESSAGE_EVENT (1/3)
> (2e3973007d7c3d71f10456c3aca5a0b3) switched from RUNNING to CANCELING.
> 2022-03-09 15:45:19,733 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
> MESSAGE_EVENT -> Filter -> Map -> (Filter -> Map, Filter -> (Filter,
> Filter)) (3/3) (bcebf365a3e6da0281d80b3ab0e2cff8) switched from RUNNING to
> CANCELING.
> 2022-03-09 15:45:19,733 INFO
>  org.apache.flink.runtime.source.coordinator.SourceCoordinator [] -
> Removing registered reader after failure for subtask 2 of source Source:
> MESSAGE_EVENT -> Filter -> Map -> (Filter -> Map, Filter -> (Filter,
> Filter)).
> 2022-03-09 15:45:19,734 INFO
>  org.apache.flink.runtime.source.coordinator.SourceCoordinator [] -
> Removing registered reader after failure for subtask 1 of source Source:
> MESSAGE_EVENT -> Filter -> Map -> (Filter -> Map, Filter -> (Filter,
> Filter)).
> 2022-03-09 15:45:19,734 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
> MESSAGE_EVENT -> Filter -> Map -> (Filter -> Map, Filter -> (Filter,
> Filter)) (2/3) (8d912317f46bf11f866bd70be09377ef) switched from RUNNING to
> CANCELING.
> 2022-03-09 15:45:19,734 INFO
>  org.apache.flink.runtime.source.coordinator.SourceCoordinator [] -
> Removing registered reader after failure for subtask 0 of source Source:
> MESSAGE_EVENT -> Filter -> Map -> (Filter -> Map, Filter -> (Filter,
> Filter)).
> 2022-03-09 15:45:19,734 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
> MESSAGE_EVENT -> Filter -> Map -> (Filter -> Map, Filter -> (Filter,
> Filter)) (1/3) (c4a8b6f2d0b7ea837dd4e69a107e652e) switched from RUNNING to
> CANCELING.
> 2022-03-09 15:45:19,735 INFO
>  org.apache.flink.runtime.source.coordinator.SourceCoordinator [] -
> Removing registered reader after failure for subtask 2 of source Source:
> FLOW_EVENT -> Filter -> Map.
> 2022-03-09 15:45:19,735 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
> FLOW_EVENT -> Filter -> Map (3/3) (b811833b1ce35416fae61aba7cdbeb53)
> switched from RUNNING to CANCELING.
> 2022-03-09 15:45:19,735 INFO
>  org.apache.flink.runtime.source.coordinator.SourceCoordinator [] -
> Removing registered reader after failure for subtask 1 of source Source:
> FLOW_EVENT -> Filter -> Map.
> 2022-03-09 15:45:19,735 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
> FLOW_EVENT -> Filter -> Map (2/3) (c5048ba45046caa3b954257602b4f0a4)
> switched from RUNNING to CANCELING.
> 2022-03-09 15:45:19,735 INFO
>  org.apache.flink.runtime.source.coordinator.SourceCoordinator [] -
> Removing registered reader after failure for subtask 0 of source Source:
> FLOW_EVENT -> Filter -> Map.
> 2022-03-09 15:45:19,735 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
> FLOW_EVENT -> Filter -> Map (1/3) (4693404a84c87a07b1f35ed62f9edad7)
> switched from RUNNING to CANCELING.
> 2022-03-09 15:45:19,735 INFO
>  org.apache.flink.runtime.source.coordinator.SourceCoordinator [] -
> Removing registered reader after failure for subtask 2 of source Source:
> BULK_SENDER_EVENT -> Filter -> Map.
> 2022-03-09 15:45:19,735 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
> BULK_SENDER_EVENT -> Filter -> Map (3/3) (d4421c51bf8fb716c672e13fd249450f)
> switched from RUNNING to CANCELING.
> 2022-03-09 15:45:19,744 WARN  akka.remote.transport.netty.NettyTransport
>                 [] - Remote connection to [null] failed with
> java.net.ConnectException: Connection refused: ip-172-30-0-169.ec2.internal/
> 172.30.0.169:46639
> 2022-03-09 15:45:19,746 WARN  akka.remote.ReliableDeliverySupervisor
>                 [] - Association with remote system
> [akka.tcp://flink@ip-172-30-0-169.ec2.internal:46639] has failed, address
> is now gated for [50] ms. Reason: [Association failed with
> [akka.tcp://flink@ip-172-30-0-169.ec2.internal:46639]] Caused by:
> [java.net.ConnectException: Connection refused:
> ip-172-30-0-169.ec2.internal/172.30.0.169:46639]
> 2022-03-09 15:45:19,751 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
> LEFT_JOIN_MESSAGE_BULK -> Map (3/3) (df8df89abf1761a726dd4593387cbd76)
> switched from CANCELING to CANCELED.
> 2022-03-09 15:45:19,751 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
> Discarding the results produced by task execution
> df8df89abf1761a726dd4593387cbd76.
> 2022-03-09 15:45:19,753 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
> Discarding the results produced by task execution
> df8df89abf1761a726dd4593387cbd76.
> 2022-03-09 15:45:19,754 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
> LEFT_JOIN_MESSAGE_FLOW -> Map (3/3) (3add641adeebe4e14dd0111bd647aa75)
> switched from CANCELING to CANCELED.
> 2022-03-09 15:45:19,755 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
> Discarding the results produced by task execution
> 3add641adeebe4e14dd0111bd647aa75.
> 2022-03-09 15:45:19,755 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
> Discarding the results produced by task execution
> 3add641adeebe4e14dd0111bd647aa75.
> 2022-03-09 15:45:19,755 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
> DEDUP_TOOL_MESSAGE_EVENT -> Sink SINK_TOOL_MESSAGE_EVENT (3/3)
> (07a1c10037596788aee6844603ab17a2) switched from CANCELING to CANCELED.
> 2022-03-09 15:45:19,755 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
> Discarding the results produced by task execution
> 07a1c10037596788aee6844603ab17a2.
> 2022-03-09 15:45:19,755 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
> MESSAGE_EVENT -> Filter -> Map -> (Filter -> Map, Filter -> (Filter,
> Filter)) (3/3) (bcebf365a3e6da0281d80b3ab0e2cff8) switched from CANCELING
> to CANCELED.
> 2022-03-09 15:45:19,755 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
> Discarding the results produced by task execution
> bcebf365a3e6da0281d80b3ab0e2cff8.
> 2022-03-09 15:45:19,755 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
> Discarding the results produced by task execution
> bcebf365a3e6da0281d80b3ab0e2cff8.
> 2022-03-09 15:45:19,755 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
> FLOW_EVENT -> Filter -> Map (3/3) (b811833b1ce35416fae61aba7cdbeb53)
> switched from CANCELING to CANCELED.
> 2022-03-09 15:45:19,755 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
> Discarding the results produced by task execution
> b811833b1ce35416fae61aba7cdbeb53.
> 2022-03-09 15:45:19,755 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
> Discarding the results produced by task execution
> b811833b1ce35416fae61aba7cdbeb53.
> 2022-03-09 15:45:19,756 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
> BULK_SENDER_EVENT -> Filter -> Map (3/3) (d4421c51bf8fb716c672e13fd249450f)
> switched from CANCELING to CANCELED.
> 2022-03-09 15:45:19,757 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
> Discarding the results produced by task execution
> d4421c51bf8fb716c672e13fd249450f.
> 2022-03-09 15:45:19,757 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
> Discarding the results produced by task execution
> d4421c51bf8fb716c672e13fd249450f.
> 2022-03-09 15:45:19,757 INFO
>  org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager
> [] - Received resource requirements from job
> 6f9d71e57efba96dad7f5328ab9ac717:
> [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN},
> numberOfRequiredSlots=2}]
> 2022-03-09 15:45:19,790 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
> BULK_SENDER_EVENT -> Filter -> Map (1/3) (117be79102d086ddb65fb0b31c245f49)
> switched from CANCELING to CANCELED.
> 2022-03-09 15:45:19,797 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
> FLOW_EVENT -> Filter -> Map (1/3) (4693404a84c87a07b1f35ed62f9edad7)
> switched from CANCELING to CANCELED.
> 2022-03-09 15:45:19,801 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
> MESSAGE_EVENT -> Filter -> Map -> (Filter -> Map, Filter -> (Filter,
> Filter)) (1/3) (c4a8b6f2d0b7ea837dd4e69a107e652e) switched from CANCELING
> to CANCELED.
> 2022-03-09 15:45:20,026 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
> DEDUP_TOOL_MESSAGE_EVENT -> Sink SINK_TOOL_MESSAGE_EVENT (1/3)
> (2e3973007d7c3d71f10456c3aca5a0b3) switched from CANCELING to CANCELED.
> 2022-03-09 15:45:23,957 INFO
>  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] -
> Worker container_1646341714746_0005_01_000003 is terminated. Diagnostics:
> Container container_1646341714746_0005_01_000003 marked as failed.
>  Exit code:137.
>  Diagnostics:[2022-03-09 15:45:19.639]Container killed on request. Exit
> code is 137
> [2022-03-09 15:45:19.641]Container exited with a non-zero exit code 137.
> [2022-03-09 15:45:19.642]Killed by external signal
>
> 2022-03-09 15:45:23,957 INFO
>  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] -
> Closing TaskExecutor connection container_1646341714746_0005_01_000003
> because: Container container_1646341714746_0005_01_000003 marked as failed.
>  Exit code:137.
>  Diagnostics:[2022-03-09 15:45:19.639]Container killed on request. Exit
> code is 137
> [2022-03-09 15:45:19.641]Container exited with a non-zero exit code 137.
> [2022-03-09 15:45:19.642]Killed by external signal
>
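A note on the "Exit code 137" in the diagnostics above: by POSIX shell convention, an exit status above 128 means the process was terminated by signal (status - 128), so 137 corresponds to signal 9 (SIGKILL). On YARN this is frequently, though not always, a container killed for exceeding its memory limit; the log excerpt alone does not confirm the cause. A minimal Python sketch of the decoding (the helper name describe_exit_code is hypothetical, not part of Flink or YARN):

```python
import signal


def describe_exit_code(code: int) -> str:
    """Interpret a POSIX-style exit status as reported for a container."""
    if code > 128:
        # By shell convention, 128 + N means "terminated by signal N".
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not defined on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    if code == 0:
        return "exited normally"
    return f"exited with error status {code}"


if __name__ == "__main__":
    # 137 is the exit code reported for container_1646341714746_0005_01_000003.
    print(describe_exit_code(137))
```

If the kill really was memory-related, the YARN NodeManager log for that container would be the place to confirm it.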
> Thanks,
>
> On Tue, Mar 8, 2022 at 4:57 AM Schwalbe Matthias <
> Matthias.Schwalbe@viseca.ch> wrote:
>
>> Good morning Vinicius,
>>
>>
>>
>> Can you still find (and post) the exception stack trace from your
>> JobManager log? The Flink client log does not reveal enough information.
>>
>> Your situation reminds me of a similar problem I had.
>>
>> In the log you might find something like this:
>>
>>
>>
>> 2022-03-07 02:15:41,347 INFO
>> org.apache.flink.runtime.jobmaster.JobMaster                 [] -
>> Triggering stop-with-savepoint for job e12f22653f79194863ab426312dd666a.
>>
>> 2022-03-07 02:15:41,380 INFO
>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering
>> checkpoint 4983974 (type=SAVEPOINT_SUSPEND) @ 1646615741347 for job
>> e12f22653f79194863ab426312dd666a.
>>
>> 2022-03-07 02:15:43,042 INFO
>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Decline
>> checkpoint 4983974 by task 0e659ac720e3e0b3e4072dbc1cc85cd3 of job
>> e12f22653f79194863ab426312dd666a at
>> container_e1093_1646358077201_0002_01_000001 @ ulxxphaddtn02.adgr.net
>> (dataPort=44767).
>>
>> org.apache.flink.util.SerializedThrowable: Asynchronous task checkpoint
>> failed.
>>
>>             at
>> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.handleExecutionException(AsyncCheckpointRunnable.java:279)
>> ~[flink-dist_2.11-1.13.0.jar:1.13.0]
>>
>>
>>
>> BTW, which Flink version are you running?
>>
>> What is EMR (what technology is underneath)?
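To locate such entries in a long JobManager log, a small filter can help. The following is only a sketch in Python: it assumes the "Triggering checkpoint" and "Decline checkpoint" message wording shown in the excerpt above (other Flink versions may word these differently), and the function name checkpoint_events is hypothetical. The sample lines are the ones quoted above, re-joined onto single lines.

```python
import re

# Lines from the excerpt quoted above, re-joined onto single lines.
SAMPLE = """\
2022-03-07 02:15:41,347 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Triggering stop-with-savepoint for job e12f22653f79194863ab426312dd666a.
2022-03-07 02:15:41,380 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 4983974 (type=SAVEPOINT_SUSPEND) @ 1646615741347 for job e12f22653f79194863ab426312dd666a.
2022-03-07 02:15:43,042 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Decline checkpoint 4983974 by task 0e659ac720e3e0b3e4072dbc1cc85cd3 of job e12f22653f79194863ab426312dd666a at container_e1093_1646358077201_0002_01_000001 @ ulxxphaddtn02.adgr.net (dataPort=44767).
"""

# Assumed message wording, taken from the excerpt; not a stable Flink API.
PATTERN = re.compile(r"(Triggering|Decline) checkpoint (\d+)")


def checkpoint_events(log_text):
    """Return (event, checkpoint_id) pairs in the order they appear."""
    events = []
    for line in log_text.splitlines():
        match = PATTERN.search(line)
        if match:
            events.append((match.group(1), int(match.group(2))))
    return events


if __name__ == "__main__":
    for event, checkpoint_id in checkpoint_events(SAMPLE):
        print(event, checkpoint_id)
```

A "Decline" entry carrying the same checkpoint id as the savepoint trigger points at the task whose stack trace is worth posting.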
>>
>>
>>
>>
>>
>>
>>
>> *From:* Vinicius Peracini <vi...@zenvia.com>
>> *Sent:* Monday, 7 March 2022 20:46
>> *To:* Dawid Wysakowicz <dw...@apache.org>
>> *Cc:* user@flink.apache.org
>> *Subject:* Re: Could not stop job with a savepoint
>>
>>
>>
>> Hi Dawid, thanks for the reply.
>>
>>
>>
>> The job was still in progress and producing events. Unfortunately, I was
>> not able to stop the job with a savepoint, or even to create a savepoint
>> on its own. I had to stop the job without a savepoint and restore the
>> state from the last checkpoint. I'm still reviewing my configuration and
>> trying to figure out why this is happening. Any help would be appreciated.
>>
>>
>>
>> Thanks!
>>
>>
>>
>>
>>
>> On Mon, Mar 7, 2022 at 11:56 AM Dawid Wysakowicz <dw...@apache.org>
>> wrote:
>>
>> Hi,
>>
>> From the exception, it seems the job had already finished when you
>> triggered the savepoint.
>>
>> Best,
>>
>> Dawid
>>
>> On 07/03/2022 14:56, Vinicius Peracini wrote:
>>
>> Hello everyone,
>>
>>
>>
>> I have a Flink job (version 1.14.0 running on EMR) and I'm having this
>> issue while trying to stop a job with a savepoint on S3:
>>
>>
>>
>> org.apache.flink.util.FlinkException: Could not stop with a savepoint job
>> "df3a3c590fabac737a17f1160c21094c".
>> at
>> org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:581)
>> at
>> org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:1002)
>> at org.apache.flink.client.cli.CliFrontend.stop(CliFrontend.java:569)
>> at
>> org.apache.flink.client.cli.CliFrontend.parseAndRun(CliFrontend.java:1069)
>> at
>> org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1132)
>> at java.security.AccessController.doPrivileged(Native Method)
>> at javax.security.auth.Subject.doAs(Subject.java:422)
>> at
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>> at
>> org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>> at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1132)
>> Caused by: java.util.concurrent.ExecutionException:
>> java.util.concurrent.CompletionException:
>> org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint
>> Coordinator is suspending.
>> at
>> java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
>> at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
>> at
>> org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:579)
>> ... 9 more
>>
>>
>>
>> I'm using incremental and unaligned checkpoints (the aligned checkpoint
>> timeout is 30 seconds). I also tried to create the savepoint without
>> stopping the job (using the flink savepoint command) and got the same
>> error. Any idea what is happening here?
>>
>>
>>
>> Thanks in advance,
>>
>>
>>
>



Re: Could not stop job with a savepoint

Posted by Vinicius Peracini <vi...@zenvia.com>.
Good morning Schwalbe!

Thanks for the reply.

I'm using Flink 1.14.0. EMR is a managed cluster platform for running big
data applications on AWS; on EMR, the Flink services run on YARN. I tried to
create another savepoint today and was able to retrieve the JobManager
log:

2022-03-09 15:42:10,294 INFO  org.apache.flink.runtime.jobmaster.JobMaster
                [] - Triggering savepoint for job
6f9d71e57efba96dad7f5328ab9ac717.
2022-03-09 15:42:10,298 INFO
 org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] -
Triggering checkpoint 1378 (type=SAVEPOINT) @ 1646840530294 for job
6f9d71e57efba96dad7f5328ab9ac717.
2022-03-09 15:45:19,636 WARN  akka.remote.transport.netty.NettyTransport
                [] - Remote connection to [/172.30.0.169:57520] failed with
java.io.IOException: Connection reset by peer
2022-03-09 15:45:19,648 WARN  akka.remote.ReliableDeliverySupervisor
                [] - Association with remote system
[akka.tcp://flink@ip-172-30-0-169.ec2.internal:46639] has failed, address
is now gated for [50] ms. Reason: [Disassociated]
2022-03-09 15:45:19,652 WARN  akka.remote.ReliableDeliverySupervisor
                [] - Association with remote system
[akka.tcp://flink-metrics@ip-172-30-0-169.ec2.internal:41533] has failed,
address is now gated for [50] ms. Reason: [Disassociated]
2022-03-09 15:45:19,707 INFO
 org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
LEFT_JOIN_MESSAGE_BULK -> Map (1/3) (866e32468227f9f0adac82e9b83b970a)
switched from RUNNING to FAILED on container_1646341714746_0005_01_000004 @
ip-172-30-0-165.ec2.internal (dataPort=40231).
org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException:
Connection unexpectedly closed by remote task manager
'ip-172-30-0-169.ec2.internal/172.30.0.169:34413'. This might indicate that
the remote task manager was lost.
at
org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.channelInactive(CreditBasedPartitionRequestClientHandler.java:186)
~[flink-dist_2.12-1.14.0.jar:1.14.0]
at
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
~[flink-dist_2.12-1.14.0.jar:1.14.0]
at
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
~[flink-dist_2.12-1.14.0.jar:1.14.0]
at
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
~[flink-dist_2.12-1.14.0.jar:1.14.0]
at
org.apache.flink.shaded.netty4.io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)
~[flink-dist_2.12-1.14.0.jar:1.14.0]
at
org.apache.flink.runtime.io.network.netty.NettyMessageClientDecoderDelegate.channelInactive(NettyMessageClientDecoderDelegate.java:94)
~[flink-dist_2.12-1.14.0.jar:1.14.0]
at
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
~[flink-dist_2.12-1.14.0.jar:1.14.0]
at
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
~[flink-dist_2.12-1.14.0.jar:1.14.0]
at
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
~[flink-dist_2.12-1.14.0.jar:1.14.0]
at
org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1405)
~[flink-dist_2.12-1.14.0.jar:1.14.0]
at
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
~[flink-dist_2.12-1.14.0.jar:1.14.0]
at
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
~[flink-dist_2.12-1.14.0.jar:1.14.0]
at
org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:901)
~[flink-dist_2.12-1.14.0.jar:1.14.0]
at
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:831)
~[flink-dist_2.12-1.14.0.jar:1.14.0]
at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
~[flink-dist_2.12-1.14.0.jar:1.14.0]
at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
~[flink-dist_2.12-1.14.0.jar:1.14.0]
at
org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
~[flink-dist_2.12-1.14.0.jar:1.14.0]
at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
~[flink-dist_2.12-1.14.0.jar:1.14.0]
at
org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
~[flink-dist_2.12-1.14.0.jar:1.14.0]
at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_322]
2022-03-09 15:45:19,720 INFO
 org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy
[] - Calculating tasks to restart to recover the failed task
5d739cfcb34ba91e39db0d6db0a4f1a2_0.
2022-03-09 15:45:19,721 INFO
 org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy
[] - 18 tasks should be restarted to recover the failed task
5d739cfcb34ba91e39db0d6db0a4f1a2_0.
2022-03-09 15:45:19,723 INFO
 org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job
TOOL_MESSAGE_STREAM (6f9d71e57efba96dad7f5328ab9ac717) switched from state
RUNNING to RESTARTING.
2022-03-09 15:45:19,728 INFO
 org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
LEFT_JOIN_MESSAGE_BULK -> Map (3/3) (df8df89abf1761a726dd4593387cbd76)
switched from RUNNING to CANCELING.
2022-03-09 15:45:19,732 INFO
 org.apache.flink.runtime.source.coordinator.SourceCoordinator [] -
Removing registered reader after failure for subtask 1 of source Source:
BULK_SENDER_EVENT -> Filter -> Map.
2022-03-09 15:45:19,732 INFO
 org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
BULK_SENDER_EVENT -> Filter -> Map (2/3) (2ba29adbeeceb86a69536082fbfb4931)
switched from RUNNING to CANCELING.
2022-03-09 15:45:19,732 INFO
 org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
LEFT_JOIN_MESSAGE_BULK -> Map (2/3) (3674e936354727457b06293592b65f20)
switched from RUNNING to CANCELING.
2022-03-09 15:45:19,733 INFO
 org.apache.flink.runtime.source.coordinator.SourceCoordinator [] -
Removing registered reader after failure for subtask 0 of source Source:
BULK_SENDER_EVENT -> Filter -> Map.
2022-03-09 15:45:19,733 INFO
 org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
BULK_SENDER_EVENT -> Filter -> Map (1/3) (117be79102d086ddb65fb0b31c245f49)
switched from RUNNING to CANCELING.
2022-03-09 15:45:19,733 INFO
 org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
LEFT_JOIN_MESSAGE_FLOW -> Map (3/3) (3add641adeebe4e14dd0111bd647aa75)
switched from RUNNING to CANCELING.
2022-03-09 15:45:19,733 INFO
 org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
LEFT_JOIN_MESSAGE_FLOW -> Map (2/3) (f5f4d0d49c1d2c2c65685ccb6c35eab4)
switched from RUNNING to CANCELING.
2022-03-09 15:45:19,733 INFO
 org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
DEDUP_TOOL_MESSAGE_EVENT -> Sink SINK_TOOL_MESSAGE_EVENT (3/3)
(07a1c10037596788aee6844603ab17a2) switched from RUNNING to CANCELING.
2022-03-09 15:45:19,733 INFO
 org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
LEFT_JOIN_MESSAGE_FLOW -> Map (1/3) (3caaf3a8072b29567d94691d1765a294)
switched from RUNNING to CANCELING.
2022-03-09 15:45:19,733 INFO
 org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
DEDUP_TOOL_MESSAGE_EVENT -> Sink SINK_TOOL_MESSAGE_EVENT (2/3)
(216e7b0a7d0f8c887a5c91a3e5267c73) switched from RUNNING to CANCELING.
2022-03-09 15:45:19,733 INFO
 org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
DEDUP_TOOL_MESSAGE_EVENT -> Sink SINK_TOOL_MESSAGE_EVENT (1/3)
(2e3973007d7c3d71f10456c3aca5a0b3) switched from RUNNING to CANCELING.
2022-03-09 15:45:19,733 INFO
 org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
MESSAGE_EVENT -> Filter -> Map -> (Filter -> Map, Filter -> (Filter,
Filter)) (3/3) (bcebf365a3e6da0281d80b3ab0e2cff8) switched from RUNNING to
CANCELING.
2022-03-09 15:45:19,733 INFO
 org.apache.flink.runtime.source.coordinator.SourceCoordinator [] -
Removing registered reader after failure for subtask 2 of source Source:
MESSAGE_EVENT -> Filter -> Map -> (Filter -> Map, Filter -> (Filter,
Filter)).
2022-03-09 15:45:19,734 INFO
 org.apache.flink.runtime.source.coordinator.SourceCoordinator [] -
Removing registered reader after failure for subtask 1 of source Source:
MESSAGE_EVENT -> Filter -> Map -> (Filter -> Map, Filter -> (Filter,
Filter)).
2022-03-09 15:45:19,734 INFO
 org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
MESSAGE_EVENT -> Filter -> Map -> (Filter -> Map, Filter -> (Filter,
Filter)) (2/3) (8d912317f46bf11f866bd70be09377ef) switched from RUNNING to
CANCELING.
2022-03-09 15:45:19,734 INFO
 org.apache.flink.runtime.source.coordinator.SourceCoordinator [] -
Removing registered reader after failure for subtask 0 of source Source:
MESSAGE_EVENT -> Filter -> Map -> (Filter -> Map, Filter -> (Filter,
Filter)).
2022-03-09 15:45:19,734 INFO
 org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
MESSAGE_EVENT -> Filter -> Map -> (Filter -> Map, Filter -> (Filter,
Filter)) (1/3) (c4a8b6f2d0b7ea837dd4e69a107e652e) switched from RUNNING to
CANCELING.
2022-03-09 15:45:19,735 INFO
 org.apache.flink.runtime.source.coordinator.SourceCoordinator [] -
Removing registered reader after failure for subtask 2 of source Source:
FLOW_EVENT -> Filter -> Map.
2022-03-09 15:45:19,735 INFO
 org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
FLOW_EVENT -> Filter -> Map (3/3) (b811833b1ce35416fae61aba7cdbeb53)
switched from RUNNING to CANCELING.
2022-03-09 15:45:19,735 INFO
 org.apache.flink.runtime.source.coordinator.SourceCoordinator [] -
Removing registered reader after failure for subtask 1 of source Source:
FLOW_EVENT -> Filter -> Map.
2022-03-09 15:45:19,735 INFO
 org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
FLOW_EVENT -> Filter -> Map (2/3) (c5048ba45046caa3b954257602b4f0a4)
switched from RUNNING to CANCELING.
2022-03-09 15:45:19,735 INFO
 org.apache.flink.runtime.source.coordinator.SourceCoordinator [] -
Removing registered reader after failure for subtask 0 of source Source:
FLOW_EVENT -> Filter -> Map.
2022-03-09 15:45:19,735 INFO
 org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
FLOW_EVENT -> Filter -> Map (1/3) (4693404a84c87a07b1f35ed62f9edad7)
switched from RUNNING to CANCELING.
2022-03-09 15:45:19,735 INFO
 org.apache.flink.runtime.source.coordinator.SourceCoordinator [] -
Removing registered reader after failure for subtask 2 of source Source:
BULK_SENDER_EVENT -> Filter -> Map.
2022-03-09 15:45:19,735 INFO
 org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
BULK_SENDER_EVENT -> Filter -> Map (3/3) (d4421c51bf8fb716c672e13fd249450f)
switched from RUNNING to CANCELING.
2022-03-09 15:45:19,744 WARN  akka.remote.transport.netty.NettyTransport
                [] - Remote connection to [null] failed with
java.net.ConnectException: Connection refused: ip-172-30-0-169.ec2.internal/
172.30.0.169:46639
2022-03-09 15:45:19,746 WARN  akka.remote.ReliableDeliverySupervisor
                [] - Association with remote system
[akka.tcp://flink@ip-172-30-0-169.ec2.internal:46639] has failed, address
is now gated for [50] ms. Reason: [Association failed with
[akka.tcp://flink@ip-172-30-0-169.ec2.internal:46639]] Caused by:
[java.net.ConnectException: Connection refused:
ip-172-30-0-169.ec2.internal/172.30.0.169:46639]
2022-03-09 15:45:19,751 INFO
 org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
LEFT_JOIN_MESSAGE_BULK -> Map (3/3) (df8df89abf1761a726dd4593387cbd76)
switched from CANCELING to CANCELED.
2022-03-09 15:45:19,751 INFO
 org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
Discarding the results produced by task execution
df8df89abf1761a726dd4593387cbd76.
2022-03-09 15:45:19,753 INFO
 org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
Discarding the results produced by task execution
df8df89abf1761a726dd4593387cbd76.
2022-03-09 15:45:19,754 INFO
 org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
LEFT_JOIN_MESSAGE_FLOW -> Map (3/3) (3add641adeebe4e14dd0111bd647aa75)
switched from CANCELING to CANCELED.
2022-03-09 15:45:19,755 INFO
 org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
Discarding the results produced by task execution
3add641adeebe4e14dd0111bd647aa75.
2022-03-09 15:45:19,755 INFO
 org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
Discarding the results produced by task execution
3add641adeebe4e14dd0111bd647aa75.
2022-03-09 15:45:19,755 INFO
 org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
DEDUP_TOOL_MESSAGE_EVENT -> Sink SINK_TOOL_MESSAGE_EVENT (3/3)
(07a1c10037596788aee6844603ab17a2) switched from CANCELING to CANCELED.
2022-03-09 15:45:19,755 INFO
 org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
Discarding the results produced by task execution
07a1c10037596788aee6844603ab17a2.
2022-03-09 15:45:19,755 INFO
 org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
MESSAGE_EVENT -> Filter -> Map -> (Filter -> Map, Filter -> (Filter,
Filter)) (3/3) (bcebf365a3e6da0281d80b3ab0e2cff8) switched from CANCELING
to CANCELED.
2022-03-09 15:45:19,755 INFO
 org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
Discarding the results produced by task execution
bcebf365a3e6da0281d80b3ab0e2cff8.
2022-03-09 15:45:19,755 INFO
 org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
Discarding the results produced by task execution
bcebf365a3e6da0281d80b3ab0e2cff8.
2022-03-09 15:45:19,755 INFO
 org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
FLOW_EVENT -> Filter -> Map (3/3) (b811833b1ce35416fae61aba7cdbeb53)
switched from CANCELING to CANCELED.
2022-03-09 15:45:19,755 INFO
 org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
Discarding the results produced by task execution
b811833b1ce35416fae61aba7cdbeb53.
2022-03-09 15:45:19,755 INFO
 org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
Discarding the results produced by task execution
b811833b1ce35416fae61aba7cdbeb53.
2022-03-09 15:45:19,756 INFO
 org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
BULK_SENDER_EVENT -> Filter -> Map (3/3) (d4421c51bf8fb716c672e13fd249450f)
switched from CANCELING to CANCELED.
2022-03-09 15:45:19,757 INFO
 org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
Discarding the results produced by task execution
d4421c51bf8fb716c672e13fd249450f.
2022-03-09 15:45:19,757 INFO
 org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
Discarding the results produced by task execution
d4421c51bf8fb716c672e13fd249450f.
2022-03-09 15:45:19,757 INFO
 org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager
[] - Received resource requirements from job
6f9d71e57efba96dad7f5328ab9ac717:
[ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN},
numberOfRequiredSlots=2}]
2022-03-09 15:45:19,790 INFO
 org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
BULK_SENDER_EVENT -> Filter -> Map (1/3) (117be79102d086ddb65fb0b31c245f49)
switched from CANCELING to CANCELED.
2022-03-09 15:45:19,797 INFO
 org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
FLOW_EVENT -> Filter -> Map (1/3) (4693404a84c87a07b1f35ed62f9edad7)
switched from CANCELING to CANCELED.
2022-03-09 15:45:19,801 INFO
 org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
MESSAGE_EVENT -> Filter -> Map -> (Filter -> Map, Filter -> (Filter,
Filter)) (1/3) (c4a8b6f2d0b7ea837dd4e69a107e652e) switched from CANCELING
to CANCELED.
2022-03-09 15:45:20,026 INFO
 org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
DEDUP_TOOL_MESSAGE_EVENT -> Sink SINK_TOOL_MESSAGE_EVENT (1/3)
(2e3973007d7c3d71f10456c3aca5a0b3) switched from CANCELING to CANCELED.
2022-03-09 15:45:23,957 INFO
 org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] -
Worker container_1646341714746_0005_01_000003 is terminated. Diagnostics:
Container container_1646341714746_0005_01_000003 marked as failed.
 Exit code:137.
 Diagnostics:[2022-03-09 15:45:19.639]Container killed on request. Exit
code is 137
[2022-03-09 15:45:19.641]Container exited with a non-zero exit code 137.
[2022-03-09 15:45:19.642]Killed by external signal

2022-03-09 15:45:23,957 INFO
 org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] -
Closing TaskExecutor connection container_1646341714746_0005_01_000003
because: Container container_1646341714746_0005_01_000003 marked as failed.
 Exit code:137.
 Diagnostics:[2022-03-09 15:45:19.639]Container killed on request. Exit
code is 137
[2022-03-09 15:45:19.641]Container exited with a non-zero exit code 137.
[2022-03-09 15:45:19.642]Killed by external signal
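
The exit code in the diagnostics above can be decoded mechanically. By the usual shell convention, an exit status above 128 means the process was terminated by signal number (status - 128), so 137 corresponds to signal 9 (SIGKILL). On YARN this often, though not always, goes along with the container exceeding its memory limit or with the kernel OOM killer; the log above does not say which. A minimal Python sketch, assuming a Unix-like system (the helper name signal_from_exit_code is made up for illustration):

```python
import signal
from typing import Optional


def signal_from_exit_code(exit_code: int) -> Optional[str]:
    """Map a shell-style exit status (128 + N) to the name of signal N.

    Returns None when the status does not encode a known signal.
    """
    if exit_code <= 128:
        return None
    try:
        return signal.Signals(exit_code - 128).name
    except ValueError:
        # The number does not correspond to a signal on this platform.
        return None


# 137 is the exit code reported for the failed container in the log above.
print(signal_from_exit_code(137))
```

A SIGKILL cannot be caught by the JVM, so the TaskManager log itself will usually not contain a matching stack trace; the YARN NodeManager log or the kernel log is the more likely place to find the reason.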

Thanks,

On Tue, Mar 8, 2022 at 4:57 AM Schwalbe Matthias <
Matthias.Schwalbe@viseca.ch> wrote:

> Bom Dia Vinicius,
>
> Can you still find (and post) the exception stack from your jobmanager
> log? The flink client log does not reveal enough information.
>
> Your situation reminds me of something similar I had.
>
> In the log you might find something like this:
>
> 2022-03-07 02:15:41,347 INFO
> org.apache.flink.runtime.jobmaster.JobMaster                 [] -
> Triggering stop-with-savepoint for job e12f22653f79194863ab426312dd666a.
>
> 2022-03-07 02:15:41,380 INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering
> checkpoint 4983974 (type=SAVEPOINT_SUSPEND) @ 1646615741347 for job
> e12f22653f79194863ab426312dd666a.
>
> 2022-03-07 02:15:43,042 INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Decline
> checkpoint 4983974 by task 0e659ac720e3e0b3e4072dbc1cc85cd3 of job
> e12f22653f79194863ab426312dd666a at
> container_e1093_1646358077201_0002_01_000001 @ ulxxphaddtn02.adgr.net
> (dataPort=44767).
>
> org.apache.flink.util.SerializedThrowable: Asynchronous task checkpoint
> failed.
>
>             at
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.handleExecutionException(AsyncCheckpointRunnable.java:279)
> ~[flink-dist_2.11-1.13.0.jar:1.13.0]
>
> BTW what Flink version are you running?
>
> What is EMR (which technology runs underneath)?
>

-- 
Aviso Legal: Este documento pode conter informações confidenciais e/ou 
privilegiadas. Se você não for o destinatário ou a pessoa autorizada a 
receber este documento, não deve usar, copiar ou divulgar as informações 
nele contidas ou tomar qualquer ação baseada nessas informações.


Disclaimer: The information contained in this document may be privileged 
and confidential and protected from disclosure. If the reader of this 
document is not the intended recipient, or an employee agent responsible 
for delivering this document to the intended recipient, you are hereby 
notified that any dissemination, distribution or copying of this 
communication is strictly prohibited.


RE: Could not stop job with a savepoint

Posted by Schwalbe Matthias <Ma...@viseca.ch>.
Bom Dia Vinicius,

Can you still find (and post) the exception stack from your jobmanager log? The flink client log does not reveal enough information.
Your situation reminds me of something similar I had.
In the log you might find something like this:

2022-03-07 02:15:41,347 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Triggering stop-with-savepoint for job e12f22653f79194863ab426312dd666a.
2022-03-07 02:15:41,380 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering checkpoint 4983974 (type=SAVEPOINT_SUSPEND) @ 1646615741347 for job e12f22653f79194863ab426312dd666a.
2022-03-07 02:15:43,042 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Decline checkpoint 4983974 by task 0e659ac720e3e0b3e4072dbc1cc85cd3 of job e12f22653f79194863ab426312dd666a at container_e1093_1646358077201_0002_01_000001 @ ulxxphaddtn02.adgr.net (dataPort=44767).
org.apache.flink.util.SerializedThrowable: Asynchronous task checkpoint failed.
            at org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.handleExecutionException(AsyncCheckpointRunnable.java:279) ~[flink-dist_2.11-1.13.0.jar:1.13.0]

BTW what Flink version are you running?
What is EMR (which technology runs underneath)?
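
The search for such entries in a large jobmanager log can be sketched in Python. This is only an illustration: the patterns are taken from the excerpt above and from the client-side exception, the exact message wording may differ between Flink versions, and the helper name savepoint_lines as well as the "unrelated" filler line are made up for the example.

```python
import re
from typing import Iterable, List

# Message fragments seen in the excerpt above; wording may vary by version.
PATTERNS = [
    re.compile(r"Triggering stop-with-savepoint"),
    re.compile(r"Triggering checkpoint \d+ \(type=SAVEPOINT"),
    re.compile(r"Decline checkpoint \d+"),
    re.compile(r"CheckpointException"),
]


def savepoint_lines(lines: Iterable[str]) -> List[str]:
    """Return the log lines that match any savepoint-related pattern."""
    return [
        line.rstrip("\n")
        for line in lines
        if any(p.search(line) for p in PATTERNS)
    ]


# Abbreviated sample lines; the third one is invented filler.
sample = [
    "2022-03-07 02:15:41,347 INFO  JobMaster [] - Triggering stop-with-savepoint for job e12f22653f79194863ab426312dd666a.",
    "2022-03-07 02:15:41,380 INFO  CheckpointCoordinator [] - Triggering checkpoint 4983974 (type=SAVEPOINT_SUSPEND) @ 1646615741347 for job e12f22653f79194863ab426312dd666a.",
    "2022-03-07 02:15:42,000 INFO  SomeOtherLogger [] - unrelated line",
    "2022-03-07 02:15:43,042 INFO  CheckpointCoordinator [] - Decline checkpoint 4983974 by task 0e659ac720e3e0b3e4072dbc1cc85cd3 of job e12f22653f79194863ab426312dd666a.",
]

for hit in savepoint_lines(sample):
    print(hit)
```

Against a real file, the same function can be fed an open file handle (for example open("jobmanager.log"), where the path is a placeholder); the lines following a "Decline checkpoint" entry are where the underlying exception would be expected.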



Diese Nachricht ist ausschliesslich für den Adressaten bestimmt und beinhaltet unter Umständen vertrauliche Mitteilungen. Da die Vertraulichkeit von e-Mail-Nachrichten nicht gewährleistet werden kann, übernehmen wir keine Haftung für die Gewährung der Vertraulichkeit und Unversehrtheit dieser Mitteilung. Bei irrtümlicher Zustellung bitten wir Sie um Benachrichtigung per e-Mail und um Löschung dieser Nachricht sowie eventueller Anhänge. Jegliche unberechtigte Verwendung oder Verbreitung dieser Informationen ist streng verboten.

This message is intended only for the named recipient and may contain confidential or privileged information. As the confidentiality of email communication cannot be guaranteed, we do not accept any responsibility for the confidentiality and the intactness of this message. If you have received it in error, please advise the sender by return e-mail and delete this message and any attachments. Any unauthorised use or dissemination of this information is strictly prohibited.

Re: Could not stop job with a savepoint

Posted by Vinicius Peracini <vi...@zenvia.com>.
Hi Dawid, thanks for the reply.

The job was still in progress and producing events. Unfortunately I was not
able to stop the job with a savepoint or to just create a savepoint. I had
to stop the job without the savepoint and restore the state using the last
checkpoint. Still reviewing my configuration and trying to figure out why
this is happening. Any help would be appreciated.

Thanks!


On Mon, Mar 7, 2022 at 11:56 AM Dawid Wysakowicz <dw...@apache.org>
wrote:

> Hi,
>
> From the exception, it seems the job had already finished by the time you
> triggered the savepoint.
>
> Best,
>
> Dawid
>



Re: Could not stop job with a savepoint

Posted by Dawid Wysakowicz <dw...@apache.org>.
Hi,

From the exception, it seems the job had already finished by the time you
triggered the savepoint.

Best,

Dawid

On 07/03/2022 14:56, Vinicius Peracini wrote:
> Hello everyone,
>
> I have a Flink job (version 1.14.0 running on EMR) and I'm having this 
> issue while trying to stop a job with a savepoint on S3:
>
> org.apache.flink.util.FlinkException: Could not stop with a savepoint job "df3a3c590fabac737a17f1160c21094c".
> at org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:581)
> at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:1002)
> at org.apache.flink.client.cli.CliFrontend.stop(CliFrontend.java:569)
> at org.apache.flink.client.cli.CliFrontend.parseAndRun(CliFrontend.java:1069)
> at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1132)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
> at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
> at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1132)
> Caused by: java.util.concurrent.ExecutionException: java.util.concurrent.CompletionException: org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint Coordinator is suspending.
> at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
> at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
> at org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:579)
> ... 9 more
>
> I'm using incremental and unaligned checkpoints (aligned checkpoint 
> timeout is 30 seconds). I also tried to create the savepoint without 
> stopping the job (using flink savepoint command) and got the same 
> error. Any idea what is happening here?
>
> Thanks in advance,
>