Posted to issues@flink.apache.org by "ouyangzhe (JIRA)" <ji...@apache.org> on 2018/11/12 09:52:00 UTC
[jira] [Created] (FLINK-10850) Job may hang on FAILING state if taskmanager updateTaskExecutionState failed
ouyangzhe created FLINK-10850:
---------------------------------
Summary: Job may hang on FAILING state if taskmanager updateTaskExecutionState failed
Key: FLINK-10850
URL: https://issues.apache.org/jira/browse/FLINK-10850
Project: Flink
Issue Type: Bug
Components: JobManager
Affects Versions: 1.5.5
Reporter: ouyangzhe
Fix For: 1.8.0
I encountered a job that ran out of memory (OOM) but hung in the FAILING state. Three slots were left unreleased, and the corresponding tasks are still in state CANCELING.
I found the following log in the taskmanager. It seems that the taskmanager tried to call updateTaskExecutionState for the transition from CANCELING to CANCELED, but hit an OutOfMemoryError.
{panel}
2018-11-08 18:01:23,250 INFO org.apache.flink.runtime.taskmanager.Task - PartialSolution (BulkIteration (Bulk Iteration)) (97/600) (46005ba837efc4ebf783fc92121e55a8) switched from RUNNING to CANCELING.
2018-11-08 18:01:23,257 INFO org.apache.flink.runtime.taskmanager.Task - Triggering cancellation of task code PartialSolution (BulkIteration (Bulk Iteration)) (97/600) (46005ba837efc4ebf783fc92121e55a8).
2018-11-08 18:01:44,081 INFO org.apache.flink.runtime.taskmanager.Task - PartialSolution (BulkIteration (Bulk Iteration)) (97/600) (46005ba837efc4ebf783fc92121e55a8) switched from CANCELING to CANCELED.
2018-11-08 18:01:44,081 INFO org.apache.flink.runtime.taskmanager.Task - Freeing task resources for PartialSolution (BulkIteration (Bulk Iteration)) (97/600) (46005ba837efc4ebf783fc92121e55a8).
2018-11-08 18:02:03,097 WARN org.apache.flink.runtime.taskmanager.Task - Task 'PartialSolution (BulkIteration (Bulk Iteration)) (97/600)' did not react to cancelling signal for 30 seconds, but is stuck in method:
org.apache.flink.shaded.guava18.com.google.common.collect.Maps$EntryFunction$1.apply(Maps.java:86)
org.apache.flink.shaded.guava18.com.google.common.collect.Iterators$8.transform(Iterators.java:799)
org.apache.flink.shaded.guava18.com.google.common.collect.TransformedIterator.next(TransformedIterator.java:48)
java.util.AbstractCollection.toArray(AbstractCollection.java:141)
org.apache.flink.shaded.guava18.com.google.common.collect.ImmutableList.copyOf(ImmutableList.java:258)
org.apache.flink.runtime.io.network.partition.ResultPartitionManager.releasePartitionsProducedBy(ResultPartitionManager.java:100)
org.apache.flink.runtime.io.network.NetworkEnvironment.unregisterTask(NetworkEnvironment.java:275)
org.apache.flink.runtime.taskmanager.Task.run(Task.java:833)
java.lang.Thread.run(Thread.java:745)
2018-11-08 18:02:05,665 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Discarding the results produced by task execution e9141e20871e530dee904ddce11adca0.
2018-11-08 18:02:22,536 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Discarding the results produced by task execution 7fac76a5d76247d803e1f1c47a6b385f.
2018-11-08 18:03:47,210 WARN org.apache.flink.runtime.taskmanager.Task - Task 'PartialSolution (BulkIteration (Bulk Iteration)) (97/600)' did not react to cancelling signal for 30 seconds, but is stuck in method:
org.apache.flink.runtime.memory.MemoryManager.releaseAll(MemoryManager.java:497)
org.apache.flink.runtime.taskmanager.Task.run(Task.java:837)
java.lang.Thread.run(Thread.java:745)
2018-11-08 18:03:47,213 INFO org.apache.flink.runtime.taskmanager.Task - Ensuring all FileSystem streams are closed for task PartialSolution (BulkIteration (Bulk Iteration)) (97/600) (46005ba837efc4ebf783fc92121e55a8) [CANCELED]
2018-11-08 18:03:47,215 WARN org.apache.flink.shaded.akka.org.jboss.netty.channel.DefaultChannelPipeline - An exception was thrown by a user handler while handling an exception event ([id: 0x397132f7, /11.10.199.197:33286 => /11.9.137.228:40859] EXCEPTION: java.lang.OutOfMemoryError: GC overhead limit exceeded)
java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.flink.shaded.akka.org.jboss.netty.buffer.HeapChannelBuffer.<init>(HeapChannelBuffer.java:42)
at org.apache.flink.shaded.akka.org.jboss.netty.buffer.BigEndianHeapChannelBuffer.<init>(BigEndianHeapChannelBuffer.java:34)
at org.apache.flink.shaded.akka.org.jboss.netty.buffer.ChannelBuffers.buffer(ChannelBuffers.java:134)
at org.apache.flink.shaded.akka.org.jboss.netty.buffer.HeapChannelBufferFactory.getBuffer(HeapChannelBufferFactory.java:68)
at org.apache.flink.shaded.akka.org.jboss.netty.buffer.AbstractChannelBufferFactory.getBuffer(AbstractChannelBufferFactory.java:48)
at org.apache.flink.shaded.akka.org.jboss.netty.handler.codec.frame.FrameDecoder.extractFrame(FrameDecoder.java:566)
at org.apache.flink.shaded.akka.org.jboss.netty.handler.codec.frame.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:391)
at org.apache.flink.shaded.akka.org.jboss.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:425)
at org.apache.flink.shaded.akka.org.jboss.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:303)
at org.apache.flink.shaded.akka.org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
at org.apache.flink.shaded.akka.org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.apache.flink.shaded.akka.org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
at org.apache.flink.shaded.akka.org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
at org.apache.flink.shaded.akka.org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
at org.apache.flink.shaded.akka.org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
at org.apache.flink.shaded.akka.org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
at org.apache.flink.shaded.akka.org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
at org.apache.flink.shaded.akka.org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
at org.apache.flink.shaded.akka.org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at org.apache.flink.shaded.akka.org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.apache.flink.shaded.akka.org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{panel}
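To make the suspected failure mode concrete, here is a toy model of it. This is only a sketch and not Flink code: FailingHangSketch, JobView, onTaskUpdate and simulate are hypothetical names invented for illustration, and the rule that the job leaves FAILING only once every task has reported a terminal state is an assumption of the model. It shows that if a single CANCELED notification never reaches the job-side bookkeeping, the job stays in FAILING even though the task is already CANCELED locally.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Toy model of the reported hang. None of these types are Flink classes. */
public class FailingHangSketch {

    enum TaskState { CANCELING, CANCELED }

    enum JobState { FAILING, FAILED }

    /** Hypothetical JobManager-side bookkeeping for one failing job. */
    static class JobView {
        final Map<String, TaskState> tasks = new LinkedHashMap<>();
        JobState state = JobState.FAILING;

        void register(String taskId) {
            tasks.put(taskId, TaskState.CANCELING);
        }

        // Invoked only when a TaskManager's state update actually arrives.
        void onTaskUpdate(String taskId, TaskState newState) {
            tasks.put(taskId, newState);
            // Assumption of this model: the job leaves FAILING only once
            // every task has reported a terminal state.
            boolean allTerminal = tasks.values().stream()
                    .allMatch(s -> s == TaskState.CANCELED);
            if (allTerminal) {
                state = JobState.FAILED;
            }
        }
    }

    /** Cancels three tasks; updates for ids in lostUpdates never arrive. */
    static JobState simulate(Set<String> lostUpdates) {
        JobView job = new JobView();
        String[] ids = {"task-1", "task-2", "task-3"};
        for (String id : ids) {
            job.register(id);
        }
        for (String id : ids) {
            // The task is CANCELED locally either way; a lost update models
            // the notification failing (for example, because of an OOM).
            if (!lostUpdates.contains(id)) {
                job.onTaskUpdate(id, TaskState.CANCELED);
            }
        }
        return job.state;
    }

    public static void main(String[] args) {
        // All updates delivered: the job reaches FAILED.
        System.out.println(simulate(Collections.<String>emptySet()));
        // One update lost: the job remains in FAILING.
        System.out.println(simulate(Collections.singleton("task-3")));
    }
}
```

In this model nothing ever re-sends the lost update, so the second run never leaves FAILING. Whether the real code path behaves the same way is exactly what this issue asks to be checked.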