You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by "Daniel Harper (JIRA)" <ji...@apache.org> on 2018/03/02 09:50:00 UTC

[jira] [Created] (FLINK-8834) Job fails to restart due to some tasks stuck in cancelling state

Daniel Harper created FLINK-8834:
------------------------------------

             Summary: Job fails to restart due to some tasks stuck in cancelling state
                 Key: FLINK-8834
                 URL: https://issues.apache.org/jira/browse/FLINK-8834
             Project: Flink
          Issue Type: Bug
    Affects Versions: 1.4.0
         Environment: EMR 5.12

Flink 1.4.0

Beam 2.3.0
            Reporter: Daniel Harper


Our job threw an exception overnight, causing the job to commence attempting a restart.

However it never managed to restart because 2 tasks are stuck in "Cancelling" state, with the following exception
{code:java}
2018-03-02 02:29:31,604 WARN  org.apache.flink.runtime.taskmanager.Task                     - Task 'PTransformTranslation.UnknownRawPTransform -> ParDoTranslation.RawParDo -> ParDoTranslation.RawParDo -> uk.co.bbc.sawmill.streaming.pipeline.output.io.file.WriteWindowToFile-RDotRecord/TextIO.Write/WriteFiles/GatherTempFileResults/Reshuffle/Window.Into()/Window.Assign.out -> ParDoTranslation.RawParDo -> ToKeyedWorkItem (24/32)' did not react to cancelling signal, but is stuck in method:
 java.lang.Thread.blockedOn(Thread.java:239)
java.lang.System$2.blockedOn(System.java:1252)
java.nio.channels.spi.AbstractInterruptibleChannel.blockedOn(AbstractInterruptibleChannel.java:211)
java.nio.channels.spi.AbstractInterruptibleChannel.begin(AbstractInterruptibleChannel.java:170)
java.nio.channels.Channels$WritableByteChannelImpl.write(Channels.java:457)
java.nio.channels.Channels.writeFullyImpl(Channels.java:78)
java.nio.channels.Channels.writeFully(Channels.java:101)
java.nio.channels.Channels.access$000(Channels.java:61)
java.nio.channels.Channels$1.write(Channels.java:174)
java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:253)
java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:211)
java.util.zip.GZIPOutputStream.write(GZIPOutputStream.java:145)
java.nio.channels.Channels$WritableByteChannelImpl.write(Channels.java:458)
java.nio.channels.Channels.writeFullyImpl(Channels.java:78)
java.nio.channels.Channels.writeFully(Channels.java:101)
java.nio.channels.Channels.access$000(Channels.java:61)
java.nio.channels.Channels$1.write(Channels.java:174)
sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
sun.nio.cs.StreamEncoder.implWrite(StreamEncoder.java:282)
sun.nio.cs.StreamEncoder.write(StreamEncoder.java:125)
sun.nio.cs.StreamEncoder.write(StreamEncoder.java:135)
java.io.OutputStreamWriter.write(OutputStreamWriter.java:220)
java.io.Writer.write(Writer.java:157)
org.apache.beam.sdk.io.TextSink$TextWriter.writeLine(TextSink.java:102)
org.apache.beam.sdk.io.TextSink$TextWriter.write(TextSink.java:118)
org.apache.beam.sdk.io.TextSink$TextWriter.write(TextSink.java:76)
org.apache.beam.sdk.io.WriteFiles.writeOrClose(WriteFiles.java:550)
org.apache.beam.sdk.io.WriteFiles.access$1000(WriteFiles.java:112)
org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn.processElement(WriteFiles.java:718)
org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn$DoFnInvoker.invokeProcessElement(Unknown Source)
org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:177)
org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:138)
org.apache.beam.runners.flink.metrics.DoFnRunnerWithMetricsUpdate.processElement(DoFnRunnerWithMetricsUpdate.java:65)
org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator.processElement(DoFnOperator.java:425)
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:549)
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:524)
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:504)
org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:831)
org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:809)
org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator$BufferedOutputManager.emit(DoFnOperator.java:888)
org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator$BufferedOutputManager.output(DoFnOperator.java:865)
org.apache.beam.runners.core.GroupAlsoByWindowViaWindowSetNewDoFn$1.outputWindowedValue(GroupAlsoByWindowViaWindowSetNewDoFn.java:94)
org.apache.beam.runners.core.GroupAlsoByWindowViaWindowSetNewDoFn$1.outputWindowedValue(GroupAlsoByWindowViaWindowSetNewDoFn.java:87)
org.apache.beam.runners.core.ReduceFnRunner.lambda$onTrigger$1(ReduceFnRunner.java:1040)
org.apache.beam.runners.core.ReduceFnRunner$$Lambda$163/1408647946.output(Unknown Source)
org.apache.beam.runners.core.ReduceFnContextFactory$OnTriggerContextImpl.output(ReduceFnContextFactory.java:433)
org.apache.beam.runners.core.SystemReduceFn.onTrigger(SystemReduceFn.java:127)
org.apache.beam.runners.core.ReduceFnRunner.onTrigger(ReduceFnRunner.java:1043)
org.apache.beam.runners.core.ReduceFnRunner.emit(ReduceFnRunner.java:911)
org.apache.beam.runners.core.ReduceFnRunner.onTimers(ReduceFnRunner.java:776)
org.apache.beam.runners.core.GroupAlsoByWindowViaWindowSetNewDoFn.processElement(GroupAlsoByWindowViaWindowSetNewDoFn.java:134)
org.apache.beam.runners.core.GroupAlsoByWindowViaWindowSetNewDoFn$DoFnInvoker.invokeProcessElement(Unknown Source)
org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:177)
org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:141)
org.apache.beam.runners.core.LateDataDroppingDoFnRunner.processElement(LateDataDroppingDoFnRunner.java:74)
org.apache.beam.runners.flink.metrics.DoFnRunnerWithMetricsUpdate.processElement(DoFnRunnerWithMetricsUpdate.java:65)
org.apache.beam.runners.flink.translation.wrappers.streaming.WindowDoFnOperator.fireTimer(WindowDoFnOperator.java:108)
org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator.onEventTime(DoFnOperator.java:767)
org.apache.flink.streaming.api.operators.HeapInternalTimerService.advanceWatermark(HeapInternalTimerService.java:277)
org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator.processWatermark1(DoFnOperator.java:532)
org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator.processWatermark(DoFnOperator.java:501)
org.apache.flink.streaming.runtime.io.StreamInputProcessor$ForwardingValveOutputHandler.handleWatermark(StreamInputProcessor.java:288)
org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.findAndOutputNewMinWatermarkAcrossAlignedChannels(StatusWatermarkValve.java:189)
org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.inputWatermark(StatusWatermarkValve.java:111)
org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:189)
org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:69)
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:264)
org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
java.lang.Thread.run(Thread.java:748)

2018-03-02 02:29:32,332 WARN  org.apache.flink.runtime.taskmanager.Task                     - Task 'PTransformTranslation.UnknownRawPTransform -> ParDoTranslation.RawParDo -> ParDoTranslation.RawParDo -> uk.co.bbc.sawmill.streaming.pipeline.output.io.file.WriteWindowToFile-RDotRecord2/TextIO.Write/WriteFiles/GatherTempFileResults/Reshuffle/Window.Into()/Window.Assign.out -> ParDoTranslation.RawParDo -> ToKeyedWorkItem (22/32)' did not react to cancelling signal, but is stuck in method:
 java.lang.Thread.blockedOn(Thread.java:239)
java.lang.System$2.blockedOn(System.java:1252)
java.nio.channels.spi.AbstractInterruptibleChannel.blockedOn(AbstractInterruptibleChannel.java:211)
java.nio.channels.spi.AbstractInterruptibleChannel.begin(AbstractInterruptibleChannel.java:170)
java.nio.channels.Channels$WritableByteChannelImpl.write(Channels.java:457)
java.nio.channels.Channels.writeFullyImpl(Channels.java:78)
java.nio.channels.Channels.writeFully(Channels.java:101)
java.nio.channels.Channels.access$000(Channels.java:61)
java.nio.channels.Channels$1.write(Channels.java:174)
java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:253)
java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:211)
java.util.zip.GZIPOutputStream.write(GZIPOutputStream.java:145)
java.nio.channels.Channels$WritableByteChannelImpl.write(Channels.java:458)
java.nio.channels.Channels.writeFullyImpl(Channels.java:78)
java.nio.channels.Channels.writeFully(Channels.java:101)
java.nio.channels.Channels.access$000(Channels.java:61)
java.nio.channels.Channels$1.write(Channels.java:174)
sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
sun.nio.cs.StreamEncoder.implWrite(StreamEncoder.java:282)
sun.nio.cs.StreamEncoder.write(StreamEncoder.java:125)
sun.nio.cs.StreamEncoder.write(StreamEncoder.java:135)
java.io.OutputStreamWriter.write(OutputStreamWriter.java:220)
java.io.Writer.write(Writer.java:157)
org.apache.beam.sdk.io.TextSink$TextWriter.writeLine(TextSink.java:102)
org.apache.beam.sdk.io.TextSink$TextWriter.write(TextSink.java:118)
org.apache.beam.sdk.io.TextSink$TextWriter.write(TextSink.java:76)
org.apache.beam.sdk.io.WriteFiles.writeOrClose(WriteFiles.java:550)
org.apache.beam.sdk.io.WriteFiles.access$1000(WriteFiles.java:112)
org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn.processElement(WriteFiles.java:718)
org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn$DoFnInvoker.invokeProcessElement(Unknown Source)
org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:177)
org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:138)
org.apache.beam.runners.flink.metrics.DoFnRunnerWithMetricsUpdate.processElement(DoFnRunnerWithMetricsUpdate.java:65)
org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator.processElement(DoFnOperator.java:425)
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:549)
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:524)
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:504)
org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:831)
org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:809)
org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator$BufferedOutputManager.emit(DoFnOperator.java:888)
org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator$BufferedOutputManager.output(DoFnOperator.java:865)
org.apache.beam.runners.core.GroupAlsoByWindowViaWindowSetNewDoFn$1.outputWindowedValue(GroupAlsoByWindowViaWindowSetNewDoFn.java:94)
org.apache.beam.runners.core.GroupAlsoByWindowViaWindowSetNewDoFn$1.outputWindowedValue(GroupAlsoByWindowViaWindowSetNewDoFn.java:87)
org.apache.beam.runners.core.ReduceFnRunner.lambda$onTrigger$1(ReduceFnRunner.java:1040)
org.apache.beam.runners.core.ReduceFnRunner$$Lambda$163/1408647946.output(Unknown Source)
org.apache.beam.runners.core.ReduceFnContextFactory$OnTriggerContextImpl.output(ReduceFnContextFactory.java:433)
org.apache.beam.runners.core.SystemReduceFn.onTrigger(SystemReduceFn.java:127)
org.apache.beam.runners.core.ReduceFnRunner.onTrigger(ReduceFnRunner.java:1043)
org.apache.beam.runners.core.ReduceFnRunner.emit(ReduceFnRunner.java:911)
org.apache.beam.runners.core.ReduceFnRunner.onTimers(ReduceFnRunner.java:776)
org.apache.beam.runners.core.GroupAlsoByWindowViaWindowSetNewDoFn.processElement(GroupAlsoByWindowViaWindowSetNewDoFn.java:134)
org.apache.beam.runners.core.GroupAlsoByWindowViaWindowSetNewDoFn$DoFnInvoker.invokeProcessElement(Unknown Source)
org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:177)
org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:141)
org.apache.beam.runners.core.LateDataDroppingDoFnRunner.processElement(LateDataDroppingDoFnRunner.java:74)
org.apache.beam.runners.flink.metrics.DoFnRunnerWithMetricsUpdate.processElement(DoFnRunnerWithMetricsUpdate.java:65)
org.apache.beam.runners.flink.translation.wrappers.streaming.WindowDoFnOperator.fireTimer(WindowDoFnOperator.java:108)
org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator.onEventTime(DoFnOperator.java:767)
org.apache.flink.streaming.api.operators.HeapInternalTimerService.advanceWatermark(HeapInternalTimerService.java:277)
org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator.processWatermark1(DoFnOperator.java:532)
org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator.processWatermark(DoFnOperator.java:501)
org.apache.flink.streaming.runtime.io.StreamInputProcessor$ForwardingValveOutputHandler.handleWatermark(StreamInputProcessor.java:288)
org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.findAndOutputNewMinWatermarkAcrossAlignedChannels(StatusWatermarkValve.java:189)
org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.inputWatermark(StatusWatermarkValve.java:111)
org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:189)
org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:69)
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:264)
org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
java.lang.Thread.run(Thread.java:748)
{code}
I can see a bit further up in the logs the following exceptions too (although not sure if they are related) - this exception looks similar to FLINK-8751
{code:java}
2018-03-02 02:29:07,094 ERROR org.apache.flink.streaming.runtime.tasks.StreamTask           - Could not shut down timer service
java.lang.InterruptedException
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2067)
        at java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1475)
        at org.apache.flink.streaming.runtime.tasks.SystemProcessingTimeService.shutdownAndAwaitPending(SystemProcessingTimeService.java:197)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:317)
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718){code}
 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)