You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by "Stephan Ewen (JIRA)" <ji...@apache.org> on 2018/03/08 10:56:00 UTC

[jira] [Resolved] (FLINK-8834) Job fails to restart due to some tasks stuck in cancelling state

     [ https://issues.apache.org/jira/browse/FLINK-8834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stephan Ewen resolved FLINK-8834.
---------------------------------
    Resolution: Fixed

Subsumed by FLINK-8856

> Job fails to restart due to some tasks stuck in cancelling state
> ----------------------------------------------------------------
>
>                 Key: FLINK-8834
>                 URL: https://issues.apache.org/jira/browse/FLINK-8834
>             Project: Flink
>          Issue Type: Bug
>    Affects Versions: 1.4.0
>         Environment: AWS EMR 5.12
> Flink 1.4.0
> Beam 2.3.0
>            Reporter: Daniel Harper
>            Priority: Major
>             Fix For: 1.5.0
>
>
> Our job threw an exception overnight, causing the job to commence attempting a restart.
> However it never managed to restart because 2 tasks on one of the Task Managers are stuck in "Cancelling" state, with the following exception
> {code:java}
> 2018-03-02 02:29:31,604 WARN  org.apache.flink.runtime.taskmanager.Task                     - Task 'PTransformTranslation.UnknownRawPTransform -> ParDoTranslation.RawParDo -> ParDoTranslation.RawParDo -> uk.co.bbc.sawmill.streaming.pipeline.output.io.file.WriteWindowToFile-RDotRecord/TextIO.Write/WriteFiles/GatherTempFileResults/Reshuffle/Window.Into()/Window.Assign.out -> ParDoTranslation.RawParDo -> ToKeyedWorkItem (24/32)' did not react to cancelling signal, but is stuck in method:
>  java.lang.Thread.blockedOn(Thread.java:239)
> java.lang.System$2.blockedOn(System.java:1252)
> java.nio.channels.spi.AbstractInterruptibleChannel.blockedOn(AbstractInterruptibleChannel.java:211)
> java.nio.channels.spi.AbstractInterruptibleChannel.begin(AbstractInterruptibleChannel.java:170)
> java.nio.channels.Channels$WritableByteChannelImpl.write(Channels.java:457)
> java.nio.channels.Channels.writeFullyImpl(Channels.java:78)
> java.nio.channels.Channels.writeFully(Channels.java:101)
> java.nio.channels.Channels.access$000(Channels.java:61)
> java.nio.channels.Channels$1.write(Channels.java:174)
> java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:253)
> java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:211)
> java.util.zip.GZIPOutputStream.write(GZIPOutputStream.java:145)
> java.nio.channels.Channels$WritableByteChannelImpl.write(Channels.java:458)
> java.nio.channels.Channels.writeFullyImpl(Channels.java:78)
> java.nio.channels.Channels.writeFully(Channels.java:101)
> java.nio.channels.Channels.access$000(Channels.java:61)
> java.nio.channels.Channels$1.write(Channels.java:174)
> sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
> sun.nio.cs.StreamEncoder.implWrite(StreamEncoder.java:282)
> sun.nio.cs.StreamEncoder.write(StreamEncoder.java:125)
> sun.nio.cs.StreamEncoder.write(StreamEncoder.java:135)
> java.io.OutputStreamWriter.write(OutputStreamWriter.java:220)
> java.io.Writer.write(Writer.java:157)
> org.apache.beam.sdk.io.TextSink$TextWriter.writeLine(TextSink.java:102)
> org.apache.beam.sdk.io.TextSink$TextWriter.write(TextSink.java:118)
> org.apache.beam.sdk.io.TextSink$TextWriter.write(TextSink.java:76)
> org.apache.beam.sdk.io.WriteFiles.writeOrClose(WriteFiles.java:550)
> org.apache.beam.sdk.io.WriteFiles.access$1000(WriteFiles.java:112)
> org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn.processElement(WriteFiles.java:718)
> org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn$DoFnInvoker.invokeProcessElement(Unknown Source)
> org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:177)
> org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:138)
> org.apache.beam.runners.flink.metrics.DoFnRunnerWithMetricsUpdate.processElement(DoFnRunnerWithMetricsUpdate.java:65)
> org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator.processElement(DoFnOperator.java:425)
> org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:549)
> org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:524)
> org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:504)
> org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:831)
> org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:809)
> org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator$BufferedOutputManager.emit(DoFnOperator.java:888)
> org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator$BufferedOutputManager.output(DoFnOperator.java:865)
> org.apache.beam.runners.core.GroupAlsoByWindowViaWindowSetNewDoFn$1.outputWindowedValue(GroupAlsoByWindowViaWindowSetNewDoFn.java:94)
> org.apache.beam.runners.core.GroupAlsoByWindowViaWindowSetNewDoFn$1.outputWindowedValue(GroupAlsoByWindowViaWindowSetNewDoFn.java:87)
> org.apache.beam.runners.core.ReduceFnRunner.lambda$onTrigger$1(ReduceFnRunner.java:1040)
> org.apache.beam.runners.core.ReduceFnRunner$$Lambda$163/1408647946.output(Unknown Source)
> org.apache.beam.runners.core.ReduceFnContextFactory$OnTriggerContextImpl.output(ReduceFnContextFactory.java:433)
> org.apache.beam.runners.core.SystemReduceFn.onTrigger(SystemReduceFn.java:127)
> org.apache.beam.runners.core.ReduceFnRunner.onTrigger(ReduceFnRunner.java:1043)
> org.apache.beam.runners.core.ReduceFnRunner.emit(ReduceFnRunner.java:911)
> org.apache.beam.runners.core.ReduceFnRunner.onTimers(ReduceFnRunner.java:776)
> org.apache.beam.runners.core.GroupAlsoByWindowViaWindowSetNewDoFn.processElement(GroupAlsoByWindowViaWindowSetNewDoFn.java:134)
> org.apache.beam.runners.core.GroupAlsoByWindowViaWindowSetNewDoFn$DoFnInvoker.invokeProcessElement(Unknown Source)
> org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:177)
> org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:141)
> org.apache.beam.runners.core.LateDataDroppingDoFnRunner.processElement(LateDataDroppingDoFnRunner.java:74)
> org.apache.beam.runners.flink.metrics.DoFnRunnerWithMetricsUpdate.processElement(DoFnRunnerWithMetricsUpdate.java:65)
> org.apache.beam.runners.flink.translation.wrappers.streaming.WindowDoFnOperator.fireTimer(WindowDoFnOperator.java:108)
> org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator.onEventTime(DoFnOperator.java:767)
> org.apache.flink.streaming.api.operators.HeapInternalTimerService.advanceWatermark(HeapInternalTimerService.java:277)
> org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator.processWatermark1(DoFnOperator.java:532)
> org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator.processWatermark(DoFnOperator.java:501)
> org.apache.flink.streaming.runtime.io.StreamInputProcessor$ForwardingValveOutputHandler.handleWatermark(StreamInputProcessor.java:288)
> org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.findAndOutputNewMinWatermarkAcrossAlignedChannels(StatusWatermarkValve.java:189)
> org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.inputWatermark(StatusWatermarkValve.java:111)
> org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:189)
> org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:69)
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:264)
> org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
> java.lang.Thread.run(Thread.java:748)
> 2018-03-02 02:29:32,332 WARN  org.apache.flink.runtime.taskmanager.Task                     - Task 'PTransformTranslation.UnknownRawPTransform -> ParDoTranslation.RawParDo -> ParDoTranslation.RawParDo -> uk.co.bbc.sawmill.streaming.pipeline.output.io.file.WriteWindowToFile-RDotRecord2/TextIO.Write/WriteFiles/GatherTempFileResults/Reshuffle/Window.Into()/Window.Assign.out -> ParDoTranslation.RawParDo -> ToKeyedWorkItem (22/32)' did not react to cancelling signal, but is stuck in method:
>  java.lang.Thread.blockedOn(Thread.java:239)
> java.lang.System$2.blockedOn(System.java:1252)
> java.nio.channels.spi.AbstractInterruptibleChannel.blockedOn(AbstractInterruptibleChannel.java:211)
> java.nio.channels.spi.AbstractInterruptibleChannel.begin(AbstractInterruptibleChannel.java:170)
> java.nio.channels.Channels$WritableByteChannelImpl.write(Channels.java:457)
> java.nio.channels.Channels.writeFullyImpl(Channels.java:78)
> java.nio.channels.Channels.writeFully(Channels.java:101)
> java.nio.channels.Channels.access$000(Channels.java:61)
> java.nio.channels.Channels$1.write(Channels.java:174)
> java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:253)
> java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:211)
> java.util.zip.GZIPOutputStream.write(GZIPOutputStream.java:145)
> java.nio.channels.Channels$WritableByteChannelImpl.write(Channels.java:458)
> java.nio.channels.Channels.writeFullyImpl(Channels.java:78)
> java.nio.channels.Channels.writeFully(Channels.java:101)
> java.nio.channels.Channels.access$000(Channels.java:61)
> java.nio.channels.Channels$1.write(Channels.java:174)
> sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
> sun.nio.cs.StreamEncoder.implWrite(StreamEncoder.java:282)
> sun.nio.cs.StreamEncoder.write(StreamEncoder.java:125)
> sun.nio.cs.StreamEncoder.write(StreamEncoder.java:135)
> java.io.OutputStreamWriter.write(OutputStreamWriter.java:220)
> java.io.Writer.write(Writer.java:157)
> org.apache.beam.sdk.io.TextSink$TextWriter.writeLine(TextSink.java:102)
> org.apache.beam.sdk.io.TextSink$TextWriter.write(TextSink.java:118)
> org.apache.beam.sdk.io.TextSink$TextWriter.write(TextSink.java:76)
> org.apache.beam.sdk.io.WriteFiles.writeOrClose(WriteFiles.java:550)
> org.apache.beam.sdk.io.WriteFiles.access$1000(WriteFiles.java:112)
> org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn.processElement(WriteFiles.java:718)
> org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn$DoFnInvoker.invokeProcessElement(Unknown Source)
> org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:177)
> org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:138)
> org.apache.beam.runners.flink.metrics.DoFnRunnerWithMetricsUpdate.processElement(DoFnRunnerWithMetricsUpdate.java:65)
> org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator.processElement(DoFnOperator.java:425)
> org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:549)
> org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:524)
> org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:504)
> org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:831)
> org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:809)
> org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator$BufferedOutputManager.emit(DoFnOperator.java:888)
> org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator$BufferedOutputManager.output(DoFnOperator.java:865)
> org.apache.beam.runners.core.GroupAlsoByWindowViaWindowSetNewDoFn$1.outputWindowedValue(GroupAlsoByWindowViaWindowSetNewDoFn.java:94)
> org.apache.beam.runners.core.GroupAlsoByWindowViaWindowSetNewDoFn$1.outputWindowedValue(GroupAlsoByWindowViaWindowSetNewDoFn.java:87)
> org.apache.beam.runners.core.ReduceFnRunner.lambda$onTrigger$1(ReduceFnRunner.java:1040)
> org.apache.beam.runners.core.ReduceFnRunner$$Lambda$163/1408647946.output(Unknown Source)
> org.apache.beam.runners.core.ReduceFnContextFactory$OnTriggerContextImpl.output(ReduceFnContextFactory.java:433)
> org.apache.beam.runners.core.SystemReduceFn.onTrigger(SystemReduceFn.java:127)
> org.apache.beam.runners.core.ReduceFnRunner.onTrigger(ReduceFnRunner.java:1043)
> org.apache.beam.runners.core.ReduceFnRunner.emit(ReduceFnRunner.java:911)
> org.apache.beam.runners.core.ReduceFnRunner.onTimers(ReduceFnRunner.java:776)
> org.apache.beam.runners.core.GroupAlsoByWindowViaWindowSetNewDoFn.processElement(GroupAlsoByWindowViaWindowSetNewDoFn.java:134)
> org.apache.beam.runners.core.GroupAlsoByWindowViaWindowSetNewDoFn$DoFnInvoker.invokeProcessElement(Unknown Source)
> org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:177)
> org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:141)
> org.apache.beam.runners.core.LateDataDroppingDoFnRunner.processElement(LateDataDroppingDoFnRunner.java:74)
> org.apache.beam.runners.flink.metrics.DoFnRunnerWithMetricsUpdate.processElement(DoFnRunnerWithMetricsUpdate.java:65)
> org.apache.beam.runners.flink.translation.wrappers.streaming.WindowDoFnOperator.fireTimer(WindowDoFnOperator.java:108)
> org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator.onEventTime(DoFnOperator.java:767)
> org.apache.flink.streaming.api.operators.HeapInternalTimerService.advanceWatermark(HeapInternalTimerService.java:277)
> org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator.processWatermark1(DoFnOperator.java:532)
> org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator.processWatermark(DoFnOperator.java:501)
> org.apache.flink.streaming.runtime.io.StreamInputProcessor$ForwardingValveOutputHandler.handleWatermark(StreamInputProcessor.java:288)
> org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.findAndOutputNewMinWatermarkAcrossAlignedChannels(StatusWatermarkValve.java:189)
> org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.inputWatermark(StatusWatermarkValve.java:111)
> org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:189)
> org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:69)
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:264)
> org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
> java.lang.Thread.run(Thread.java:748)
> {code}
> I can see a bit further up in the logs the following exceptions too (although not sure if they are related) - this exception looks similar to FLINK-8751
> {code:java}
> 2018-03-02 02:29:07,094 ERROR org.apache.flink.streaming.runtime.tasks.StreamTask           - Could not shut down timer service
> java.lang.InterruptedException
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2067)
>         at java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1475)
>         at org.apache.flink.streaming.runtime.tasks.SystemProcessingTimeService.shutdownAndAwaitPending(SystemProcessingTimeService.java:197)
>         at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:317)
>         at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718){code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)