You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by "Flink Jira Bot (Jira)" <ji...@apache.org> on 2021/10/28 22:40:02 UTC

[jira] [Updated] (FLINK-21263) Job hangs under backpressure

     [ https://issues.apache.org/jira/browse/FLINK-21263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Flink Jira Bot updated FLINK-21263:
-----------------------------------
    Labels: auto-deprioritized-major stale-minor  (was: auto-deprioritized-major)

I am the [Flink Jira Bot|https://github.com/apache/flink-jira-bot/] and I help the community manage its development. I see this issues has been marked as Minor but is unassigned and neither itself nor its Sub-Tasks have been updated for 180 days. I have gone ahead and marked it "stale-minor". If this ticket is still Minor, please either assign yourself or give an update. Afterwards, please remove the label or in 7 days the issue will be deprioritized.


> Job hangs under backpressure
> ----------------------------
>
>                 Key: FLINK-21263
>                 URL: https://issues.apache.org/jira/browse/FLINK-21263
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Network
>    Affects Versions: 1.11.0
>            Reporter: Lu Niu
>            Priority: Minor
>              Labels: auto-deprioritized-major, stale-minor
>         Attachments: source_graph.svg, source_js1, source_js2, source_js3
>
>
> We have a flink job that runs fine for a few days but suddenly hangs and could never recover. Once we relanuch the job, the job runs fine. We detected the job has backpressure, but in all other cases, backpressure would only lead to slower consumption but what is wired here is the job made no progress at all. The symptoms looks similar with FLINK-20618
>  
> About the job:
> 1. Reads from Kafka and writes to Kafka
> 2. version 1.11
> 3. enabled unaligned checkpoint
>  
> symptoms:
>  # All source/sink throughput drop to 0
>  # All checkpoint fails immediately after triggering.
>  # backpressure shows "high" from source to two downstream operators. 
>  # Flamegraph shows all subtask threads are in waiting
>  # Source jstack shows the Source thread is BLOCKED, as belows.
> {code:java}
> Source: impression-reader -> impression-filter -> impression-data-conversion (1/60)
> Stack Trace is:
> java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x00000003a3e71330> (a java.util.concurrent.CompletableFuture$Signaller)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693)
> at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
> at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729)
> at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
> at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestMemorySegmentBlocking(LocalBufferPool.java:293)
> at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestBufferBuilderBlocking(LocalBufferPool.java:266)
> at org.apache.flink.runtime.io.network.partition.ResultPartition.getBufferBuilder(ResultPartition.java:213)
> at org.apache.flink.runtime.io.network.api.writer.RecordWriter.requestNewBufferBuilder(RecordWriter.java:294)
> at org.apache.flink.runtime.io.network.api.writer.ChannelSelectorRecordWriter.requestNewBufferBuilder(ChannelSelectorRecordWriter.java:103)
> at org.apache.flink.runtime.io.network.api.writer.ChannelSelectorRecordWriter.getBufferBuilder(ChannelSelectorRecordWriter.java:95)
> at org.apache.flink.runtime.io.network.api.writer.RecordWriter.copyFromSerializerToTargetChannel(RecordWriter.java:135)
> at org.apache.flink.runtime.io.network.api.writer.ChannelSelectorRecordWriter.broadcastEmit(ChannelSelectorRecordWriter.java:80)
> at org.apache.flink.streaming.runtime.io.RecordWriterOutput.emitWatermark(RecordWriterOutput.java:121)
> at org.apache.flink.streaming.api.operators.CountingOutput.emitWatermark(CountingOutput.java:41)
> at org.apache.flink.streaming.api.operators.AbstractStreamOperator.processWatermark(AbstractStreamOperator.java:570)
> at org.apache.flink.streaming.runtime.tasks.OperatorChain$ChainingOutput.emitWatermark(OperatorChain.java:638)
> at org.apache.flink.streaming.api.operators.CountingOutput.emitWatermark(CountingOutput.java:41)
> at org.apache.flink.streaming.api.operators.AbstractStreamOperator.processWatermark(AbstractStreamOperator.java:570)
> at org.apache.flink.streaming.runtime.tasks.OperatorChain$ChainingOutput.emitWatermark(OperatorChain.java:638)
> at org.apache.flink.streaming.api.operators.CountingOutput.emitWatermark(CountingOutput.java:41)
> at org.apache.flink.streaming.api.operators.StreamSourceContexts$ManualWatermarkContext.processAndEmitWatermark(StreamSourceContexts.java:315)
> at org.apache.flink.streaming.api.operators.StreamSourceContexts$WatermarkContext.emitWatermark(StreamSourceContexts.java:425)
> - locked <0x00000006a485dab0> (a java.lang.Object)
> at org.apache.flink.streaming.connectors.kafka.internals.SourceContextWatermarkOutputAdapter.emitWatermark(SourceContextWatermarkOutputAdapter.java:37)
> at org.apache.flink.api.common.eventtime.WatermarkOutputMultiplexer.updateCombinedWatermark(WatermarkOutputMultiplexer.java:167)
> at org.apache.flink.api.common.eventtime.WatermarkOutputMultiplexer.onPeriodicEmit(WatermarkOutputMultiplexer.java:136)
> at org.apache.flink.streaming.connectors.kafka.internals.AbstractFetcher$PeriodicWatermarkEmitter.onProcessingTime(AbstractFetcher.java:574)
> - locked <0x00000006a485dab0> (a java.lang.Object)
> at org.apache.flink.streaming.runtime.tasks.StreamTask.invokeProcessingTimeCallback(StreamTask.java:1220)
> at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$null$16(StreamTask.java:1211)
> at org.apache.flink.streaming.runtime.tasks.StreamTask$$Lambda$590/1066788035.run(Unknown Source)
> at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.runThrowing(StreamTaskActionExecutor.java:92)
> - locked <0x00000006a485dab0> (a java.lang.Object)
> at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:78)
> at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:282)
> at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxStep(MailboxProcessor.java:190)
> at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:181)
> at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:566)
> at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:536)
> at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:721)
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:546)
> at java.lang.Thread.run(Thread.java:748){code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)