Posted to issues@beam.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2021/05/10 20:44:00 UTC

[jira] [Work logged] (BEAM-12144) Dataflow streaming worker stuck and unable to get work from Streaming Engine

     [ https://issues.apache.org/jira/browse/BEAM-12144?focusedWorklogId=594232&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-594232 ]

ASF GitHub Bot logged work on BEAM-12144:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 10/May/21 20:43
            Start Date: 10/May/21 20:43
    Worklog Time Spent: 10m 
      Work Description: scwhittle commented on pull request #14714:
URL: https://github.com/apache/beam/pull/14714#issuecomment-837309662


   Repushing to fix jira in description


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Issue Time Tracking
-------------------

            Worklog Id:     (was: 594232)
    Remaining Estimate: 0h
            Time Spent: 10m

> Dataflow streaming worker stuck and unable to get work from Streaming Engine
> ----------------------------------------------------------------------------
>
>                 Key: BEAM-12144
>                 URL: https://issues.apache.org/jira/browse/BEAM-12144
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-dataflow
>    Affects Versions: 2.26.0
>            Reporter: Sam Whittle
>            Assignee: Sam Whittle
>            Priority: P2
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Observed in 2.26, but it seems likely to affect later versions as well, since the previous issues addressing similar problems were fixed before 2.26. This seems similar to BEAM-9651, but it is not the deadlock observed there.
> The thread getting work has the following stack:
> --- Threads (1): [Thread[DispatchThread,1,main]] State: WAITING stack: ---
>   java.base@11.0.9/jdk.internal.misc.Unsafe.park(Native Method)
>   java.base@11.0.9/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
>   java.base@11.0.9/java.util.concurrent.Phaser$QNode.block(Phaser.java:1127)
>   java.base@11.0.9/java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3128)
>   java.base@11.0.9/java.util.concurrent.Phaser.internalAwaitAdvance(Phaser.java:1057)
>   java.base@11.0.9/java.util.concurrent.Phaser.awaitAdvanceInterruptibly(Phaser.java:747)
>   app//org.apache.beam.runners.dataflow.worker.windmill.DirectStreamObserver.onNext(DirectStreamObserver.java:49)
>   app//org.apache.beam.runners.dataflow.worker.windmill.GrpcWindmillServer$AbstractWindmillStream.send(GrpcWindmillServer.java:662)
>   app//org.apache.beam.runners.dataflow.worker.windmill.GrpcWindmillServer$GrpcGetWorkStream.onNewStream(GrpcWindmillServer.java:868)
>   app//org.apache.beam.runners.dataflow.worker.windmill.GrpcWindmillServer$AbstractWindmillStream.startStream(GrpcWindmillServer.java:677)
>   app//org.apache.beam.runners.dataflow.worker.windmill.GrpcWindmillServer$GrpcGetWorkStream.<init>(GrpcWindmillServer.java:860)
>   app//org.apache.beam.runners.dataflow.worker.windmill.GrpcWindmillServer$GrpcGetWorkStream.<init>(GrpcWindmillServer.java:843)
>   app//org.apache.beam.runners.dataflow.worker.windmill.GrpcWindmillServer.getWorkStream(GrpcWindmillServer.java:543)
>   app//org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.streamingDispatchLoop(StreamingDataflowWorker.java:1047)
>   app//org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker$1.run(StreamingDataflowWorker.java:670)
>   java.base@11.0.9/java.lang.Thread.run(Thread.java:834)
> The status page shows:
> GetWorkStream: 0 buffers, 400 inflight messages allowed, 67108864 inflight bytes allowed, current stream is 61355396ms old, last send 61355396ms, last response -1ms
> This shows that the stream was created roughly 17 hours ago and that the header message was sent but never got a response. Combined with the stack trace, it appears the header was actually never sent, yet the stream also did not terminate with a deadline-exceeded error. Not surfacing an error for the stream looks like a gRPC issue; regardless, it would be safer not to block indefinitely on the Phaser waiting for the send, and instead throw an exception after, for example, 2x the stream deadline.
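>
> A minimal sketch of that idea, using the timed overload of Phaser.awaitAdvanceInterruptibly. The class, method, and field names below are illustrative, not the actual DirectStreamObserver code, and the 2x multiplier and deadline parameter are assumptions:
>
>   import java.util.concurrent.Phaser;
>   import java.util.concurrent.TimeUnit;
>   import java.util.concurrent.TimeoutException;
>
>   class BoundedReadyWait {
>     // Wait for the stream's "ready" phaser to advance, but give up after
>     // roughly 2x the stream deadline instead of parking forever.
>     static void awaitReadyOrFail(Phaser isReadyNotifier, int phase, long streamDeadlineSeconds) {
>       try {
>         isReadyNotifier.awaitAdvanceInterruptibly(phase, 2 * streamDeadlineSeconds, TimeUnit.SECONDS);
>       } catch (TimeoutException e) {
>         // Fail loudly so the caller can tear down and recreate the stream
>         // rather than leaving the dispatch thread stuck in WAITING.
>         throw new RuntimeException("Send not ready after 2x stream deadline; stream appears stuck", e);
>       } catch (InterruptedException e) {
>         Thread.currentThread().interrupt();
>         throw new RuntimeException(e);
>       }
>     }
>   }
>
> With a bounded wait like this, the DispatchThread would fail and recreate the GetWorkStream instead of hanging for hours.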



--
This message was sent by Atlassian Jira
(v8.3.4#803005)