Posted to issues@beam.apache.org by "Scott Wegner (JIRA)" <ji...@apache.org> on 2019/01/16 16:55:00 UTC

[jira] [Commented] (BEAM-6451) Portability Pipeline eventually hangs on bundle registration

    [ https://issues.apache.org/jira/browse/BEAM-6451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16744247#comment-16744247 ] 

Scott Wegner commented on BEAM-6451:
------------------------------------

It seems that the worker went into a bad state for one reason or another, and the bundle registration call never came back. But we don't have a timeout on the call, so the worker was never marked as unhealthy.

It seems that this call and many others use the [MoreFutures.get()|https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/util/MoreFutures.java#L55] helper to wait on the result of a CompletionStage. This overload doesn't specify a timeout and thus waits indefinitely. There is an overload that takes a timeout, but it appears to be rarely used.
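
For reference, the two helpers look roughly like this (a paraphrase of the linked file, so the exact signatures may differ slightly):

{code}
// Waits indefinitely for the stage to complete; this is the overload
// the stack trace below goes through.
public static <T> T get(CompletionStage<T> future)
    throws InterruptedException, ExecutionException {
  return future.toCompletableFuture().get();
}

// Bounds the wait; throws TimeoutException if the stage does not
// complete within the given duration.
public static <T> T get(CompletionStage<T> future, long duration, TimeUnit unit)
    throws InterruptedException, ExecutionException, TimeoutException {
  return future.toCompletableFuture().get(duration, unit);
}
{code}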

[~kenn] I see your name in the history for [MoreFutures|https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/util/MoreFutures.java]. Do you have any thoughts on whether it would make sense to add timeouts for these operations? Bundle registration seems like an operation that we can expect to return quickly; maybe start with a 5-minute timeout?
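
To sketch what a call-site change might look like (registrationFuture is a placeholder name here, and the exact exception handling would depend on how RegisterAndProcessBundleOperation wants to surface failures):

{code}
// Before: parks the thread forever if the SDK harness never responds.
//   MoreFutures.get(registrationFuture);

// After: bound the wait so a stuck worker fails fast instead of hanging.
// InterruptedException / ExecutionException would propagate as they do today.
try {
  MoreFutures.get(registrationFuture, 5, TimeUnit.MINUTES);
} catch (TimeoutException e) {
  throw new RuntimeException(
      "Bundle registration did not complete within 5 minutes", e);
}
{code}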

/cc [~altay]

> Portability Pipeline eventually hangs on bundle registration
> ------------------------------------------------------------
>
>                 Key: BEAM-6451
>                 URL: https://issues.apache.org/jira/browse/BEAM-6451
>             Project: Beam
>          Issue Type: Bug
>          Components: java-fn-execution, runner-dataflow, sdk-py-harness
>            Reporter: Scott Wegner
>            Assignee: Tyler Akidau
>            Priority: Minor
>              Labels: portability
>
> We've seen jobs using portability start off in a healthy state, but then eventually get stuck and hang on bundle registration. We see error logs from the worker harness:
> {code}
> Processing stuck in step s01 for at least 06h30m00s without outputting or completing in state finish at 
> sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at 
> java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693) at 
> java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323) at 
> java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729) at 
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895) at 
> org.apache.beam.sdk.util.MoreFutures.get(MoreFutures.java:57) at 
> org.apache.beam.runners.dataflow.worker.fn.control.RegisterAndProcessBundleOperation.finish(RegisterAndProcessBundleOperation.java:277) at 
> org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:85) at 
> org.apache.beam.runners.dataflow.worker.fn.control.BeamFnMapTaskExecutor.execute(BeamFnMapTaskExecutor.java:119) at 
> org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1226) at 
> org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.access$1000(StreamingDataflowWorker.java:141) at 
> org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker$6.run(StreamingDataflowWorker.java:965) at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at 
> java.lang.Thread.run(Thread.java:745)
> {code}
> Looking at [the code|https://github.com/apache/beam/blob/release-2.8.0/runners/google-cloud-dataflow-java/worker/src/main/java/org/apache/beam/runners/dataflow/worker/fn/control/RegisterAndProcessBundleOperation.java#L277], it appears there are no timeouts on the Bundle Registration calls over the Fn API, which lets this hang forever rather than failing with a useful error.
> This bug report came from a customer running a Python streaming pipeline using the new portability framework on Dataflow. Hopefully we can repro it on our own so we can link to the job / logs.


