Posted to issues@beam.apache.org by "Kenneth Knowles (JIRA)" <ji...@apache.org> on 2019/01/16 17:26:00 UTC

[jira] [Comment Edited] (BEAM-6451) Portability Pipeline eventually hangs on bundle registration

    [ https://issues.apache.org/jira/browse/BEAM-6451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16744271#comment-16744271 ] 

Kenneth Knowles edited comment on BEAM-6451 at 1/16/19 5:25 PM:
----------------------------------------------------------------

I'm just going to brain-dump and not be too prescriptive about this exact case (but very prescriptive in general :-). The recommended way to use futures is to _never_ call get(). It exists only as the very final bridge to synchronous code. Instead, asynchronous chaining should be the primary way to consume the results of futures.
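
To make the chaining point concrete, here is a minimal sketch. It is not taken from the Beam code base: registerBundle is a hypothetical stand-in for any call that returns a CompletionStage, and the class name is made up for illustration. Everything downstream of the call is expressed as a continuation, and the only blocking call sits at the outermost edge of the program.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;

class ChainingSketch {
  // Hypothetical asynchronous call; stands in for a real RPC.
  static CompletionStage<String> registerBundle(String id) {
    return CompletableFuture.supplyAsync(() -> "registered:" + id);
  }

  // Consumes the result by chaining; nothing in here blocks a thread.
  static CompletionStage<Integer> registerThenMeasure(String id) {
    return registerBundle(id)
        .thenApply(String::length)   // runs once the response arrives
        .exceptionally(t -> -1);     // failures flow down the same chain
  }

  public static void main(String[] args) {
    // The single bridge to synchronous code, at the very end.
    int length = registerThenMeasure("b1").toCompletableFuture().join();
    System.out.println(length);
  }
}
```

The method returns a CompletionStage rather than a value, so callers can keep chaining instead of being pushed toward get().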

The Java futures API missed the mark: all the APIs you _should_ use are on CompletionStage, while the APIs that you sometimes _must_ use are on Future (you can get a CompletableFuture for any CompletionStage via toCompletableFuture(), IIRC). So MoreFutures is an attempt to present the appropriate API surface. A timeout is probably not always appropriate, but an asynchronous chain should be guaranteed to terminate. The timeout should be at the point where a potentially failing network call is made.
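
As a rough illustration of putting the timeout where the potentially failing call is made, the sketch below wraps a stage so that it fails with a TimeoutException if no result arrives in time. The helper withTimeout and the class name are hypothetical, not Beam or MoreFutures APIs. It relies only on Java 8 classes; on Java 9 and later, CompletableFuture.orTimeout covers the same ground.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class TimeoutAtSourceSketch {
  // Daemon timer thread, so it never keeps the JVM alive on its own.
  private static final ScheduledExecutorService TIMER =
      Executors.newSingleThreadScheduledExecutor(r -> {
        Thread t = new Thread(r, "timeout-timer");
        t.setDaemon(true);
        return t;
      });

  // Hypothetical helper: mirrors the call's outcome, or fails once the deadline passes.
  static <T> CompletionStage<T> withTimeout(CompletionStage<T> call, long amount, TimeUnit unit) {
    CompletableFuture<T> result = new CompletableFuture<>();
    // Whichever completion happens first wins; later attempts are no-ops.
    ScheduledFuture<?> timer = TIMER.schedule(() -> {
      result.completeExceptionally(
          new TimeoutException("no response after " + amount + " " + unit));
    }, amount, unit);
    call.whenComplete((value, error) -> {
      timer.cancel(false);
      if (error != null) {
        result.completeExceptionally(error);
      } else {
        result.complete(value);
      }
    });
    return result;
  }

  public static void main(String[] args) {
    // Simulates a network call whose response never arrives.
    CompletableFuture<String> neverAnswers = new CompletableFuture<>();
    String outcome = withTimeout(neverAnswers, 100, TimeUnit.MILLISECONDS)
        .handle((value, error) -> error == null ? value : "failed: " + error)
        .toCompletableFuture()
        .join();
    System.out.println(outcome);
  }
}
```

With the deadline attached at the source, every stage chained after it is guaranteed to complete one way or the other, so the final bridge to synchronous code cannot wait forever.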


was (Author: kenn):
I'm just going to brain-dump and not be too prescriptive. The recommended way to use futures is to _never_ call get(). The existence of this is just to provide the very final bridge to synchronous code. Instead, asynchronous chaining should be the primary way to consume the results of futures.

The Java futures API missed the mark - all the APIs you _should_ use are on CompletionStage and then the APIs that you sometimes _must_ use are on Future (you can get a CompletableFuture for any CompletionStage IIRC). So MoreFutures is an attempt to present the appropriate API surface. A timeout is probably not always appropriate, but an asynchronous chain should be guaranteed to terminate. The timeout should be at the point where a potentially failing network call is made.

> Portability Pipeline eventually hangs on bundle registration
> ------------------------------------------------------------
>
>                 Key: BEAM-6451
>                 URL: https://issues.apache.org/jira/browse/BEAM-6451
>             Project: Beam
>          Issue Type: Bug
>          Components: java-fn-execution, runner-dataflow, sdk-py-harness
>            Reporter: Scott Wegner
>            Priority: Minor
>              Labels: portability
>
> We've seen jobs using portability start off in a healthy state, but then eventually get stuck and hang on bundle registration. We see error logs from the worker harness:
> {code}
> Processing stuck in step s01 for at least 06h30m00s without outputting or completing in state finish
>   at sun.misc.Unsafe.park(Native Method)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>   at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693)
>   at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
>   at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729)
>   at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
>   at org.apache.beam.sdk.util.MoreFutures.get(MoreFutures.java:57)
>   at org.apache.beam.runners.dataflow.worker.fn.control.RegisterAndProcessBundleOperation.finish(RegisterAndProcessBundleOperation.java:277)
>   at org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:85)
>   at org.apache.beam.runners.dataflow.worker.fn.control.BeamFnMapTaskExecutor.execute(BeamFnMapTaskExecutor.java:119)
>   at org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1226)
>   at org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.access$1000(StreamingDataflowWorker.java:141)
>   at org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker$6.run(StreamingDataflowWorker.java:965)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Looking at [the code|https://github.com/apache/beam/blob/release-2.8.0/runners/google-cloud-dataflow-java/worker/src/main/java/org/apache/beam/runners/dataflow/worker/fn/control/RegisterAndProcessBundleOperation.java#L277], it looks like there are no timeouts on the Bundle Registration calls over the FnApi, which contributes to the worker hanging forever rather than failing with a clear error.
> This bug report came from a customer running a Python streaming pipeline using the new portability framework on Dataflow. Hopefully we can reproduce it ourselves so that we can link to the job and logs.


