You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@beam.apache.org by "Eugene Kirpichov (JIRA)" <ji...@apache.org> on 2018/07/02 23:25:00 UTC

[jira] [Closed] (BEAM-3268) getPerDestinationOutputFilenames() is getting processed before write is finished on dataflow runner

     [ https://issues.apache.org/jira/browse/BEAM-3268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eugene Kirpichov closed BEAM-3268.
----------------------------------
       Resolution: Fixed
    Fix Version/s: 2.5.0

> getPerDestinationOutputFilenames() is getting processed before write is finished on dataflow runner
> ---------------------------------------------------------------------------------------------------
>
>                 Key: BEAM-3268
>                 URL: https://issues.apache.org/jira/browse/BEAM-3268
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-dataflow
>    Affects Versions: 2.3.0
>            Reporter: Kamil Szewczyk
>            Assignee: Eugene Kirpichov
>            Priority: Major
>             Fix For: 2.5.0
>
>         Attachments: comparison.png
>
>          Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> While running filebased-io-test we found dataflow-runnner misbehaving. We run tests using single pipeline and without using Reshuffling between writing and reading dataflow jobs are unsuccessful because the runner tries to access the files that were not created yet. 
> On the picture the difference between execution of writting is presented. On the left there is working example with Reshuffling added and on the right without it.
> !comparison.png|thumbnail!
> Steps to reproduce: substitute your-bucket-name wit your valid bucket.
> {code:java}
> mvn -e -Pio-it verify -pl sdks/java/io/file-based-io-tests -DintegrationTestPipelineOptions='["--runner=dataflow", "--filenamePrefix=gs://your-bucket-name/TEXTIO_IT"]' -Pdataflow-runner
> {code}
> Then look on the cloud console and job should fail.
> Now add Reshuffling to sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/text/TextIOIT.java as in the example.
> {code:java}
> .getPerDestinationOutputFilenames().apply(Values.<String>create())
>         .apply(Reshuffle.<String>viaRandomKey());
>     PCollection<String> consolidatedHashcode = testFilenames
> {code}
> and trigger previously used maven command to see it working in the console right now.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)