Posted to commits@beam.apache.org by "Chamikara Jayalath (JIRA)" <ji...@apache.org> on 2018/08/08 01:25:00 UTC

[jira] [Commented] (BEAM-5105) Move load job poll to finishBundle() method to better parallelize execution

    [ https://issues.apache.org/jira/browse/BEAM-5105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16572535#comment-16572535 ] 

Chamikara Jayalath commented on BEAM-5105:
------------------------------------------

Reuven, I might be missing drawbacks of this approach. Could you comment?

> Move load job poll to finishBundle() method to better parallelize execution
> ---------------------------------------------------------------------------
>
>                 Key: BEAM-5105
>                 URL: https://issues.apache.org/jira/browse/BEAM-5105
>             Project: Beam
>          Issue Type: Improvement
>          Components: io-java-gcp
>            Reporter: Chamikara Jayalath
>            Priority: Major
>
> It appears that when we write to BigQuery using WriteTablesDoFn we start a load job and wait for that job to finish.
> [https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteTables.java#L318]
>  
> In cases where we are trying to write a PCollection of tables (for example, when a user uses the dynamic destinations feature), this relies on dynamic work rebalancing to parallelize execution of load jobs. If the runner does not support dynamic work rebalancing, or does not perform it for some reason, this could have significant performance drawbacks. For example, scheduling times for load jobs will add up.
>  
> A better approach might be to start load jobs in the process() method but wait for all load jobs to finish in the finishBundle() method. This would parallelize any overheads as well as job execution (assuming more than one job is scheduled by BQ).
>  
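The approach proposed in the issue description can be sketched roughly as follows. This is a minimal, self-contained illustration in plain Java, not Beam's WriteTables code: the class DeferredLoadJobSketch, the LoadJobClient interface and the startLoadJob method are all hypothetical stand-ins, and the "load job" is simulated with a sleep. The processElement and finishBundle methods only mirror the roles of a DoFn's @ProcessElement and @FinishBundle methods.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for a DoFn that starts load jobs; not Beam's WriteTables. */
public class DeferredLoadJobSketch {

  /** Hypothetical stand-in for a client that starts a load job asynchronously. */
  public interface LoadJobClient {
    CompletableFuture<String> startLoadJob(String table);
  }

  private final LoadJobClient client;
  // Jobs started in the current bundle that have not been waited on yet.
  private final List<CompletableFuture<String>> pending = new ArrayList<>();

  public DeferredLoadJobSketch(LoadJobClient client) {
    this.client = client;
  }

  /** Plays the role of @ProcessElement: start the job and return without waiting. */
  public void processElement(String table) {
    pending.add(client.startLoadJob(table));
  }

  /** Plays the role of @FinishBundle: wait for every job started in this bundle. */
  public List<String> finishBundle() {
    List<String> results = new ArrayList<>();
    for (CompletableFuture<String> job : pending) {
      results.add(job.join()); // join() rethrows a job failure as CompletionException
    }
    pending.clear();
    return results;
  }

  public static void main(String[] args) {
    ExecutorService pool = Executors.newFixedThreadPool(4);
    // Simulated load job (assumption): each one takes about 200 ms.
    LoadJobClient fake = table -> CompletableFuture.supplyAsync(() -> {
      try {
        Thread.sleep(200);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException(e);
      }
      return table + ":DONE";
    }, pool);

    DeferredLoadJobSketch fn = new DeferredLoadJobSketch(fake);
    long start = System.nanoTime();
    for (String table : Arrays.asList("t1", "t2", "t3", "t4")) {
      fn.processElement(table);
    }
    List<String> results = fn.finishBundle();
    long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    System.out.println(results + " after about " + elapsedMs + " ms");
    pool.shutdown();
  }
}
```

In this sketch the four simulated jobs overlap, so the bundle waits roughly as long as one job rather than the sum of all four. A real change would also need to decide how a failed or retried job is handled in finishBundle(), which the sketch only surfaces as an exception from join().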


