Posted to commits@beam.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/10/01 13:47:00 UTC

[jira] [Work logged] (BEAM-5105) Move load job poll to finishBundle() method to better parallelize execution

     [ https://issues.apache.org/jira/browse/BEAM-5105?focusedWorklogId=150035&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-150035 ]

ASF GitHub Bot logged work on BEAM-5105:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 01/Oct/18 13:46
            Start Date: 01/Oct/18 13:46
    Worklog Time Spent: 10m 
      Work Description: chamikaramj commented on issue #6416: [BEAM-5105] Better parallelize BigQuery load jobs
URL: https://github.com/apache/beam/pull/6416#issuecomment-425913626
 
 
   LGTM.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 150035)
    Time Spent: 1h  (was: 50m)

> Move load job poll to finishBundle() method to better parallelize execution
> ---------------------------------------------------------------------------
>
>                 Key: BEAM-5105
>                 URL: https://issues.apache.org/jira/browse/BEAM-5105
>             Project: Beam
>          Issue Type: Improvement
>          Components: io-java-gcp
>            Reporter: Chamikara Jayalath
>            Priority: Major
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> It appears that when we write to BigQuery using WriteTablesDoFn we start a load job and wait for that job to finish.
> [https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteTables.java#L318]
>  
> In cases where we are trying to write a PCollection of tables (for example, when users use the dynamic destinations feature), this relies on dynamic work rebalancing to parallelize execution of load jobs. If the runner does not support dynamic work rebalancing, or does not perform it for some reason, this could have significant performance drawbacks. For example, scheduling times for load jobs will add up.
>  
> A better approach might be to start load jobs in the process() method but wait for all load jobs to finish in the finishBundle() method. This will parallelize any overheads as well as job execution (assuming more than one job is scheduled by BQ).
>  
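
The approach proposed in the description can be illustrated with a minimal plain-Java sketch. It is not the code from WriteTables.java or from the pull request, and it uses no Beam or BigQuery classes. The class DeferredLoadJobWaiter, the fakeJobService executor, and the 200 ms simulated job time are all hypothetical stand-ins chosen for illustration; only the method names process() and finishBundle() come from the issue text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Plain-Java stand-in for a DoFn (hypothetical; no Beam or BigQuery classes are used). */
class DeferredLoadJobWaiter {
  // Hypothetical stand-in for the BigQuery job service: each "load job" runs on its own thread.
  private final ExecutorService fakeJobService = Executors.newCachedThreadPool();

  // Jobs started during the current bundle that have not been waited on yet.
  private final List<CompletableFuture<String>> pendingJobs = new ArrayList<>();

  /** Mirrors process(): start the load job and return immediately without waiting for it. */
  void process(String table) {
    pendingJobs.add(
        CompletableFuture.supplyAsync(
            () -> {
              try {
                Thread.sleep(200); // assumed scheduling + execution time of one job
              } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
              }
              return table + ":DONE";
            },
            fakeJobService));
  }

  /** Mirrors finishBundle(): block until every job started in this bundle has finished. */
  List<String> finishBundle() {
    List<String> results = new ArrayList<>();
    for (CompletableFuture<String> job : pendingJobs) {
      results.add(job.join()); // rethrows a job failure as CompletionException
    }
    pendingJobs.clear();
    return results;
  }

  /** Releases the simulated job service's threads. */
  void shutdown() {
    fakeJobService.shutdown();
  }
}
```

Calling process() for three tables and then finishBundle() lets the three simulated jobs overlap, so the bundle waits roughly as long as the slowest job rather than the sum of all three. Waiting inside process() instead would serialize them, which is the behaviour the issue describes.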



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)