Posted to commits@beam.apache.org by "Daniel Halperin (JIRA)" <ji...@apache.org> on 2016/03/01 20:31:18 UTC

[jira] [Commented] (BEAM-50) BigQueryIO.Write: reimplement in Java

    [ https://issues.apache.org/jira/browse/BEAM-50?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15174264#comment-15174264 ] 

Daniel Halperin commented on BEAM-50:
-------------------------------------

This BEAM issue subsumes https://github.com/GoogleCloudPlatform/DataflowJavaSDK/issues/91

> BigQueryIO.Write: reimplement in Java
> -------------------------------------
>
>                 Key: BEAM-50
>                 URL: https://issues.apache.org/jira/browse/BEAM-50
>             Project: Beam
>          Issue Type: New Feature
>          Components: sdk-java-gcp
>            Reporter: Daniel Halperin
>            Priority: Minor
>
> BigQueryIO.Write is currently implemented in a somewhat hacky way.
> Unbounded sink:
> * The DirectPipelineRunner and the DataflowPipelineRunner use StreamingWriteFn and BigQueryTableInserter to insert rows using BigQuery's streaming writes API.
> Bounded sink:
> * The DirectPipelineRunner still uses streaming writes.
> * The DataflowPipelineRunner uses a different code path in the Google Cloud Dataflow service that writes to GCS and then initiates a BigQuery load job.
> * Per-window table destinations do not work scalably. (See Beam-XXX).
> We need to reimplement BigQueryIO.Write fully in Java code in order to support other runners in a scalable way.
> I additionally suggest that we revisit the design of the BigQueryIO sink in the process. A short list:
> * Do not use TableRow as the default value for rows. It could be, for example, Map<String, Object> with well-defined types, or an Avro GenericRecord (see the first sketch after the quoted description). Dropping TableRow would get around a variety of issues with types, fields named 'f', etc., and it would also reduce confusion, since we use TableRow objects differently than usual (for good reason).
> * Possibly support not knowing the schema until pipeline execution time (see the second sketch below).
> * Our builders for BigQueryIO.Write are useful and we should keep them. Where possible, we should also allow users to provide the JSON objects that configure the underlying table creation, write disposition, etc. This would let users directly control things like table expiration time and table location, and it would also let users take advantage of some new BigQuery features without code changes (see the third sketch below).
> * We could choose between the streaming write API and load jobs based on user preference or dynamic job properties (see the fourth sketch below). We could use streaming writes in a batch pipeline if the data is small, and load jobs in a streaming pipeline if the windows are large enough to make this practical.
> * When issuing BigQuery load jobs, we could leave files in GCS if the import fails, so that data errors can be debugged.
> * We should make per-window table writes scalable in batch.
> Caveat, possibly blocker:
> * (Beam-XXX): cleanup and temp file management. One advantage of the Google Cloud Dataflow implementation of BigQueryIO.Write is cleanup: we ensure that intermediate files are deleted when bundles or jobs fail, etc. Beam does not currently support this.
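
The sketches below are not part of the issue; they are rough, hypothetical illustrations of some of the suggestions above (all names are invented unless stated otherwise). First, rows carried as a plain Map<String, Object> with well-defined Java types per column, instead of TableRow:

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Hypothetical example: a row as a plain Map with a well-defined Java type
    // per BigQuery column, rather than a
    // com.google.api.services.bigquery.model.TableRow.
    public class MapRowExample {
      public static Map<String, Object> exampleRow() {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("user_id", 42L);     // INTEGER
        row.put("name", "Ada");      // STRING
        row.put("score", 3.14);      // FLOAT
        row.put("active", true);     // BOOLEAN
        return row;
      }
    }

An Avro GenericRecord would serve the same purpose while also carrying the schema alongside the data.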
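
Second, a sketch of deferring the schema to pipeline execution time; SchemaSupplier is a hypothetical name, not existing API:

    import java.io.Serializable;
    import com.google.api.services.bigquery.model.TableSchema;

    // Hypothetical interface: the sink would call get() at execution time
    // (for example on a worker) instead of requiring the schema at
    // pipeline-construction time.
    public interface SchemaSupplier extends Serializable {
      TableSchema get();
    }

A builder hook, say a hypothetical withSchemaSupplier(SchemaSupplier), could then accept it alongside the existing options.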
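
Third, a sketch of the builder idea: keep the existing builder style but add a passthrough for the raw JSON table configuration. The interface and the withTableConfigurationJson method are invented here only to illustrate the shape:

    // Hypothetical builder surface, in the spirit of the existing
    // BigQueryIO.Write builders; withTableConfigurationJson is invented.
    public interface WriteBuilder {
      WriteBuilder to(String tableSpec);
      WriteBuilder withCreateDisposition(String createDisposition);
      WriteBuilder withWriteDisposition(String writeDisposition);

      // Raw JSON merged into the tables.insert request body, giving users
      // direct control over fields like expirationTime without SDK changes.
      WriteBuilder withTableConfigurationJson(String tableJson);
    }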
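
Fourth, a sketch of choosing between streaming writes and load jobs from a user preference plus simple job properties; the enum, method, and threshold are all hypothetical placeholders:

    // Hypothetical selection between the two insertion paths; none of these
    // names exist in the SDK, and the size threshold is arbitrary.
    public class InsertMethodChooser {
      public enum Method { DEFAULT, STREAMING_INSERTS, LOAD_JOBS }

      public static Method choose(Method userPreference, boolean boundedInput,
          long estimatedInputBytes) {
        if (userPreference != Method.DEFAULT) {
          return userPreference;  // honor an explicit user choice
        }
        // Heuristic: stream small bounded inputs; otherwise prefer load jobs,
        // the scalable path for large batches and sufficiently large windows.
        return (boundedInput && estimatedInputBytes < 100L * 1024 * 1024)
            ? Method.STREAMING_INSERTS
            : Method.LOAD_JOBS;
      }
    }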



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)