Posted to commits@beam.apache.org by "Daniel Halperin (JIRA)" <ji...@apache.org> on 2016/06/15 06:30:09 UTC

[jira] [Resolved] (BEAM-48) BigQueryIO.Read reimplemented as BoundedSource

     [ https://issues.apache.org/jira/browse/BEAM-48?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Halperin resolved BEAM-48.
---------------------------------
    Resolution: Fixed

> BigQueryIO.Read reimplemented as BoundedSource
> ----------------------------------------------
>
>                 Key: BEAM-48
>                 URL: https://issues.apache.org/jira/browse/BEAM-48
>             Project: Beam
>          Issue Type: New Feature
>          Components: sdk-java-gcp
>            Reporter: Daniel Halperin
>            Assignee: Pei He
>
> BigQueryIO.Read is currently implemented in a hacky way: the DirectPipelineRunner streams all rows in the table or query result directly using the JSON API, in a single-threaded manner.
> In contrast, the DataflowPipelineRunner uses an entirely different code path implemented in the Google Cloud Dataflow service. (A BigQuery export job to GCS, followed by a parallel read from GCS).
> We need to reimplement BigQueryIO as a BoundedSource in order to support other runners in a scalable way.
> I additionally suggest that we revisit the design of the BigQueryIO source in the process. A short list:
> * Do not use TableRow as the default value for rows. It could be Map<String, Object> with well-defined types, for example, or an Avro GenericRecord. Dropping TableRow will get around a variety of issues with types, fields named 'f', etc., and it will also reduce confusion as we use TableRow objects differently than usual (for good reason).
> * We could also directly add support for a RowParser that converts rows into a user's POJO.
> * We should expose TableSchema as a side output from the BigQueryIO.Read.
> * Our builders for BigQueryIO.Read are useful and we should keep them. Where possible we should also allow users to provide the JSON objects that configure the underlying intermediate tables, query export, etc. This would let users directly control result flattening, location of intermediate tables, table decorators, etc., and also optimistically let users take advantage of some new BigQuery features without code changes.
> * We could switch between a BigQuery export + parallel scan and a direct API read based on factors such as the size of the table at pipeline construction time.
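
The RowParser suggestion above could be sketched roughly as follows. This is a minimal, dependency-free illustration only; the `RowParser` interface and the `User` POJO are hypothetical names, not part of any Beam API, and a real design would plug such a parser into the BoundedSource's reader:

```java
import java.util.HashMap;
import java.util.Map;

public class RowParserSketch {

    /**
     * Hypothetical parser interface: converts a generic row
     * (Map with well-defined value types, as proposed above)
     * into a user-supplied POJO of type T.
     */
    interface RowParser<T> {
        T parse(Map<String, Object> row);
    }

    /** Example user POJO. */
    static class User {
        final String name;
        final long age;

        User(String name, long age) {
            this.name = name;
            this.age = age;
        }
    }

    public static void main(String[] args) {
        // The user supplies the mapping from row fields to their POJO.
        RowParser<User> parser = row ->
                new User((String) row.get("name"), (Long) row.get("age"));

        // A row as it might arrive from the source: plain Map, no TableRow.
        Map<String, Object> row = new HashMap<>();
        row.put("name", "Ada");
        row.put("age", 36L);

        User u = parser.parse(row);
        System.out.println(u.name + " " + u.age);
    }
}
```

Because the row is a plain `Map<String, Object>` rather than a `TableRow`, the type-coercion quirks mentioned above (fields named 'f', JSON-driven typing) never reach user code; only the parser lambda needs to know the schema.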



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)