Posted to github@beam.apache.org by "lostluck (via GitHub)" <gi...@apache.org> on 2023/07/14 22:41:03 UTC

[GitHub] [beam] lostluck opened a new issue, #27513: [Bug][Go SDK]: Native bigqueryio package doesn't have scalable writes.

lostluck opened a new issue, #27513:
URL: https://github.com/apache/beam/issues/27513

   ### What happened?
   
   The Go SDK native BigQueryIO package (https://github.com/apache/beam/blob/sdks/v2.48.2/sdks/go/pkg/beam/io/bigqueryio/bigquery.go#L236) doesn't scale writes.
   
   It adds a fixed key to every element and then groups by that key, which serializes all writes to BigQuery onto a single worker. This prevents writing larger datasets to BigQuery.
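   
   To illustrate the bottleneck, here is a minimal, standard-library-only sketch (not the actual `bigqueryio` code). `fixedKey`, `shardKey`, and `groupBy` are hypothetical names, and `groupBy` only simulates what a group-by-key step does. Hashing rows into several shard keys is one possible way to spread the work, shown purely for contrast with the fixed key.
   
   ```go
   package main
   
   import (
   	"fmt"
   	"hash/fnv"
   )
   
   // fixedKey mirrors the behaviour described above: every row gets the same
   // key, so a following group-by-key produces exactly one group.
   func fixedKey(row string) int { return 0 }
   
   // shardKey is a hypothetical alternative: hash each row into one of
   // numShards keys so that groups can be processed by different workers.
   func shardKey(row string, numShards int) int {
   	h := fnv.New32a()
   	h.Write([]byte(row))
   	return int(h.Sum32() % uint32(numShards))
   }
   
   // groupBy simulates a group-by-key step in memory.
   func groupBy(rows []string, key func(string) int) map[int][]string {
   	groups := make(map[int][]string)
   	for _, r := range rows {
   		k := key(r)
   		groups[k] = append(groups[k], r)
   	}
   	return groups
   }
   
   func main() {
   	rows := make([]string, 0, 1000)
   	for i := 0; i < 1000; i++ {
   		rows = append(rows, fmt.Sprintf("row-%d", i))
   	}
   	single := groupBy(rows, fixedKey)
   	sharded := groupBy(rows, func(r string) int { return shardKey(r, 8) })
   	fmt.Println("groups with a fixed key:", len(single))
   	fmt.Println("groups with shard keys:", len(sharded))
   }
   ```
   
   With the fixed key there is always a single group, however many rows there are, so only one worker can do the writing.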
   
   An acceptable fix would be to add a `Load` method that writes all data to files in a temporary directory in GCS, in a format that BigQuery can then be tasked to ingest as a load job.
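   
   As a rough sketch of the file-writing half of that idea, the snippet below serializes rows as newline-delimited JSON, which is one of the formats BigQuery load jobs accept. It uses only the standard library. The `Row` type and `writeNDJSON` are hypothetical, and the `bytes.Buffer` stands in for a writer to a temporary GCS object. Starting the load job itself is not shown.
   
   ```go
   package main
   
   import (
   	"bytes"
   	"encoding/json"
   	"fmt"
   	"io"
   )
   
   // Row is a hypothetical example schema, not a type from the Beam SDK.
   type Row struct {
   	Name  string `json:"name"`
   	Count int    `json:"count"`
   }
   
   // writeNDJSON writes one JSON object per line to w.
   func writeNDJSON(w io.Writer, rows []Row) error {
   	enc := json.NewEncoder(w) // Encode appends a newline after each value.
   	for _, r := range rows {
   		if err := enc.Encode(r); err != nil {
   			return err
   		}
   	}
   	return nil
   }
   
   func main() {
   	var buf bytes.Buffer // Stand-in for a writer to a temporary GCS object.
   	rows := []Row{{Name: "a", Count: 1}, {Name: "b", Count: 2}}
   	if err := writeNDJSON(&buf, rows); err != nil {
   		panic(err)
   	}
   	fmt.Print(buf.String())
   }
   ```
   
   Each worker could write its own file this way, so the writes parallelize, and a single load job could then reference all of the files.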
   
   ---
   
   The Java SDK (and, I believe, the Python SDK) uses complex logic to choose between streaming batch RPCs for the data load and writing to files, but in practice it should be obvious to pipeline authors which case their jobs require for data loads.
   
   Other issues that should be taken care of as well:
   * Not vetted for streaming writes, though it likely has better support in this case due to smaller data sizes during streaming windows.
   * Doesn't retry failed retryable RPCs (though it should be checked whether the client already does this under the hood before adding additional scaffolding).
   * No tests.
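   
   If the client turns out not to retry on its own, the retry item above could be handled with something like the following sketch. It is a generic exponential-backoff helper using only the standard library. `retry`, `errTransient`, and the `retryable` predicate are hypothetical and do not reflect how the BigQuery client classifies errors.
   
   ```go
   package main
   
   import (
   	"errors"
   	"fmt"
   	"time"
   )
   
   // errTransient is a hypothetical stand-in for a retryable RPC error.
   var errTransient = errors.New("transient failure")
   
   // retry calls op up to attempts times, doubling the delay after each
   // retryable failure. Non-retryable errors are returned immediately.
   func retry(attempts int, base time.Duration, retryable func(error) bool, op func() error) error {
   	var err error
   	delay := base
   	for i := 0; i < attempts; i++ {
   		if err = op(); err == nil || !retryable(err) {
   			return err
   		}
   		if i < attempts-1 {
   			time.Sleep(delay)
   			delay *= 2
   		}
   	}
   	return err
   }
   
   func main() {
   	calls := 0
   	isTransient := func(e error) bool { return errors.Is(e, errTransient) }
   	err := retry(5, time.Millisecond, isTransient, func() error {
   		calls++
   		if calls < 3 {
   			return errTransient // Fail the first two calls.
   		}
   		return nil
   	})
   	fmt.Println(calls, err) // prints "3 <nil>"
   }
   ```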
   
   ### Issue Priority
   
   Priority: 3 (minor)
   
   ### Issue Components
   
   - [ ] Component: Python SDK
   - [ ] Component: Java SDK
   - [X] Component: Go SDK
   - [ ] Component: Typescript SDK
   - [ ] Component: IO connector
   - [ ] Component: Beam examples
   - [ ] Component: Beam playground
   - [ ] Component: Beam katas
   - [ ] Component: Website
   - [ ] Component: Spark Runner
   - [ ] Component: Flink Runner
   - [ ] Component: Samza Runner
   - [ ] Component: Twister2 Runner
   - [ ] Component: Hazelcast Jet Runner
   - [ ] Component: Google Cloud Dataflow Runner


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org