Posted to issues@beam.apache.org by "Pablo Estrada (Jira)" <ji...@apache.org> on 2021/09/28 23:23:00 UTC
[jira] [Updated] (BEAM-8012) Perf improvements for Python WriteToBigQuery with Streaming Inserts
[ https://issues.apache.org/jira/browse/BEAM-8012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Pablo Estrada updated BEAM-8012:
--------------------------------
Resolution: Fixed
Status: Resolved (was: Open)
> Perf improvements for Python WriteToBigQuery with Streaming Inserts
> -------------------------------------------------------------------
>
> Key: BEAM-8012
> URL: https://issues.apache.org/jira/browse/BEAM-8012
> Project: Beam
> Issue Type: Improvement
> Components: io-py-gcp
> Reporter: Pablo Estrada
> Priority: P3
>
> Users have reported that a pipeline able to process 400 msg/sec/cpu drops to 75 msg/sec/cpu when the WriteToBigQuery sink from the Python SDK is added.
> Some candidates to be optimized:
> * [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py#L776-L805] - The GetTable method gets called, sometimes very often.
> * [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery_tools.py#L1017-L1019] - The RowAsDictJsonCoder applies special handling to bytes values, and to do so it iterates through the whole record first.
> * [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py#L823-L840] - The batching strategy for the writing DoFn might be improvable.
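
The three candidates above could be approached roughly as in the sketch below. This is not the Beam implementation or the fix that resolved this issue. Every class and function name here (CachedTableLookup, RowBatcher, encode_row) is hypothetical, the remote calls are replaced by plain callables, and base64 encoding of bytes is an assumption made for illustration. It only shows the general ideas: remember table lookups instead of repeating them, let the JSON encoder handle bytes on demand instead of walking each record first, and buffer rows per destination before sending.

```python
import base64
import json
from collections import defaultdict


class CachedTableLookup:
    """Hypothetical wrapper: fetch each table's metadata at most once."""

    def __init__(self, fetch_table):
        self._fetch_table = fetch_table  # stand-in for a remote GetTable-style call
        self._cache = {}
        self.remote_calls = 0

    def get(self, table_ref):
        if table_ref not in self._cache:
            self.remote_calls += 1
            self._cache[table_ref] = self._fetch_table(table_ref)
        return self._cache[table_ref]


def _encode_bytes(value):
    # json.dumps calls this hook only for values it cannot serialize itself,
    # so records without bytes never pay for an extra pass.
    if isinstance(value, bytes):
        return base64.b64encode(value).decode("ascii")  # assumed encoding
    raise TypeError(f"Cannot serialize {type(value).__name__}")


def encode_row(row):
    """Serialize a row dict to JSON without pre-scanning it for bytes."""
    return json.dumps(row, default=_encode_bytes)


class RowBatcher:
    """Hypothetical buffer: collect rows per table, flush when a batch is full."""

    def __init__(self, flush_fn, max_batch_size=500):
        self._flush_fn = flush_fn  # stand-in for a streaming-insert request
        self._max_batch_size = max_batch_size
        self._buffers = defaultdict(list)

    def add(self, table_ref, row):
        buffer = self._buffers[table_ref]
        buffer.append(row)
        if len(buffer) >= self._max_batch_size:
            self._flush(table_ref)

    def flush_all(self):
        for table_ref in list(self._buffers):
            self._flush(table_ref)

    def _flush(self, table_ref):
        rows = self._buffers.pop(table_ref, [])
        if rows:
            self._flush_fn(table_ref, rows)


# Small demonstration with in-memory stand-ins for the remote service.
lookups = CachedTableLookup(lambda ref: {"table": ref})
sent = []
batcher = RowBatcher(lambda ref, rows: sent.append((ref, len(rows))), max_batch_size=3)
for i in range(7):
    lookups.get("project:dataset.table")
    batcher.add("project:dataset.table", encode_row({"id": i, "blob": b"hi"}))
batcher.flush_all()
```

In this demonstration, seven rows for one destination trigger a single lookup and three flushes. A real DoFn would also need to flush at bundle boundaries and respect the service's request size limits, which this sketch does not model.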
--
This message was sent by Atlassian Jira
(v8.3.4#803005)