Posted to issues@beam.apache.org by "Pablo Estrada (Jira)" <ji...@apache.org> on 2019/10/30 16:02:00 UTC

[jira] [Assigned] (BEAM-8012) Perf improvements for Python WriteToBigQuery with Streaming Inserts

     [ https://issues.apache.org/jira/browse/BEAM-8012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pablo Estrada reassigned BEAM-8012:
-----------------------------------

    Assignee:     (was: Tanay Tummalapalli)

> Perf improvements for Python WriteToBigQuery with Streaming Inserts
> -------------------------------------------------------------------
>
>                 Key: BEAM-8012
>                 URL: https://issues.apache.org/jira/browse/BEAM-8012
>             Project: Beam
>          Issue Type: Improvement
>          Components: io-py-gcp
>            Reporter: Pablo Estrada
>            Priority: Major
>
> Users have reported that a pipeline able to process 400 msg/sec/cpu drops to 75 msg/sec/cpu when the WriteToBigQuery sink from the Python SDK is added.
> Some candidates to be optimized:
>  * [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py#L776-L805] - The GetTable method gets called, sometimes very often.
>  * [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery_tools.py#L1017-L1019] - The RowAsDictJsonCoder treats bytes specially, and to do so it first iterates through the whole record.
>  * [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py#L823-L840] - Could the batching strategy for the Writing DoFn be improved?
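
To illustrate the first and third candidates, here is a minimal sketch of a writer that looks up each destination table at most once and buffers rows into per-table batches. It is not the Beam implementation: the class BufferedTableWriter, its get_table and insert_rows callables, and the max_batch_size default are all hypothetical names chosen for the example, and a real DoFn would also need error handling and retries.

```python
from collections import defaultdict


class BufferedTableWriter:
    """Hypothetical stand-in for a streaming-insert DoFn; not Beam's code."""

    def __init__(self, get_table, insert_rows, max_batch_size=500):
        self._get_table = get_table        # expensive lookup, e.g. an API call
        self._insert_rows = insert_rows    # sends one batch of rows to a table
        self._max_batch_size = max_batch_size
        self._known_tables = {}            # destination -> table metadata
        self._buffers = defaultdict(list)  # destination -> pending rows

    def _table(self, destination):
        # Look each destination up at most once per writer instance.
        if destination not in self._known_tables:
            self._known_tables[destination] = self._get_table(destination)
        return self._known_tables[destination]

    def process(self, destination, row):
        self._table(destination)
        buffer = self._buffers[destination]
        buffer.append(row)
        # Send a full batch as soon as the size limit is reached.
        if len(buffer) >= self._max_batch_size:
            self._flush(destination)

    def _flush(self, destination):
        rows = self._buffers.pop(destination, [])
        if rows:
            self._insert_rows(destination, rows)

    def finish(self):
        # Flush whatever is still buffered, e.g. at the end of a bundle.
        for destination in list(self._buffers):
            self._flush(destination)
```

With this shape, the number of table lookups grows with the number of distinct destinations rather than with the number of rows, and the number of insert calls is roughly the row count divided by max_batch_size. Whether either change recovers the reported throughput would need to be measured.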



--
This message was sent by Atlassian Jira
(v8.3.4#803005)