Posted to issues@beam.apache.org by "Chamikara Madhusanka Jayalath (Jira)" <ji...@apache.org> on 2022/04/20 00:47:00 UTC

[jira] [Commented] (BEAM-14146) Python Streaming job failing to drain with BigQueryIO write errors

    [ https://issues.apache.org/jira/browse/BEAM-14146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17524645#comment-17524645 ] 

Chamikara Madhusanka Jayalath commented on BEAM-14146:
------------------------------------------------------

I'm not sure whether "schemaUpdateOptions" can be correctly supported for streaming pipelines, where we have to repeatedly trigger multiple load jobs.
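
For context, here is a minimal sketch (using the google-cloud-bigquery client directly; the bucket, project, and table names are hypothetical) of what "schemaUpdateOptions" means on a single load job. In the FILE_LOADS path the sink issues a fresh load job per triggering_frequency window, so every one of those repeated jobs would need to carry these options consistently:

{code:python}
from google.cloud import bigquery

client = bigquery.Client()

# schemaUpdateOptions is a per-load-job setting: it lets this particular
# job add new columns or relax REQUIRED columns on the destination table.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    schema_update_options=[
        bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION,
        bigquery.SchemaUpdateOption.ALLOW_FIELD_RELAXATION,
    ],
)

load_job = client.load_table_from_uri(
    "gs://example-bucket/staged/*.json",  # hypothetical staging location
    "example-project.example_dataset.example_table",  # hypothetical table
    job_config=job_config,
)
load_job.result()  # each streaming trigger would repeat a job like this
{code}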

Assigning to Pablo to look into this and do one of the following:
(1) Make sure it's supported correctly and add appropriate tests.
(2) Document that this is not supported for streaming and fail pipelines early (e.g. as sketched below).
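
A minimal sketch of the fail-early option, (2); this is illustrative only and not the actual Beam validation code (the function and parameter names here are hypothetical):

{code:python}
def validate_schema_update_options(is_streaming, additional_bq_parameters):
  """Hypothetical construction-time check for WriteToBigQuery."""
  options = (additional_bq_parameters or {}).get("schemaUpdateOptions")
  if is_streaming and options:
    raise ValueError(
        "schemaUpdateOptions %r is not supported when writing with "
        "FILE_LOADS from a streaming pipeline; remove the option or run "
        "the pipeline in batch mode." % (options,))
{code}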

> Python Streaming job failing to drain with BigQueryIO write errors
> ------------------------------------------------------------------
>
>                 Key: BEAM-14146
>                 URL: https://issues.apache.org/jira/browse/BEAM-14146
>             Project: Beam
>          Issue Type: Bug
>          Components: io-py-gcp, sdk-py-core
>    Affects Versions: 2.37.0
>            Reporter: Rahul Iyer
>            Priority: P2
>
> We have a Python Streaming Dataflow job that writes to BigQuery using the {{FILE_LOADS}} method and {{auto_sharding}} enabled. When we try to drain the job it fails with the following error,
> {code:python}
> "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/bigquery_tools.py", line 1000, in perform_load_job ValueError: Either a non-empty list of fully-qualified source URIs must be provided via the source_uris parameter or an open file object must be provided via the source_stream parameter.
> {code}
> Our {{WriteToBigQuery}} configuration,
> {code:python}
> beam.io.WriteToBigQuery(
>   table=options.output_table,
>   schema=bq_schema,
>   create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
>   write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
>   insert_retry_strategy=RetryStrategy.RETRY_ON_TRANSIENT_ERROR,
>   method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
>   additional_bq_parameters={
>     "timePartitioning": {
>       "type": "HOUR",
>       "field": "bq_insert_timestamp",
>     },
>     "schemaUpdateOptions": ["ALLOW_FIELD_ADDITION", "ALLOW_FIELD_RELAXATION"],
>   },
>   triggering_frequency=120,
>   with_auto_sharding=True,
> )
> {code}
> We are also noticing that the job only fails to drain when there are actual schema updates. If there are no schema updates, the job drains without the above error.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)