Posted to issues@beam.apache.org by "Oskar Firlej (Jira)" <ji...@apache.org> on 2022/04/29 15:20:00 UTC

[jira] [Created] (BEAM-14383) Improve "FailedRows" errors returned by beam.io.WriteToBigQuery

Oskar Firlej created BEAM-14383:
-----------------------------------

             Summary: Improve "FailedRows" errors returned by beam.io.WriteToBigQuery
                 Key: BEAM-14383
                 URL: https://issues.apache.org/jira/browse/BEAM-14383
             Project: Beam
          Issue Type: Improvement
          Components: io-py-gcp
            Reporter: Oskar Firlej


A `WriteToBigQuery` pipeline step returns `errors` when it tries to insert rows that do not match the BigQuery table schema. `errors` is a dictionary that contains a single `FailedRows` key. `FailedRows` is a list of tuples, where each tuple has two elements: the BigQuery table name and the row that did not match the schema.

This can be verified by running the `BigQueryIO deadletter pattern`: https://beam.apache.org/documentation/patterns/bigqueryio/
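
The current tuple shape can be sketched as follows. This is a simplified illustration, not a runnable Beam pipeline: in a real pipeline the tuples would come from something like `result = rows | beam.io.WriteToBigQuery(...)` followed by `result['FailedRows']` (per the deadletter pattern linked above); here they are simulated as plain data, and the table name and rows are made up for the example.

```python
def format_failed_row(failed):
    """Format one element of FailedRows: today a (table, row) 2-tuple."""
    table, row = failed  # only the table name and the raw row are available
    return "table={} row={}".format(table, row)

# Simulated FailedRows output, shaped like the current 2-tuples.
failed_rows = [
    ("project:dataset.table", {"id": 1, "name": None}),
    ("project:dataset.table", {"id": "not-an-int", "name": "x"}),
]
for f in failed_rows:
    print(format_failed_row(f))
```

Note that nothing in the tuple says *why* the row failed; that information is only visible in the worker logs.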

Using this approach I can print the failed rows in a pipeline. While the job runs, the logger simultaneously prints the reason why the rows were invalid, but that reason is not part of the returned tuple. The reason should be included in the tuple alongside the BigQuery table name and the raw row. That way a downstream pipeline step could process both the invalid row and the reason it is invalid.
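
A hypothetical sketch of the proposed shape: each `FailedRows` element would carry the insert-error reason as a third tuple element. The three-element tuple, the helper name, and the example reason string are all assumptions for illustration, not the current Beam API.

```python
def route_failed_row(failed):
    """Split a proposed (table, row, reason) tuple for downstream handling."""
    table, row, reason = failed  # 'reason' is the proposed third element
    return {"table": table, "row": row, "reason": reason}

# Hypothetical FailedRows element under the proposal.
example = (
    "project:dataset.table",
    {"id": "not-an-int"},
    "Cannot convert value to integer (bad value): not-an-int",
)
print(route_failed_row(example))
```

With this shape, a follow-up transform (for example, one writing to a deadletter table) could persist the failure reason next to the rejected row instead of only seeing it in the logs.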

During my research I found a couple of alternative solutions, but I think they are more complex than they need to be. That's why I explored the Beam source code and found that this would be an easy and simple change.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)