Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/09/01 10:07:09 UTC
[GitHub] [beam] cozos opened a new issue, #22986: [Bug]: WriteToBigquery Deadletter pattern does not work with FILE_LOADS method
cozos opened a new issue, #22986:
URL: https://github.com/apache/beam/issues/22986
### What happened?
The BigQuery deadletter pattern (https://beam.apache.org/documentation/patterns/bigqueryio/) does not work with `WriteToBigQuery` when the `FILE_LOADS` method is used. The `insert_retry_strategy` parameter is not passed to [BigQueryBatchFileLoads](https://github.com/apache/beam/blob/b8ca0819529e0bafaae0c08abec7c4e5682d6b50/sdks/python/apache_beam/io/gcp/bigquery.py#L2363-L2380), and the output does not have a `'FailedRows'` tag.
The deadletter pattern is also important in batch systems, where it prevents a single malformed row from failing the entire job.
### Issue Priority
Priority: 2
### Issue Component
Component: io-py-gcp
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@beam.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
cozos commented on issue #22986:
URL: https://github.com/apache/beam/issues/22986#issuecomment-1235500077
What do you think about doing something like:
* Add a `max_failed_rows` parameter to `WriteToBigQuery`
* Catch bad-row exceptions when writing to the temp Avro file
* If the temp file format is JSON, pass `max_failed_rows` through to the LoadJob config
* Emit bad rows under a `FailedRows` output tag
If this is too much, are there any workarounds for users?
brucearctor commented on issue #22986:
URL: https://github.com/apache/beam/issues/22986#issuecomment-1234914908
See this old thread, in the event finding/keeping those bad records is important: https://stackoverflow.com/questions/31904142/how-do-we-set-maximum-bad-records-when-loading-a-bigquery-table-from-dataflow
Alternately ( since I think you prefer python ), see: https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.job.LoadJobConfig.html#google.cloud.bigquery.job.LoadJobConfig.max_bad_records ( just set that number to be larger than the number of rows you are OK being 'bad' and the job will insert all the good rows and succeed ).
cozos commented on issue #22986:
URL: https://github.com/apache/beam/issues/22986#issuecomment-1336681555
To summarize what is happening:
- I am calling `WriteToBigQuery` in a batch job (therefore `FILE_LOADS` strategy)
- I am using Avro as the temp file format for performance reasons; Avro is the preferred format for BQ load jobs per the [GCP docs](https://cloud.google.com/bigquery/docs/batch-loading-data)
- I am also setting `insert_retry_strategy='RETRY_ON_TRANSIENT_ERROR'`
Expected behavior:
When writing an element to BigQuery, failed rows (e.g. from a schema mismatch) should appear under `FailedRows` in the result tuple, which can then be used in a DLQ pattern such as the one [documented here](https://beam.apache.org/documentation/patterns/bigqueryio/).
Actual behavior:
For schema-mismatch errors specifically, the job simply fails inside `fastavro.write` when writing records that do not match the schema.
johnjcasey commented on issue #22986:
URL: https://github.com/apache/beam/issues/22986#issuecomment-1337634978
This is also a general issue in the Python BigQuery IO: the different write strategies share the same entry point, but have different processing, result types, and configuration parameters.
brucearctor commented on issue #22986:
URL: https://github.com/apache/beam/issues/22986#issuecomment-1235721147
and @cozos -- put the link in here for reference :-) and so I know where to find it and comment there. Thanks!
brucearctor commented on issue #22986:
URL: https://github.com/apache/beam/issues/22986#issuecomment-1236054212
There are currently two separate implementations of BigQuery IO: one in the Java SDK and one in the Python SDK. The Python implementation predates the completion of the portability framework. That doesn't totally answer your question.
cozos commented on issue #22986:
URL: https://github.com/apache/beam/issues/22986#issuecomment-1236053383
So am I understanding this right?
- The current `WriteToBigQuery` is a Python SDK specific transform
- In order to use the portability framework so that transforms can be shared across SDKs, we should implement a cross-language wrapper for the Java `WriteToBigQuery` transform.
cozos commented on issue #22986:
URL: https://github.com/apache/beam/issues/22986#issuecomment-1235486942
It's failing here, in the `fastavro.write` call: https://github.com/apache/beam/blob/b8ca0819529e0bafaae0c08abec7c4e5682d6b50/sdks/python/apache_beam/io/gcp/bigquery_file_loads.py#L258
As for the answer for that TODO:
```
# TODO(pabloem): Is it possible for this to throw exception?
writer.write(row)
```
Indeed it is possible.
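A minimal sketch of what answering that TODO could look like: catch per-row serialization failures and collect them rather than crashing the whole bundle. The helper name and exception types here are hypothetical; a real fix would route the failures to a tagged `'FailedRows'` output:

```python
def write_rows(writer, rows):
    """Write each row via writer.write(row), collecting rows that fail.

    `writer` is anything with a write(row) method that may raise on
    schema-mismatched rows, as fastavro's writer does. Failed rows are
    returned with their error message instead of failing the bundle.
    """
    failed = []
    for row in rows:
        try:
            writer.write(row)
        except (TypeError, ValueError) as exc:  # assumed error types
            failed.append((row, str(exc)))
    return failed
```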
kennknowles commented on issue #22986:
URL: https://github.com/apache/beam/issues/22986#issuecomment-1335959048
Pinging @chamikaramj and @johnjcasey for questions around xlang and dead letter
cozos commented on issue #22986:
URL: https://github.com/apache/beam/issues/22986#issuecomment-1235481552
Hi @brucearctor, thanks for the suggestion. Upon further investigation, I've found that in my case (using `FileFormat.AVRO`), the `WriteToBigQuery`/`BigQueryBatchFileLoads` transform fails while writing the temporary file (typically due to a schema mismatch).
So, as you mentioned, `max_bad_records` is a workaround for `FileFormat.JSON`, but with the Avro format the pipeline unfortunately never even reaches the `LoadJob` step.
cozos commented on issue #22986:
URL: https://github.com/apache/beam/issues/22986#issuecomment-1235505805
(I'm willing to work on this change if you agree with the direction. Not sure if this change needs to be applied for all SDKs though - what about the portability framework thing?)
cozos commented on issue #22986:
URL: https://github.com/apache/beam/issues/22986#issuecomment-1235858888
Sure, will do.
brucearctor commented on issue #22986:
URL: https://github.com/apache/beam/issues/22986#issuecomment-1234909480
> The deadletter pattern is also important in batch systems - to prevent a single malformed row from failing the entire job.
Had you seen:
https://cloud.google.com/java/docs/reference/google-cloud-bigquery/latest/com.google.cloud.bigquery.LoadJobConfiguration.Builder#com_google_cloud_bigquery_LoadJobConfiguration_Builder_setMaxBadRecords_java_lang_Integer_
or ( since I think you prefer python ) -->
https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.job.LoadJobConfig.html#google.cloud.bigquery.job.LoadJobConfig.max_bad_records
@cozos -- with a high enough value in that config ( larger than the number of bad records ) the job will succeed. It sounds like that is what you need, based on the quote.
As it relates to using FILE_LOADS: I *think* the BQ API that FILE_LOADS calls would not support that functionality.
brucearctor commented on issue #22986:
URL: https://github.com/apache/beam/issues/22986#issuecomment-1235671298
Can you put this in a StackOverflow question ( a summary of at least the initial question ), and I can reply there? This feels like something we can debug and work around without needing to change the codebase [ maybe we'll arrive at that, but let's see ]. StackOverflow is a great resource, as it is well indexed and others are likely to find it when searching for similar keywords.
brucearctor commented on issue #22986:
URL: https://github.com/apache/beam/issues/22986#issuecomment-1236049048
There might be room to extend the Python Beam BigQuery IO. But let's first see/experiment whether this can be solved without changing the codebase.
We'd also want to think about whether to keep extending the Python BigQuery IO or instead write a cross-language wrapper around the Java BigQuery IO. If we need additional functionality, we can work out the desired approach with the community.
cozos commented on issue #22986:
URL: https://github.com/apache/beam/issues/22986#issuecomment-1236047485
@brucearctor here is a Stackoverflow question: https://stackoverflow.com/questions/73589662/in-apache-beam-dataflows-writetobigquery-transform-how-do-you-enable-the-deadl
cozos commented on issue #22986:
URL: https://github.com/apache/beam/issues/22986#issuecomment-1235493876
I've tried to prevalidate my PCollection with [fastavro.validate](https://github.com/fastavro/fastavro/blob/073da50afaec729f66caea357538417118ae7619/fastavro/_validation.pyx#L301), but it turns out to be very expensive - about 15x the CPU hours in my tests. I think this is because `fastavro.writer` validates the schema efficiently via static typing, whereas `fastavro.validate` does explicit type checking (i.e. `isinstance`).