Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/09/01 10:07:09 UTC

[GitHub] [beam] cozos opened a new issue, #22986: [Bug]: WriteToBigquery Deadletter pattern does not work with FILE_LOADS method

cozos opened a new issue, #22986:
URL: https://github.com/apache/beam/issues/22986

   ### What happened?
   
   The BigQuery deadletter pattern (https://beam.apache.org/documentation/patterns/bigqueryio/) does not work with `WriteToBigQuery` when the `FILE_LOADS` method is used. The `insert_retry_strategy` parameter is not passed to [BigQueryBatchFileLoads](https://github.com/apache/beam/blob/b8ca0819529e0bafaae0c08abec7c4e5682d6b50/sdks/python/apache_beam/io/gcp/bigquery.py#L2363-L2380), and the output does not have a `'FailedRows'` tag.
   
   The deadletter pattern is also important in batch systems, to prevent a single malformed row from failing the entire job.
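   The idea can be sketched framework-free, for illustration only (all names here are illustrative, not Beam's actual internals): a write step that diverts failing rows, with their errors, to a `FailedRows`-style side output instead of failing the whole job.

   ```python
   # Framework-free sketch of the deadletter idea (illustrative names).
   def write_with_deadletter(rows, write_fn):
       """Apply write_fn to each row; collect failures instead of raising."""
       result = {"written": [], "FailedRows": []}
       for row in rows:
           try:
               write_fn(row)
               result["written"].append(row)
           except Exception as err:  # in Beam this would become a tagged output
               result["FailedRows"].append((row, str(err)))
       return result

   def strict_write(row):
       # Stand-in for a schema-checked write such as an Avro writer.
       if not isinstance(row.get("id"), int):
           raise TypeError("field 'id' must be an int")

   out = write_with_deadletter([{"id": 1}, {"id": "bad"}, {"id": 2}], strict_write)
   # out["written"] keeps the good rows; out["FailedRows"] holds the rest
   ```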
   
   ### Issue Priority
   
   Priority: 2
   
   ### Issue Component
   
   Component: io-py-gcp


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] cozos commented on issue #22986: [Bug]: WriteToBigquery Deadletter pattern does not work with FILE_LOADS method

cozos commented on issue #22986:
URL: https://github.com/apache/beam/issues/22986#issuecomment-1235500077

   What do you think about doing something like:
   
   * Add a `max_failed_rows` parameter to `WriteToBigQuery`
   * Catch bad-row exceptions when writing to the temp Avro file
   * If the temp file format is JSON, pass `max_failed_rows` through to the LoadJob config
   * Put the bad rows in a `FailedRows` output tag
   
   If this is too much, are there any workarounds for users?


[GitHub] [beam] brucearctor commented on issue #22986: [Bug]: WriteToBigquery Deadletter pattern does not work with FILE_LOADS method

brucearctor commented on issue #22986:
URL: https://github.com/apache/beam/issues/22986#issuecomment-1234914908

   See this old thread, in case finding/keeping those bad records is important: https://stackoverflow.com/questions/31904142/how-do-we-set-maximum-bad-records-when-loading-a-bigquery-table-from-dataflow
   
   Alternately (since I think you prefer Python), see: https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.job.LoadJobConfig.html#google.cloud.bigquery.job.LoadJobConfig.max_bad_records (just set that number larger than the number of rows you are OK with being 'bad', and the job will insert all the good rows and succeed).
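   For reference, a hedged sketch of what `max_bad_records` maps onto at the REST level: the `maxBadRecords` field of the BigQuery Jobs API load configuration (applies to CSV/JSON loads). The table, dataset, and URI values below are made up.

   ```python
   # Sketch of a BigQuery load job configuration payload (illustrative values).
   def build_load_config(source_uris, project, dataset, table, max_bad_records=0):
       return {
           "configuration": {
               "load": {
                   "sourceUris": source_uris,
                   "destinationTable": {
                       "projectId": project,
                       "datasetId": dataset,
                       "tableId": table,
                   },
                   "sourceFormat": "NEWLINE_DELIMITED_JSON",
                   # Up to this many rows may fail to parse without failing
                   # the job; the good rows are still inserted.
                   "maxBadRecords": max_bad_records,
               }
           }
       }

   cfg = build_load_config(
       ["gs://my-bucket/rows-*.json"], "my-project", "my_dataset", "my_table",
       max_bad_records=100)
   ```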


[GitHub] [beam] cozos commented on issue #22986: [Bug]: WriteToBigquery Deadletter pattern does not work with FILE_LOADS method

cozos commented on issue #22986:
URL: https://github.com/apache/beam/issues/22986#issuecomment-1336681555

   To summarize what is happening:
   
   - I am calling `WriteToBigQuery` in a batch job (therefore the `FILE_LOADS` strategy)
   - I am using Avro as the temp file format for performance reasons; Avro is the preferred format for BQ load jobs per the [GCP docs](https://cloud.google.com/bigquery/docs/batch-loading-data)
   - I am also setting `insert_retry_strategy='RETRY_ON_TRANSIENT_ERROR'`
   
   Expected behavior:
   When writing an element to BigQuery, failed rows (such as those with a schema mismatch) should appear under the `FailedRows` tag in the result tuple, which can then be used in a DLQ pattern as described [here](https://beam.apache.org/documentation/patterns/bigqueryio/).
   
   Actual behavior:
   For schema mismatch errors specifically, the job simply fails in `fastavro.write` when writing records with a bad schema.
   


[GitHub] [beam] johnjcasey commented on issue #22986: [Bug]: WriteToBigquery Deadletter pattern does not work with FILE_LOADS method

johnjcasey commented on issue #22986:
URL: https://github.com/apache/beam/issues/22986#issuecomment-1337634978

   This is also a general issue in BQIO for Python, where different strategies share the same entry point but have different processing, result types, and configuration parameters.
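   A toy illustration of that entry-point problem (the result shapes are simplified stand-ins, not Beam's actual return types): because the result shape depends on the chosen method, a dead-letter consumer written for one method breaks under the other.

   ```python
   # Simplified stand-in for a single entry point whose result shape
   # depends on the write method chosen.
   def write_to_bigquery(rows, method):
       if method == "STREAMING_INSERTS":
           # Streaming path: failed rows are tagged and retrievable.
           return {"FailedRows": [r for r in rows if r.get("bad")]}
       if method == "FILE_LOADS":
           # File-loads path: no 'FailedRows' tag is produced today.
           return {"destination_load_jobid_pairs": []}
       raise ValueError(method)

   rows = [{"id": 1}, {"id": 2, "bad": True}]
   streaming = write_to_bigquery(rows, "STREAMING_INSERTS")
   loads = write_to_bigquery(rows, "FILE_LOADS")
   # streaming has a 'FailedRows' key; loads does not
   ```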


[GitHub] [beam] brucearctor commented on issue #22986: [Bug]: WriteToBigquery Deadletter pattern does not work with FILE_LOADS method

brucearctor commented on issue #22986:
URL: https://github.com/apache/beam/issues/22986#issuecomment-1235721147

   And @cozos -- put the link in here for reference :-) so I know where to find it and can comment there. Thanks!


[GitHub] [beam] brucearctor commented on issue #22986: [Bug]: WriteToBigquery Deadletter pattern does not work with FILE_LOADS method

brucearctor commented on issue #22986:
URL: https://github.com/apache/beam/issues/22986#issuecomment-1236054212

   There are currently two separate implementations of BigQueryIO: one in the Java SDK and one in the Python SDK. The Python implementation existed before the portability framework was finished. That doesn't totally answer your question, though.


[GitHub] [beam] cozos commented on issue #22986: [Bug]: WriteToBigquery Deadletter pattern does not work with FILE_LOADS method

cozos commented on issue #22986:
URL: https://github.com/apache/beam/issues/22986#issuecomment-1236053383

   So am I understanding this right?
   
   - The current `WriteToBigQuery` is a Python SDK specific transform
   - In order to use the portability framework so that transforms can be shared across SDKs, we should implement a cross-language wrapper for the Java `WriteToBigQuery` transform.


[GitHub] [beam] cozos commented on issue #22986: [Bug]: WriteToBigquery Deadletter pattern does not work with FILE_LOADS method

cozos commented on issue #22986:
URL: https://github.com/apache/beam/issues/22986#issuecomment-1235486942

   It's failing here in `fastavro.write`: https://github.com/apache/beam/blob/b8ca0819529e0bafaae0c08abec7c4e5682d6b50/sdks/python/apache_beam/io/gcp/bigquery_file_loads.py#L258
   
   As for the answer to that TODO:
   
   ```python
   # TODO(pabloem): Is it possible for this to throw exception?
   writer.write(row)
   ```
   
   Indeed it is possible.
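   One possible shape for handling that, sketched with illustrative names (`FAILED_ROWS_TAG` and `emit` are not Beam's actual internals): wrap the per-row write so a serialization error becomes a dead-letter record instead of failing the bundle.

   ```python
   # Illustrative sketch: divert rows whose write raises to a failed tag.
   FAILED_ROWS_TAG = "FailedRows"

   def write_rows(rows, writer_write, emit):
       """Write each row; rows whose write raises go to the failed tag."""
       written = 0
       for row in rows:
           try:
               writer_write(row)  # e.g. the Avro writer's write(row)
               written += 1
           except (TypeError, ValueError) as err:
               emit(FAILED_ROWS_TAG, (row, repr(err)))
       return written

   # Usage with a stand-in writer that rejects non-int ids:
   failed = []
   def fake_write(row):
       if not isinstance(row["id"], int):
           raise TypeError("id must be int")

   n = write_rows([{"id": 1}, {"id": "x"}], fake_write,
                  lambda tag, rec: failed.append(rec))
   # n == 1 and failed holds the rejected row with its error
   ```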


[GitHub] [beam] kennknowles commented on issue #22986: [Bug]: WriteToBigquery Deadletter pattern does not work with FILE_LOADS method

kennknowles commented on issue #22986:
URL: https://github.com/apache/beam/issues/22986#issuecomment-1335959048

   Pinging @chamikaramj and @johnjcasey for questions around xlang and dead letter 


[GitHub] [beam] cozos commented on issue #22986: [Bug]: WriteToBigquery Deadletter pattern does not work with FILE_LOADS method

cozos commented on issue #22986:
URL: https://github.com/apache/beam/issues/22986#issuecomment-1235481552

   Hi @brucearctor, thanks for the suggestion. Upon further investigation, I've found that in my case (using `FileFormat.AVRO`), the `WriteToBigQuery`/`BigQueryBatchFileLoads` transform fails while writing the temporary file (typically due to schema mismatch).
   
   So, as you mentioned, `max_bad_records` is a workaround for `FileFormat.JSON`, but with Avro the job unfortunately never even reaches the `LoadJob` step.


[GitHub] [beam] cozos commented on issue #22986: [Bug]: WriteToBigquery Deadletter pattern does not work with FILE_LOADS method

cozos commented on issue #22986:
URL: https://github.com/apache/beam/issues/22986#issuecomment-1235505805

   (I'm willing to work on this change if you agree with the direction. Not sure if this change needs to be applied to all SDKs though - what about the portability framework thing?)


[GitHub] [beam] cozos commented on issue #22986: [Bug]: WriteToBigquery Deadletter pattern does not work with FILE_LOADS method

cozos commented on issue #22986:
URL: https://github.com/apache/beam/issues/22986#issuecomment-1235858888

   Sure, will do.


[GitHub] [beam] brucearctor commented on issue #22986: [Bug]: WriteToBigquery Deadletter pattern does not work with FILE_LOADS method

brucearctor commented on issue #22986:
URL: https://github.com/apache/beam/issues/22986#issuecomment-1234909480

   > The deadletter pattern is also important in batch systems - to prevent a single malformed row from failing the entire job.
   
   Had you seen:
   https://cloud.google.com/java/docs/reference/google-cloud-bigquery/latest/com.google.cloud.bigquery.LoadJobConfiguration.Builder#com_google_cloud_bigquery_LoadJobConfiguration_Builder_setMaxBadRecords_java_lang_Integer_
   or (since I think you prefer Python):
   https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.job.LoadJobConfig.html#google.cloud.bigquery.job.LoadJobConfig.max_bad_records
   
   @cozos -- with that config set higher than the number of bad records, the job will succeed. It sounds like that is what you need, based on the quote.
   
   As it relates to using FILE_LOADS: I *think* the BQ API that FILE_LOADS calls would not support that functionality.


[GitHub] [beam] brucearctor commented on issue #22986: [Bug]: WriteToBigquery Deadletter pattern does not work with FILE_LOADS method

brucearctor commented on issue #22986:
URL: https://github.com/apache/beam/issues/22986#issuecomment-1235671298

   Can you put this in a Stack Overflow question (a summary of at least the initial question), and I can reply there? This feels like something we can debug and find workarounds for, rather than needing to change the codebase [maybe we'll arrive at that, but let's see]. Stack Overflow is a great resource, as it is well indexed, and others are likely to find it when searching for similar keywords.


[GitHub] [beam] brucearctor commented on issue #22986: [Bug]: WriteToBigquery Deadletter pattern does not work with FILE_LOADS method

brucearctor commented on issue #22986:
URL: https://github.com/apache/beam/issues/22986#issuecomment-1236049048

   There might be room to extend the Python Beam BigQuery IO. But let's see/experiment whether we can solve this without needing to change the codebase.
   
   We'd also want to think about whether we'd want to continue extending the Python BigQuery IO, or write a cross-language wrapper for the Java BigQuery IO. If we need additional functionality, we can work out the desired approach with the community.


[GitHub] [beam] cozos commented on issue #22986: [Bug]: WriteToBigquery Deadletter pattern does not work with FILE_LOADS method

cozos commented on issue #22986:
URL: https://github.com/apache/beam/issues/22986#issuecomment-1236047485

   @brucearctor here is a Stackoverflow question: https://stackoverflow.com/questions/73589662/in-apache-beam-dataflows-writetobigquery-transform-how-do-you-enable-the-deadl


[GitHub] [beam] cozos commented on issue #22986: [Bug]: WriteToBigquery Deadletter pattern does not work with FILE_LOADS method

cozos commented on issue #22986:
URL: https://github.com/apache/beam/issues/22986#issuecomment-1235493876

   I've tried to prevalidate my PCollection with [fastavro.validate](https://github.com/fastavro/fastavro/blob/073da50afaec729f66caea357538417118ae7619/fastavro/_validation.pyx#L301), but it seems to be very expensive - 15x the CPU hours in my tests. I think this is because `fastavro.writer` validates the schema efficiently as it serializes, while `fastavro.validate` does explicit type checking (i.e. `isinstance` calls).
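   A toy illustration of that tradeoff, with a made-up schema and a stub writer (not fastavro itself): explicit per-field checks duplicate work in Python, while simply attempting the write lets the writer do the checking as a side effect of serialization.

   ```python
   # Illustrative only: LBYL-style field checks vs. EAFP-style "just try
   # the write and catch the failure". Schema and names are made up.
   SCHEMA = {"id": int, "name": str}

   def validate_explicitly(row):
       # Per-field isinstance checks, analogous to a validate() pass.
       return all(isinstance(row.get(field), typ) for field, typ in SCHEMA.items())

   def validate_by_writing(row, write_fn):
       # Attempt the write and treat an exception as "invalid".
       try:
           write_fn(row)
           return True
       except (TypeError, ValueError):
           return False

   def stub_write(row):
       # Stand-in for a schema-enforcing writer.
       if not validate_explicitly(row):
           raise TypeError("row does not match schema")

   ok = validate_by_writing({"id": 1, "name": "a"}, stub_write)     # True
   bad = validate_by_writing({"id": "1", "name": "a"}, stub_write)  # False
   ```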

