Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/11/10 15:10:53 UTC

[GitHub] [beam] carlpayne opened a new issue, #24090: [Feature Request]: BigQueryIO should enable access to input object when insertion fails

carlpayne opened a new issue, #24090:
URL: https://github.com/apache/beam/issues/24090

   ### What would you like to happen?
   
   Currently, when BigQueryIO fails to write to BigQuery, we get back a `PCollection<BigQueryInsertError>` via `getFailedInsertsWithErr` (or a `PCollection<BigQueryStorageApiInsertError>` if using `getFailedStorageApiInserts`), which gives us the `TableRow` for each failure.
   
   However, it would also be very useful to have access to the original input data, not just the transformed `TableRow`. In our use case, we stream Avro data from Kafka to BigQuery, so the input for `BigQueryIO.Write` is a `KafkaRecord<String, byte[]>`, which we transform into a `TableRow` via `withFormatFunction`. We'd like to write each failed insert back to Kafka (into a DLQ topic) so that we can reprocess it later. However, the only way we can currently achieve this is to convert the `TableRow` back into a `KafkaRecord`, which risks losing or altering the original data during the conversion.
   
   One possible workaround we've explored is joining the input `PCollection` containing the Kafka data with the failed inserts via some shared ID, so that we can get back the original messages. The main issue with this is that errors can sometimes take many hours to be visible via `getFailedStorageApiInserts` (due to https://github.com/apache/beam/issues/23291), so we would need to buffer many millions of records to cover this time window, which isn't feasible in our case.
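
As a rough illustration of the join workaround described above, here is a plain-Java sketch that deliberately avoids Beam APIs. It buffers each raw payload under a shared ID for a fixed retention window and looks it up again when a failed insert arrives. The class `OriginalRecordBuffer`, its methods `remember` and `recover`, and the retention window are hypothetical names chosen for this example. A real pipeline would more likely use Beam state and timers or a keyed join instead of an in-memory map.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Plain-Java stand-in for the join workaround; hypothetical, not a Beam API. */
final class OriginalRecordBuffer {
    /** A raw payload together with the time it was buffered. */
    private static final class Buffered {
        final byte[] payload;
        final Instant bufferedAt;

        Buffered(byte[] payload, Instant bufferedAt) {
            this.payload = payload;
            this.bufferedAt = bufferedAt;
        }
    }

    private final Duration retention;
    // Insertion-ordered, so the oldest entries come first and can be evicted cheaply.
    // Assumes callers pass non-decreasing values of `now`.
    private final LinkedHashMap<String, Buffered> entries = new LinkedHashMap<>();

    OriginalRecordBuffer(Duration retention) {
        this.retention = retention;
    }

    /** Remember the raw payload under the shared ID before it is converted to a row. */
    void remember(String id, byte[] payload, Instant now) {
        evictExpired(now);
        entries.remove(id); // re-inserting moves the ID to the newest position
        entries.put(id, new Buffered(payload.clone(), now));
    }

    /** Look up the raw payload for a failed insert; empty if it was already evicted. */
    Optional<byte[]> recover(String id, Instant now) {
        evictExpired(now);
        Buffered found = entries.remove(id);
        return found == null ? Optional.empty() : Optional.of(found.payload);
    }

    /** Number of payloads currently held in memory. */
    int size() {
        return entries.size();
    }

    /** Drop every entry older than the retention window, oldest first. */
    private void evictExpired(Instant now) {
        Iterator<Map.Entry<String, Buffered>> it = entries.entrySet().iterator();
        while (it.hasNext()) {
            Buffered oldest = it.next().getValue();
            if (Duration.between(oldest.bufferedAt, now).compareTo(retention) <= 0) {
                break; // everything after this entry is newer
            }
            it.remove();
        }
    }

    public static void main(String[] args) {
        OriginalRecordBuffer buffer = new OriginalRecordBuffer(Duration.ofHours(1));
        Instant start = Instant.parse("2022-11-10T15:00:00Z");
        buffer.remember("msg-1", new byte[] {1, 2, 3}, start);

        // A failure reported within the window can be matched to its raw payload.
        System.out.println(buffer.recover("msg-1", start.plusSeconds(60)).isPresent());

        // A failure reported hours later finds nothing, because the payload was evicted.
        buffer.remember("msg-2", new byte[] {4}, start);
        System.out.println(buffer.recover("msg-2", start.plus(Duration.ofHours(3))).isPresent());
    }
}
```

The buffer has to hold every record for as long as a failure might still be reported. If errors can surface hours later, the retention window and the memory held by `entries` grow with it, which is the cost that makes this approach impractical at our volume.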
   
   ### Issue Priority
   
   Priority: 3
   
   ### Issue Component
   
   Component: io-java-gcp


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] carlpayne commented on issue #24090: [Feature Request]: BigQueryIO should enable access to input object when insertion fails

Posted by GitBox <gi...@apache.org>.
carlpayne commented on issue #24090:
URL: https://github.com/apache/beam/issues/24090#issuecomment-1322222261

   @reuvenlax The main examples we've encountered are schema mismatches (e.g. missing a required field, as per https://github.com/apache/beam/issues/23291). In this case, we need to manually (or automatically, if https://github.com/apache/beam/issues/24063 becomes possible) update the table schema and retry.
   
   While it would be possible to do this via DLQ in a separate BigQuery table, this would be inconsistent with many of our other applications and re-stream processes where DLQ is always via Kafka (we use other stream-processing tools such as Flink where Kafka is the DLQ). We also prefer to keep the "raw" data for replay purposes, rather than the converted TableRow, just to rule out an issue with the raw-to-TableRow conversion process.




[GitHub] [beam] reuvenlax commented on issue #24090: [Feature Request]: BigQueryIO should enable access to input object when insertion fails

Posted by GitBox <gi...@apache.org>.
reuvenlax commented on issue #24090:
URL: https://github.com/apache/beam/issues/24090#issuecomment-1322424853

   Makes sense. Unfortunately I don't see a great way of doing this without doing a full join, as Beam no longer has a handle on the original message when it tries to insert it into BigQuery. Errors taking hours to be visible doesn't make sense to me, unless there is a very strange bug in the connector (I left a comment on the other issue).




[GitHub] [beam] reuvenlax commented on issue #24090: [Feature Request]: BigQueryIO should enable access to input object when insertion fails

Posted by GitBox <gi...@apache.org>.
reuvenlax commented on issue #24090:
URL: https://github.com/apache/beam/issues/24090#issuecomment-1320972847

   Unfortunately, at least for storage-api writes, we don't have a good way of doing this. At the point where BigQuery returns a failure, the sink no longer has a handle on the original PCollection, as it has already been converted to proto format and shuffled.
   
   Can you explain why you want to reinject it into Kafka? `getFailedStorageApiInserts` should only return non-retryable errors (e.g. passing in the wrong type for a field), so any attempt to retry it will fail indefinitely.




[GitHub] [beam] kennknowles commented on issue #24090: [Feature Request]: BigQueryIO should enable access to input object when insertion fails

Posted by GitBox <gi...@apache.org>.
kennknowles commented on issue #24090:
URL: https://github.com/apache/beam/issues/24090#issuecomment-1314301797

   This seems quite useful. CC @johnjcasey @reuvenlax 

