Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/06/04 20:06:03 UTC

[GitHub] [beam] damccorm opened a new issue, #20891: Suspected data loss (and/or duplicates) bug in BigQueryServicesImpl

damccorm opened a new issue, #20891:
URL: https://github.com/apache/beam/issues/20891

   The BigQuery insertAll API can yield errors that are specific to the individual rows whose insert failed.
   
   Rows are selected [here for retrying](https://github.com/apache/beam/blob/243128a8fc52798e1b58b0cf1a271d95ee7aa241/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryServicesImpl.java#L967), using the errorIndex which is returned from the error.
   
   retryRows.add(rowsToPublish.get(errorIndex));
   
   However, this errorIndex is not a valid index into rowsToPublish, so it looks like the wrong rows are being selected for retry.
   
   *Why can't you use errorIndex to index rowsToPublish?*
   
   Because rowsToPublish contains all of the rows that were passed into insertAll.
   
   These are then batched into smaller lists of ["rows"](https://github.com/apache/beam/blob/243128a8fc52798e1b58b0cf1a271d95ee7aa241/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryServicesImpl.java#L875), and multiple API calls are made to BigQuery to insert them.
   
   The errors returned actually refer to the list of rows passed into the call made to BigQuery, so they are only valid indices for "rows". Thus, they are not valid indices for "rowsToPublish".
   
   Note: These lists have a different number of rows: rowsToPublish.size() > rows.size()
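
   To make the mismatch concrete, here is a minimal, self-contained sketch. It is not the Beam code: the class `RetryIndexSketch`, the method `retryRows`, and the parameters `batchStart` and `addBatchOffset` are hypothetical names used only for illustration, and rows are plain strings. It assumes each batch is a contiguous slice of rowsToPublish beginning at `batchStart`, so a batch-relative errorIndex has to be shifted by that offset before it can index the full list. The actual fix in Beam may be structured differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RetryIndexSketch {

    /**
     * Picks the rows to retry for ONE batch.
     *
     * rowsToPublish  - every row handed to insertAll
     * batchStart     - position in rowsToPublish where this batch begins (assumed contiguous)
     * errorIndices   - indices reported for this batch; they are relative to the batch
     * addBatchOffset - false reproduces the suspected bug, true shifts by batchStart
     */
    static List<String> retryRows(
            List<String> rowsToPublish,
            int batchStart,
            List<Integer> errorIndices,
            boolean addBatchOffset) {
        List<String> retry = new ArrayList<>();
        for (int errorIndex : errorIndices) {
            int index = addBatchOffset ? batchStart + errorIndex : errorIndex;
            retry.add(rowsToPublish.get(index));
        }
        return retry;
    }

    public static void main(String[] args) {
        List<String> rowsToPublish = Arrays.asList("r0", "r1", "r2", "r3", "r4");

        // Suppose the second batch is [r3, r4], starting at offset 3,
        // and the service reports a failure at batch index 1 (that is, r4).
        List<Integer> errorIndices = Arrays.asList(1);

        // Indexing the full list with the batch-relative index picks the wrong row.
        System.out.println(retryRows(rowsToPublish, 3, errorIndices, false)); // prints [r1]

        // Shifting by the batch offset picks the row that actually failed.
        System.out.println(retryRows(rowsToPublish, 3, errorIndices, true)); // prints [r4]
    }
}
```

   In this sketch the unshifted lookup retries r1, which may already have been inserted successfully (a duplicate), and never retries r4 (data loss), which matches the symptoms suspected in the title.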
   
   Imported from Jira [BEAM-12139](https://issues.apache.org/jira/browse/BEAM-12139). Original Jira may contain additional context.
   Reported by: ajamato@google.com.

