Posted to github@beam.apache.org by "Abacn (via GitHub)" <gi...@apache.org> on 2023/05/19 14:54:48 UTC

[GitHub] [beam] Abacn opened a new issue, #26789: [Bug]: BigQueryIO Storage API write autoUpdateSchema may cause data corruption

Abacn opened a new issue, #26789:
URL: https://github.com/apache/beam/issues/26789

   ### What happened?
   
   The schema returned in the response may order its fields differently from the schema used by the writer. This can cause either a schema mismatch or data being written into the wrong field.
   
   This was revealed by another bug that made schemaUpdate always true (fixed in #26752).
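
To make the failure mode concrete, here is a minimal sketch in plain Java. It is a toy model, not Beam or BigQuery code: the class `FieldOrderSketch`, its methods, and the field names `number` and `name` are all hypothetical. It assumes values are paired with fields purely by position, and shows how a reordered schema then silently puts a value under the wrong field name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model only; none of these names exist in Beam or BigQuery. */
final class FieldOrderSketch {

  /** Encodes a row positionally: value i belongs to field i of writerSchema. */
  static List<Object> encodeByPosition(List<String> writerSchema, Map<String, Object> row) {
    List<Object> encoded = new ArrayList<>();
    for (String field : writerSchema) {
      encoded.add(row.get(field));
    }
    return encoded;
  }

  /** Decodes positionally: value i is assigned to field i of readerSchema. */
  static Map<String, Object> decodeByPosition(List<String> readerSchema, List<Object> encoded) {
    if (readerSchema.size() != encoded.size()) {
      throw new IllegalArgumentException("Schema mismatch: field count differs");
    }
    Map<String, Object> decoded = new LinkedHashMap<>();
    for (int i = 0; i < readerSchema.size(); i++) {
      decoded.put(readerSchema.get(i), encoded.get(i));
    }
    return decoded;
  }

  public static void main(String[] args) {
    List<String> writerSchema = Arrays.asList("number", "name");
    // Same fields, different order (the situation described in this issue).
    List<String> reorderedSchema = Arrays.asList("name", "number");

    Map<String, Object> row = new LinkedHashMap<>();
    row.put("number", 6);
    row.put("name", "name6");

    List<Object> encoded = encodeByPosition(writerSchema, row);
    // With the reordered schema, 6 lands under "name" and "name6" under "number".
    System.out.println(decodeByPosition(reorderedSchema, encoded));
  }
}
```

In this model the mismatch is only detected when the field counts differ. A pure reordering goes unnoticed unless the field types happen to be incompatible, which is consistent with the two outcomes named above (schema mismatch or data in the wrong field).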
   
   ### Issue Priority
   
   Priority: 2 (default / most bugs should be filed as P2)
   
   ### Issue Components
   
   - [ ] Component: Python SDK
   - [X] Component: Java SDK
   - [ ] Component: Go SDK
   - [ ] Component: Typescript SDK
   - [ ] Component: IO connector
   - [ ] Component: Beam examples
   - [ ] Component: Beam playground
   - [ ] Component: Beam katas
   - [ ] Component: Website
   - [ ] Component: Spark Runner
   - [ ] Component: Flink Runner
   - [ ] Component: Samza Runner
   - [ ] Component: Twister2 Runner
   - [ ] Component: Hazelcast Jet Runner
   - [ ] Component: Google Cloud Dataflow Runner


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] liferoad commented on issue #26789: [Bug]: BigQueryIO Storage API write autoUpdateSchema may cause data corruption

Posted by "liferoad (via GitHub)" <gi...@apache.org>.
liferoad commented on issue #26789:
URL: https://github.com/apache/beam/issues/26789#issuecomment-1556355749

   #26794 was merged. @Abacn can you verify whether it completely addresses the current issue?




[GitHub] [beam] Abacn commented on issue #26789: [Bug]: BigQueryIO Storage API write autoUpdateSchema may cause data corruption

Posted by "Abacn (via GitHub)" <gi...@apache.org>.
Abacn commented on issue #26789:
URL: https://github.com/apache/beam/issues/26789#issuecomment-1555301585

   Update: @reuvenlax opened #26794 with the fix. I verified that running the test on a modified master fails with
   
   ```
   org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.lang.RuntimeException: java.lang.RuntimeException: Unknown fields set in append! 1: 6
   2: "name6"
   
   	at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:374)
           ...
   Caused by: java.lang.RuntimeException: java.lang.RuntimeException: Unknown fields set in append! 1: 6
   2: "name6"
   
   	at org.apache.beam.sdk.io.gcp.bigquery.StorageApiWritesShardedRecords$WriteRecordsDoFn.lambda$process$10(StorageApiWritesShardedRecords.java:567)
   	at org.apache.beam.sdk.io.gcp.bigquery.RetryManager$Operation.run(RetryManager.java:132)
   	at org.apache.beam.sdk.io.gcp.bigquery.RetryManager.run(RetryManager.java:248)
   	at org.apache.beam.sdk.io.gcp.bigquery.StorageApiWritesShardedRecords$WriteRecordsDoFn.process(StorageApiWritesShardedRecords.java:740)
   Caused by: java.lang.RuntimeException: Unknown fields set in append! 1: 6
   2: "name6"
   
   	at org.apache.beam.sdk.io.gcp.testing.FakeDatasetService$1.appendRows(FakeDatasetService.java:573)
   	at org.apache.beam.sdk.io.gcp.bigquery.StorageApiWritesShardedRecords$WriteRecordsDoFn.lambda$process$10(StorageApiWritesShardedRecords.java:565)
   	at org.apache.beam.sdk.io.gcp.bigquery.RetryManager$Operation.run(RetryManager.java:132)
   	at org.apache.beam.sdk.io.gcp.bigquery.RetryManager.run(RetryManager.java:248)
   
   ```
   
   - If the test does not set the threshold and the number of streams to 1, the error message is:
   
   ```
   org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.lang.RuntimeException: com.google.protobuf.InvalidProtocolBufferException: Message missing required fields: req
   	at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:374)
           ......
   Caused by: java.lang.RuntimeException: com.google.protobuf.InvalidProtocolBufferException: Message missing required fields: req
   	at org.apache.beam.sdk.io.gcp.bigquery.StorageApiWritesShardedRecords$WriteRecordsDoFn.lambda$process$10(StorageApiWritesShardedRecords.java:567)
   	at org.apache.beam.sdk.io.gcp.bigquery.RetryManager$Operation.run(RetryManager.java:132)
   	at org.apache.beam.sdk.io.gcp.bigquery.RetryManager.run(RetryManager.java:248)
   	at org.apache.beam.sdk.io.gcp.bigquery.StorageApiWritesShardedRecords$WriteRecordsDoFn.process(StorageApiWritesShardedRecords.java:740)
   Caused by: com.google.protobuf.InvalidProtocolBufferException: Message missing required fields: req
   	at com.google.protobuf.UninitializedMessageException.asInvalidProtocolBufferException(UninitializedMessageException.java:79)
   	at com.google.protobuf.DynamicMessage$Builder.buildParsed(DynamicMessage.java:394)
   
   ```




[GitHub] [beam] Abacn closed issue #26789: [Bug]: BigQueryIO Storage API write autoUpdateSchema may cause data corruption

Posted by "Abacn (via GitHub)" <gi...@apache.org>.
Abacn closed issue #26789: [Bug]: BigQueryIO Storage API write autoUpdateSchema may cause data corruption
URL: https://github.com/apache/beam/issues/26789




[GitHub] [beam] Abacn commented on issue #26789: [Bug]: BigQueryIO Storage API write autoUpdateSchema may cause data corruption

Posted by "Abacn (via GitHub)" <gi...@apache.org>.
Abacn commented on issue #26789:
URL: https://github.com/apache/beam/issues/26789#issuecomment-1555266227

   I tried the steps in the description:
   
   1. Run a pipeline that writes some data to a table.
   2. Run another STORAGE_API_AT_LEAST_ONCE pipeline that either
     - sets `.withSchema` with a tableSchema whose field order is shuffled, or
     - assembles each TableRow by adding the fields in shuffled order.
   
   Both pipelines succeeded and wrote the expected results (I checked the table data). The code snippet is in #26796.
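
One way to see why shuffled insertion order can be harmless is the sketch below, in plain Java. It is not the snippet from #26796 and not Beam code: `ShuffledRowSketch`, `valuesInSchemaOrder`, and the field names are hypothetical. It assumes each value is resolved by field name against the schema, which this thread does not confirm about the actual conversion.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model only; none of these names exist in Beam or BigQuery. */
final class ShuffledRowSketch {

  /** Looks up every schema field by name, so the row's insertion order is irrelevant. */
  static List<Object> valuesInSchemaOrder(List<String> schemaFields, Map<String, Object> row) {
    List<Object> values = new ArrayList<>();
    for (String field : schemaFields) {
      if (!row.containsKey(field)) {
        throw new IllegalArgumentException("Row is missing field: " + field);
      }
      values.add(row.get(field));
    }
    return values;
  }

  public static void main(String[] args) {
    List<String> schema = Arrays.asList("number", "name");

    // Row assembled in schema order.
    Map<String, Object> inOrder = new LinkedHashMap<>();
    inOrder.put("number", 6);
    inOrder.put("name", "name6");

    // Same row assembled with the fields added in shuffled order.
    Map<String, Object> shuffled = new LinkedHashMap<>();
    shuffled.put("name", "name6");
    shuffled.put("number", 6);

    // Both rows resolve to the same values in schema order.
    System.out.println(
        valuesInSchemaOrder(schema, inOrder).equals(valuesInSchemaOrder(schema, shuffled)));
  }
}
```

Under that assumption, only a path that matches values by position (rather than by name) would be sensitive to field order.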




[GitHub] [beam] Abacn commented on issue #26789: [Bug]: BigQueryIO Storage API write autoUpdateSchema may cause data corruption

Posted by "Abacn (via GitHub)" <gi...@apache.org>.
Abacn commented on issue #26789:
URL: https://github.com/apache/beam/issues/26789#issuecomment-1556467999

   Yes (see the comment above). #26794 fixes this on master, and #26810 cherry-picks it into release-2.48.0. I will close this once the cherry-pick is merged.

