Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2021/11/17 17:22:48 UTC

[GitHub] [beam] kennknowles commented on a change in pull request #15998: [BEAM-13265] Add withDeterministicRecordIdFn which allows for the no-reshuffle optimization in BigQueryIO.Write

kennknowles commented on a change in pull request #15998:
URL: https://github.com/apache/beam/pull/15998#discussion_r751459853



##########
File path: sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java
##########
@@ -2427,6 +2435,22 @@ static String getExtractDestinationUri(String extractDestinationDir) {
       return toBuilder().setAutoSharding(true).build();
     }
 
+    /**
+     * Provides a function which can serve as a source of deterministic unique ids for each record
+     * to be written, replacing the unique ids generated with the default scheme. When used with
+     * {@link Method#STREAMING_INSERTS} This also elides the re-shuffle from the BigQueryIO Write by
+     * using the keys on which the data is grouped at the point at which BigQueryIO Write is
+     * applied, since the reshuffle is necessary only for the checkpointing of the default-generated

Review comment:
       We can fix this later, but a reshuffle does not imply checkpointing except on Dataflow (and Dataflow could also change this if it wants).
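
       For illustration only, here is a minimal sketch in plain Java (no Beam dependency) of what a deterministic record id function might look like. The `Event` type, its fields, and the SHA-256 scheme are assumptions made for the example and are not taken from the PR. The idea is that the id is derived purely from the record contents, so a retried element yields the same id without depending on any runner to checkpoint a randomly generated one.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.function.Function;

final class DeterministicRecordId {

  // Hypothetical record type used only for this sketch; not part of the PR.
  static final class Event {
    final String userId;
    final long timestampMillis;
    final String payload;

    Event(String userId, long timestampMillis, String payload) {
      this.userId = userId;
      this.timestampMillis = timestampMillis;
      this.payload = payload;
    }
  }

  // Derives the id from the record contents only, so recomputing it after a
  // retry gives the same value. Assumes these three fields together identify
  // a record and that '|' does not occur inside userId.
  static final Function<Event, String> RECORD_ID_FN =
      e -> sha256Hex(e.userId + "|" + e.timestampMillis + "|" + e.payload);

  // Hex-encodes the SHA-256 digest of the UTF-8 bytes of the input.
  static String sha256Hex(String input) {
    try {
      MessageDigest digest = MessageDigest.getInstance("SHA-256");
      byte[] hash = digest.digest(input.getBytes(StandardCharsets.UTF_8));
      StringBuilder sb = new StringBuilder(hash.length * 2);
      for (byte b : hash) {
        sb.append(String.format("%02x", b));
      }
      return sb.toString();
    } catch (NoSuchAlgorithmException ex) {
      // SHA-256 is required to be available on every Java platform.
      throw new IllegalStateException(ex);
    }
  }
}
```

       A function of this shape is what the new Javadoc means by a source of deterministic unique ids. Whether skipping the reshuffle is then safe is a separate question from checkpointing, which is the point of the comment above.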



