Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2021/03/12 19:04:23 UTC

[GitHub] [beam] nehsyc commented on a change in pull request #14164: [BEAM-11934] Add runner determined sharding option for unbounded data to WriteFiles (Java)

nehsyc commented on a change in pull request #14164:
URL: https://github.com/apache/beam/pull/14164#discussion_r593385604



##########
File path: sdks/java/core/src/main/java/org/apache/beam/sdk/io/WriteFiles.java
##########
@@ -301,6 +321,21 @@
     return toBuilder().setMaxNumWritersPerBundle(-1).build();
   }
 
+  /**
+   * Returns a new {@link WriteFiles} that will write to the current {@link FileBasedSink} with
+   * runner-determined sharding for unbounded data specifically. Currently manual sharding is
+   * required for writing unbounded data with a fixed number of shards or a predefined sharding
+   * function. This option allows the runners to get around that requirement and perform automatic
+   * sharding.
+   *
+   * <p>Intended to only be used by runners. Users should use {@link

Review comment:
       How does a runner using the FnAPI typically override a non-standard transform? Or does it always require the transform to be added to the FnAPI before a runner can do something different?
   
   This is how I plan to set this in the Dataflow runner: https://github.com/apache/beam/pull/14164/commits/3382d706ff62518fa3c8f450faa5fafc2d534d5c.
   
   The main reason I added this is that `WriteFiles` already exposes `withRunnerDeterminedSharding`, but it is disabled for streaming. Removing that condition would enable the new implementation for every runner, and for runners that don't support dynamic sharding the default implementation might perform badly. Is there a better way to let runners choose whether they support this option?
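
   To make the opt-in idea concrete, here is a minimal, self-contained sketch. It does not use the real `WriteFiles` API: `ShardingOptInSketch`, `WriteConfig`, `resolve`, and the field names are hypothetical stand-ins, and the precedence between a fixed shard count and the opt-in flag is an assumption made only for illustration.

```java
// Hypothetical model of the opt-in; none of these types exist in Beam.
class ShardingOptInSketch {

  enum Sharding { FIXED, RUNNER_DETERMINED }

  /** Immutable stand-in for a write transform's sharding configuration. */
  static final class WriteConfig {
    final boolean unboundedInput;
    final boolean runnerDeterminedForUnbounded; // opt-in flag (assumed name)
    final int numShards; // 0 means "not set by the user"

    WriteConfig(boolean unboundedInput, boolean runnerDeterminedForUnbounded, int numShards) {
      this.unboundedInput = unboundedInput;
      this.runnerDeterminedForUnbounded = runnerDeterminedForUnbounded;
      this.numShards = numShards;
    }

    /** Returns a copy with the opt-in set, as a supporting runner's override might do. */
    WriteConfig withRunnerDeterminedShardingUnbounded() {
      return new WriteConfig(unboundedInput, true, numShards);
    }

    /** Decides which sharding strategy applies to this configuration. */
    Sharding resolve() {
      if (numShards > 0) {
        return Sharding.FIXED; // assumption: an explicit shard count always wins
      }
      if (!unboundedInput || runnerDeterminedForUnbounded) {
        return Sharding.RUNNER_DETERMINED;
      }
      // Unbounded input, no shard count, and the runner has not opted in.
      throw new IllegalArgumentException(
          "Unbounded input requires explicit sharding unless the runner opts in");
    }
  }

  public static void main(String[] args) {
    WriteConfig streaming = new WriteConfig(true, false, 0);
    // A runner that supports dynamic sharding flips the flag in its override.
    System.out.println(streaming.withRunnerDeterminedShardingUnbounded().resolve());
  }
}
```

   With a separate flag like this, runners that never set it keep today's behavior for unbounded input, and only runners that explicitly opt in get the new code path.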



