You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2021/07/27 19:41:07 UTC

[GitHub] [beam] steveniemitz commented on a change in pull request #14852: [BEAM-12378] GroupIntoBatches improvements

steveniemitz commented on a change in pull request #14852:
URL: https://github.com/apache/beam/pull/14852#discussion_r677746544



##########
File path: runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/GroupIntoBatchesOverride.java
##########
@@ -87,14 +94,25 @@ private BatchGroupIntoBatches(long batchSize) {
                   new DoFn<KV<K, Iterable<V>>, KV<K, Iterable<V>>>() {
                     @ProcessElement
                     public void process(ProcessContext c) {
-                      // Iterators.partition lazily creates the partitions as they are accessed
-                      // allowing it to partition very large iterators.
-                      Iterator<List<V>> iterator =
-                          Iterators.partition(c.element().getValue().iterator(), (int) batchSize);
-
-                      // Note that GroupIntoBatches only outputs when the batch is non-empty.
-                      while (iterator.hasNext()) {
-                        c.output(KV.of(c.element().getKey(), iterator.next()));
+                      List<V> currentBatch = Lists.newArrayList();
+                      long batchSizeBytes = 0;
+                      for (V element : c.element().getValue()) {
+                        currentBatch.add(element);
+                        if (weigher != null) {
+                          batchSizeBytes += weigher.apply(element);
+                        }
+                        if (currentBatch.size() == maxBatchSizeElements
+                            || (maxBatchSizeBytes != Long.MAX_VALUE
+                                && batchSizeBytes >= maxBatchSizeBytes)) {
+                          c.output(KV.of(c.element().getKey(), currentBatch));
+                          // Call clear() since that allows us to reuse the array memory for
+                          // subsequent batches.
+                          currentBatch.clear();

Review comment:
       looking at this, I don't understand how this is safe.  Isn't this mutating an element once it's been emitted? If fused step happens to store this list somewhere outside of the processElement call (state, buffer between start/end bundle, etc), it'll be mutated out from under it.
   
   Is there a copy happening somewhere else under the covers that makes this safe?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org