Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2021/10/08 08:14:45 UTC

[GitHub] [beam] je-ik commented on a change in pull request #15378: [RFC] Define and document per-key ordering semantics for runners

je-ik commented on a change in pull request #15378:
URL: https://github.com/apache/beam/pull/15378#discussion_r724797638



##########
File path: website/www/site/content/en/documentation/runtime/model.md
##########
@@ -77,6 +82,48 @@ element, and having to retry everything if there is a failure. For example, a
 streaming runner may prefer to process and commit small bundles, and a batch
 runner may prefer to process larger bundles.
 
+### Data partitioning and inter-stage execution
+
+Partitioning and parallelization of element processing within a Beam pipeline
+depend on two things:
+
+- Data source implementation
+- Inter-stage key parallelism
+
+Beam pipelines read data from a source (e.g. `KafkaIO`, `BigQueryIO`, `JdbcIO`,
+or your own source implementation). New sources in Beam are implemented as
+Splittable `DoFn`s. A Splittable `DoFn` provides the runner with interfaces
+that facilitate the splitting of work.
+
+When running key-based operations in Beam (e.g. `GroupByKey`, `Combine`,
+`Reshuffle.perKey`, and stateful `DoFn`s), Beam runners perform serialization
+and transfer of data known as *shuffle*<sup>1</sup>. Shuffle allows data
+elements of the same key to be processed together.
+
+The way in which runners *shuffle* data may differ slightly between batch and
+streaming execution modes.
+
+<sup>1</sup>Not to be confused with the `shuffle` operation in some runners.
+
+#### Data ordering in a pipeline execution
+The Beam model does not define strict guarantees regarding the order in which
+runners process elements or transport them across `PTransform`s. Runners are
+free to implement shuffling semantics in different ways.
+
+Some use cases exist where user pipelines may need to rely on specific ordering
+semantics in pipeline execution. The [capability matrix documents](/documentation/runners/capability-matrix/additional-common-features-not-yet-part-of-the-beam-model/index.html)
+runner behavior for **key-ordered delivery**.
+
+Consider a single Beam worker processing a series of bundles from the same Beam
+stage, and consider a `PTransform` that outputs data from this stage into a
+downstream `PCollection`. Finally, consider two events *with the same key*
+emitted in a certain order by this worker (within the same bundle or as part of
+different bundles).
+
+We say that a Beam runner supports **key-ordered delivery** if it guarantees
+that these two events will be observed downstream in the same order, regardless
+of how they were transmitted.
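
To make the shuffle described in the proposed text concrete, here is a toy sketch in plain Python (not Beam SDK code; the function name `shuffle_by_key` is made up for illustration) of what grouping by key means: all values for a given key are brought together, preserving the order in which a single upstream worker emitted them.

```python
from collections import defaultdict

def shuffle_by_key(elements):
    """Toy model of a per-key shuffle: group (key, value) pairs so that
    all values for the same key land together, keeping the order in
    which they were emitted upstream."""
    grouped = defaultdict(list)
    for key, value in elements:
        grouped[key].append(value)
    return dict(grouped)

# Two events with the same key ("user1") keep their relative order.
events = [("user1", "click"), ("user2", "view"), ("user1", "purchase")]
print(shuffle_by_key(events))
# {'user1': ['click', 'purchase'], 'user2': ['view']}
```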

Review comment:
       What I meant was that if we have a chain of transforms as follows:
   ```
   A -> B -> C
   ```
   and all of these transforms are stateful, then there is no guarantee on the ordering of elements emitted from A arriving at C. The only exception is when the key does not change, because then we can prove by transitivity that the ordering is preserved. If the key changes between A and B, then there is no ordering guarantee at C (even if the key changes back to the same key that was emitted from A).
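
   One way to picture the transitivity argument: a hypothetical sketch (plain Python, not Beam code; `stage` is an invented helper) where each stage preserves the order of elements within a key but is free to interleave different keys arbitrarily. Chaining such stages with an unchanged key preserves per-key order end to end; once elements are re-keyed, previously separate streams merge and, in a real multi-worker runner, the merged order is no longer constrained.

   ```python
   import random
   from collections import defaultdict

   def stage(elements):
       """Hypothetical key-ordered stage: preserves the order of elements
       within each key, but may deliver different keys in any order."""
       per_key = defaultdict(list)
       for key, value in elements:
           per_key[key].append(value)
       keys = list(per_key)
       random.shuffle(keys)  # different keys interleave arbitrarily
       return [(k, v) for k in keys for v in per_key[k]]

   # A -> B -> C with an unchanged key: per-key order holds by transitivity.
   events = [("k1", 1), ("k2", 10), ("k1", 2), ("k1", 3)]
   out = stage(stage(stage(events)))
   assert [v for k, v in out if k == "k1"] == [1, 2, 3]
   # If B re-keys the elements, streams that were separate upstream merge
   # under the new key; a real runner with multiple workers gives no
   # guarantee on the merged order at C, even if C keys back to "k1".
   ```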



