Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2021/10/26 20:23:40 UTC

[GitHub] [beam] melap commented on a change in pull request #15781: [BEAM-11758] Update basics page: Splittable DoFn

melap commented on a change in pull request #15781:
URL: https://github.com/apache/beam/pull/15781#discussion_r736890434



##########
File path: website/www/site/content/en/documentation/basics.md
##########
@@ -360,3 +364,42 @@ For more information about runners, see the following pages:
 
  * [Choosing a Runner](/documentation/#choosing-a-runner)
  * [Beam Capability Matrix](/documentation/runners/capability-matrix/)
+
+### Splittable DoFn
+
+Splittable `DoFn` (SDF) is a generalization of `DoFn` that lets you process
+elements in a non-monolithic way. Splittable `DoFn` makes it easier to create
+complex, modular I/O connectors in Beam.
+
+A regular `ParDo` processes an entire element at a time, applying your regular
+`DoFn` and waiting for the call to terminate. When you instead apply a
+splittable `DoFn` to each element, the runner has the option of splitting the
+element's processing into smaller tasks. You can checkpoint the processing of an
+element, and you can split the remaining work to yield additional parallelism.
+
+For example, imagine you want to read every line from very large text files.
+When you write your splittable `DoFn`, you can have separate pieces of logic to
+read a segment of a file, split a segment of a file into sub-segments, and
+report progress through the current segment. The runner can then invoke your
+splittable `DoFn` intelligently to split up each input and read portions
+separately, in parallel.
+
+A common computation pattern has the following steps:
+
+ 1. The runner splits an incoming element before starting any processing.
+ 2. The runner starts running your processing logic on each sub-element.
+ 3. If the runner notices that some sub-elements are taking longer than others,
+    the runner halts processing of those sub-elements and splits again.
+ 4. Repeat from step 2.
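
The split-process-resplit steps in the hunk above can be sketched in plain Python. This is only an illustration under stated assumptions, not the Beam API: `OffsetRange`, `initial_split`, and `split_remaining` are hypothetical names, an element's work is modeled as a half-open range of offsets (for example, byte positions in a file), and "split again" is assumed to mean cutting the unprocessed part in half.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class OffsetRange:
    """Hypothetical half-open range [start, stop) of work for one element."""
    start: int
    stop: int


def initial_split(r: OffsetRange, chunk: int) -> List[OffsetRange]:
    """Step 1: cut the element's work into chunks before any processing."""
    if chunk <= 0:
        raise ValueError("chunk must be positive")
    return [OffsetRange(s, min(s + chunk, r.stop))
            for s in range(r.start, r.stop, chunk)]


def split_remaining(
    r: OffsetRange, position: int
) -> Tuple[OffsetRange, Optional[OffsetRange]]:
    """Step 3: split a slow sub-element while it is being processed.

    `position` is the next unprocessed offset. The current worker keeps
    [start, mid); [mid, stop) is returned so it can be scheduled elsewhere.
    Returns (r, None) when too little work remains to split.
    """
    mid = position + (r.stop - position) // 2
    if mid <= position or mid >= r.stop:
        return r, None
    return OffsetRange(r.start, mid), OffsetRange(mid, r.stop)


# Step 1: three sub-elements covering [0, 40), [40, 80), [80, 100).
parts = initial_split(OffsetRange(0, 100), 40)

# Step 3: the first sub-element is slow and has only reached offset 10,
# so the work is cut into [0, 25) and [25, 40).
primary, residual = split_remaining(parts[0], 10)
```

In this sketch the residual range would go back into the pool of sub-elements (step 4), which is what lets a runner rebalance work without restarting what has already been processed.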

Review comment:
       Thanks for the explanation! I've addressed your comments -- I tried to tweak the step 4 suggestion slightly for clarity; let me know if it's no longer accurate.
   



