You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2020/12/08 00:13:45 UTC

[GitHub] [beam] lostluck commented on a change in pull request #13456: Add a small announcement for Splittable DoFn.

lostluck commented on a change in pull request #13456:
URL: https://github.com/apache/beam/pull/13456#discussion_r537932640



##########
File path: website/www/site/content/en/blog/splittable-do-fn-is-available.md
##########
@@ -0,0 +1,85 @@
+---
+title:  "Splittable DoFn in Apache Beam is Ready to Use"
+date:   2020-12-16 00:00:01 -0800
+categories:
+  - blog
+aliases:
+  - /blog/2020/12/16/splittable-do-fn-is-available.html
+authors:
+  - boyuanzz
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+In 2017, [Splittable DoFn Blog Post](https://beam.apache.org/blog/splittable-do-fn/) proposed
+to build [Splittable DoFn](https://s.apache.org/splittable-do-fn) APIs as the new recommended way of
+building I/O connectors. Splittable DoFn is a generalization of `DoFn` that gives it the core
+capabilities of `Source` while retaining `DoFn`'s syntax, flexibility, modularity, and ease of
+coding. Thus, it becomes much easier to develop complex I/O connectors with simpler and reusable
+code.
+
+Splittable DoFn has three advantages over the existing `UnboundedSource` and `BoundedSource`:
+* Splittable DoFn provides a unified set of APIs to handle both unbounded and bounded cases.
+* Splittable DoFn enables reading from source descriptors dynamically.
+  - Taking Kafka IO as an example, within `UnboundedSource`/`BoundedSource` API, you must specify
+  the topic and partition you want to read from during pipeline construction time. There is no way
+  for `UnboundedSource`/`BoundedSource` to accept topics and partitions as inputs during execution
+  time. But it's native to Splittable DoFn.
+* Splittable DoFn fits in as any node on a pipeline freely with the ability of splitting.
+  - `UnboundedSource`/`BoundedSource` has to be the root node of the pipeline to gain performance
+  benefits from splitting strategies, which limits many real-world usages. This is no longer a limit
+  for a Splittable DoFn.
+
+As Splittable DoFn is now ready to use with all the mentioned improvements, it is the recommended
+way to build the new I/O connectors.Try out building your own Splittable DoFn by following the
+[programming guide](https://beam.apache.org/documentation/programming-guide/#splittable-dofns). We
+have provided tones of common utility classes such as common types of `RestrictionTracker` and
+`WatermarkEstimator` in Beam SDK, which will help you onboard easily. As for the existing I/O
+connectors, we have wrapped `UnboundedSource` and `BoundedSource` implementations into Splittable
+DoFns, yet we still encourage developers to convert `UnboundedSource`/`BoundedSource` into actual
+Splittable DoFn implementation to gain more performance benefits.
+
+Many thanks to every contributor who brought this highly expected design into the data processing
+world. We are really excited to see that users benefit from Splittable DoFn.
+
+At the end, hope you enjoy exploring some real-world Splittable DoFn examples.
+
+## Real world Splittable DoFn examples
+
+**Java Examples**
+
+* [Kafka](https://github.com/apache/beam/blob/571338b0cc96e2e80f23620fe86de5c92dffaccc/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/ReadFromKafkaDoFn.java#L118):
+An I/O connector for [Apache Kafka](https://kafka.apache.org/)
+(an open-source distributed event streaming platform).
+* [Watch](https://github.com/apache/beam/blob/571338b0cc96e2e80f23620fe86de5c92dffaccc/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Watch.java#L787):
+Uses a polling function producing a growing set of outputs for each input until a per-input
+termination condition is met.
+* [Parquet](https://github.com/apache/beam/blob/571338b0cc96e2e80f23620fe86de5c92dffaccc/sdks/java/io/parquet/src/main/java/org/apache/beam/sdk/io/parquet/ParquetIO.java#L365):
+An I/O connector for [Apache Parquet](https://parquet.apache.org/)
+(an open-source columnar storage format).
+* [HL7v2](https://github.com/apache/beam/blob/6fdde4f4eab72b49b10a8bb1cb3be263c5c416b5/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/healthcare/HL7v2IO.java#L493):
+An I/O connector for HL7v2 messages (a clinical messaging format that provides data about events
+that occur inside an organization) part of
+[Google’s Cloud Healthcare API](https://cloud.google.com/healthcare).
+* [BoundedSource wrapper](https://github.com/apache/beam/blob/571338b0cc96e2e80f23620fe86de5c92dffaccc/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Read.java#L248):
+A wrapper which converts an existing [BoundedSource](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/BoundedSource.html)
+implementation to a splittable DoFn.
+* [UnboundedSource wrapper](https://github.com/apache/beam/blob/571338b0cc96e2e80f23620fe86de5c92dffaccc/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Read.java#L432):
+A wrapper which converts an existing [UnboundedSource](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/UnboundedSource.html)
+implementation to a splittable DoFn.
+
+**Python Examples**
+* [BoundedSourceWrapper](https://github.com/apache/beam/blob/571338b0cc96e2e80f23620fe86de5c92dffaccc/sdks/python/apache_beam/io/iobase.py#L1375):
+A wrapper which converts an existing [BoundedSource](https://beam.apache.org/releases/pydoc/current/apache_beam.io.iobase.html#apache_beam.io.iobase.BoundedSource)
+implementation to a splittable DoFn.

Review comment:
       ```suggestion
   implementation to a splittable DoFn.
   
   **Go Examples**
    *  [textio.ReadSdf](https://github.com/apache/beam/blob/ce190e11332469ea59b6c9acf16ee7c673ccefdd/sdks/go/pkg/beam/io/textio/sdf.go#L40) implements reading from text files using a splittable DoFn.
   ```
   **




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org