You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@beam.apache.org by bo...@apache.org on 2020/12/10 21:49:53 UTC

[beam] branch master updated: Add a small announcement for Splittable DoFn.

This is an automated email from the ASF dual-hosted git repository.

boyuanz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/beam.git


The following commit(s) were added to refs/heads/master by this push:
     new 379d796  Add a small announcement for Splittable DoFn.
     new c4af7f9  Merge pull request #13456 from [BEAM-10480] Add a small announcement for Splittable DoFn.
379d796 is described below

commit 379d7967444db0b7f5ab932fe331f9e0dea83d63
Author: Boyuan Zhang <bo...@google.com>
AuthorDate: Tue Dec 1 17:42:26 2020 -0800

    Add a small announcement for Splittable DoFn.
---
 .../en/blog/splittable-do-fn-is-available.md       | 91 ++++++++++++++++++++++
 1 file changed, 91 insertions(+)

diff --git a/website/www/site/content/en/blog/splittable-do-fn-is-available.md b/website/www/site/content/en/blog/splittable-do-fn-is-available.md
new file mode 100644
index 0000000..5e73a8c
--- /dev/null
+++ b/website/www/site/content/en/blog/splittable-do-fn-is-available.md
@@ -0,0 +1,91 @@
+---
+title:  "Splittable DoFn in Apache Beam is Ready to Use"
+date:   2020-12-14 00:00:01 -0800
+categories:
+  - blog
+aliases:
+  - /blog/2020/12/14/splittable-do-fn-is-available.html
+authors:
+  - boyuanzz
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+We are pleased to announce that Splittable DoFn (SDF) is ready for use in the Beam Python, Java,
+and Go SDKs for versions 2.25.0 and later.
+
+In 2017, [Splittable DoFn Blog Post](https://beam.apache.org/blog/splittable-do-fn/) proposed
+to build [Splittable DoFn](https://s.apache.org/splittable-do-fn) APIs as the new recommended way of
+building I/O connectors. Splittable DoFn is a generalization of `DoFn` that gives it the core
+capabilities of `Source` while retaining `DoFn`'s syntax, flexibility, modularity, and ease of
+coding. Thus, it becomes much easier to develop complex I/O connectors with simpler and reusable
+code.
+
+SDF has three advantages over the existing `UnboundedSource` and `BoundedSource`:
+* SDF provides a unified set of APIs to handle both unbounded and bounded cases.
+* SDF enables reading from source descriptors dynamically.
+  - Taking KafkaIO as an example, within `UnboundedSource`/`BoundedSource` API, you must specify
+  the topic and partition you want to read from during pipeline construction time. There is no way
+  for `UnboundedSource`/`BoundedSource` to accept topics and partitions as inputs during execution
+  time. But it's built-in to SDF.
+* SDF fits in as any node on a pipeline freely with the ability of splitting.
+  - `UnboundedSource`/`BoundedSource` has to be the root node of the pipeline to gain performance
+  benefits from splitting strategies, which limits many real-world usages. This is no longer a limit
+  for an SDF.
+
+As SDF is now ready to use with all the mentioned improvements, it is the recommended
+way to build the new I/O connectors. Try out building your own Splittable DoFn by following the
+[programming guide](https://beam.apache.org/documentation/programming-guide/#splittable-dofns). We
+have provided tonnes of common utility classes such as common types of `RestrictionTracker` and
+`WatermarkEstimator` in Beam SDK, which will help you onboard easily. As for the existing I/O
+connectors, we have wrapped `UnboundedSource` and `BoundedSource` implementations into Splittable
+DoFns, yet we still encourage developers to convert `UnboundedSource`/`BoundedSource` into actual
+Splittable DoFn implementation to gain more performance benefits.
+
+Many thanks to every contributor who brought this highly anticipated design into the data processing
+world. We are really excited to see that users benefit from SDF.
+
+Below are some real-world SDF examples for you to explore.
+
+## Real world Splittable DoFn examples
+
+**Java Examples**
+
+* [Kafka](https://github.com/apache/beam/blob/571338b0cc96e2e80f23620fe86de5c92dffaccc/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/ReadFromKafkaDoFn.java#L118):
+An I/O connector for [Apache Kafka](https://kafka.apache.org/)
+(an open-source distributed event streaming platform).
+* [Watch](https://github.com/apache/beam/blob/571338b0cc96e2e80f23620fe86de5c92dffaccc/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Watch.java#L787):
+Uses a polling function producing a growing set of outputs for each input until a per-input
+termination condition is met.
+* [Parquet](https://github.com/apache/beam/blob/571338b0cc96e2e80f23620fe86de5c92dffaccc/sdks/java/io/parquet/src/main/java/org/apache/beam/sdk/io/parquet/ParquetIO.java#L365):
+An I/O connector for [Apache Parquet](https://parquet.apache.org/)
+(an open-source columnar storage format).
+* [HL7v2](https://github.com/apache/beam/blob/6fdde4f4eab72b49b10a8bb1cb3be263c5c416b5/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/healthcare/HL7v2IO.java#L493):
+An I/O connector for HL7v2 messages (a clinical messaging format that provides data about events
+that occur inside an organization) part of
+[Google’s Cloud Healthcare API](https://cloud.google.com/healthcare).
+* [BoundedSource wrapper](https://github.com/apache/beam/blob/571338b0cc96e2e80f23620fe86de5c92dffaccc/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Read.java#L248):
+A wrapper which converts an existing [BoundedSource](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/BoundedSource.html)
+implementation to a splittable DoFn.
+* [UnboundedSource wrapper](https://github.com/apache/beam/blob/571338b0cc96e2e80f23620fe86de5c92dffaccc/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Read.java#L432):
+A wrapper which converts an existing [UnboundedSource](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/UnboundedSource.html)
+implementation to a splittable DoFn.
+
+**Python Examples**
+* [BoundedSourceWrapper](https://github.com/apache/beam/blob/571338b0cc96e2e80f23620fe86de5c92dffaccc/sdks/python/apache_beam/io/iobase.py#L1375):
+A wrapper which converts an existing [BoundedSource](https://beam.apache.org/releases/pydoc/current/apache_beam.io.iobase.html#apache_beam.io.iobase.BoundedSource)
+implementation to a splittable DoFn.
+
+**Go Examples**
+ *  [textio.ReadSdf](https://github.com/apache/beam/blob/ce190e11332469ea59b6c9acf16ee7c673ccefdd/sdks/go/pkg/beam/io/textio/sdf.go#L40) implements reading from text files using a splittable DoFn.