You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@beam.apache.org by me...@apache.org on 2017/08/17 05:02:51 UTC

[beam-site] branch asf-site updated (0aabce4 -> b548e9b)

This is an automated email from the ASF dual-hosted git repository.

mergebot-role pushed a change to branch asf-site
in repository https://gitbox.apache.org/repos/asf/beam-site.git.


    from 0aabce4  Prepare repository for deployment.
     add 7d6b15c  Splittable DoFn blog post.
     add 3d89790  This closes #292
     new b548e9b  Prepare repository for deployment.

The 1 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 content/blog/index.html                            |  22 +
 content/feed.xml                                   | 599 +++++++++++++++++++--
 content/index.html                                 |  10 +-
 src/_data/authors.yml                              |   5 +-
 src/_posts/2017-08-04-splittable-do-fn.md          | 523 ++++++++++++++++++
 src/images/blog/splittable-do-fn/blocks.png        | Bin 0 -> 19493 bytes
 .../blog/splittable-do-fn/jdbcio-expansion.png     | Bin 0 -> 31429 bytes
 .../blog/splittable-do-fn/kafka-splitting.png      | Bin 0 -> 27762 bytes
 src/images/blog/splittable-do-fn/restrictions.png  | Bin 0 -> 34229 bytes
 .../blog/splittable-do-fn/transform-expansion.png  | Bin 0 -> 18690 bytes
 10 files changed, 1110 insertions(+), 49 deletions(-)
 create mode 100644 src/_posts/2017-08-04-splittable-do-fn.md
 create mode 100644 src/images/blog/splittable-do-fn/blocks.png
 create mode 100644 src/images/blog/splittable-do-fn/jdbcio-expansion.png
 create mode 100644 src/images/blog/splittable-do-fn/kafka-splitting.png
 create mode 100644 src/images/blog/splittable-do-fn/restrictions.png
 create mode 100644 src/images/blog/splittable-do-fn/transform-expansion.png

-- 
To stop receiving notification emails like this one, please contact
['"commits@beam.apache.org" <co...@beam.apache.org>'].

[beam-site] 01/01: Prepare repository for deployment.

Posted by me...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

mergebot-role pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/beam-site.git

commit b548e9ba51b29c1447bf4c68326d2602933d75d1
Author: Mergebot <me...@apache.org>
AuthorDate: Thu Aug 17 05:02:44 2017 +0000

    Prepare repository for deployment.
---
 content/blog/index.html |  22 ++
 content/feed.xml        | 599 ++++++++++++++++++++++++++++++++++++++++++++----
 content/index.html      |  10 +-
 3 files changed, 583 insertions(+), 48 deletions(-)

diff --git a/content/blog/index.html b/content/blog/index.html
index 22e9b93..59f355b 100644
--- a/content/blog/index.html
+++ b/content/blog/index.html
@@ -142,6 +142,28 @@
 <p>This is the blog for the Apache Beam project. This blog contains news and updates
 for the project.</p>
 
+<h3 id="a-classpost-link-hrefblog20170816splittable-do-fnhtmlpowerful-and-modular-io-connectors-with-splittable-dofn-in-apache-beama"><a class="post-link" href="/blog/2017/08/16/splittable-do-fn.html">Powerful and modular IO connectors with Splittable DoFn in Apache Beam</a></h3>
+<p><i>Aug 16, 2017 •  Eugene Kirpichov 
+</i></p>
+
+<p>One of the most important parts of the Apache Beam ecosystem is its quickly
+growing set of connectors that allow Beam pipelines to read and write data to
+various data storage systems (“IOs”). Currently, Beam ships <a href="/documentation/io/built-in/">over 20 IO
+connectors</a> with many more in
+active development. As user demands for IO connectors grew, our work on
+improving the related Beam APIs (in particular, the Source API) produced an
+unexpected result: a generalization of Beam’s most basic primitive, <code class="highlighter-rouge">DoFn</code>.</p>
+
+<!-- Render a "read more" button if the post is longer than the excerpt -->
+
+<p>
+<a class="btn btn-default btn-sm" href="/blog/2017/08/16/splittable-do-fn.html" role="button">
+Read more&nbsp;<span class="glyphicon glyphicon-menu-right" aria-hidden="true"></span>
+</a>
+</p>
+
+<hr />
+
 <h3 id="a-classpost-link-hrefblog20170517beam-first-stable-releasehtmlapache-beam-publishes-the-first-stable-releasea"><a class="post-link" href="/blog/2017/05/17/beam-first-stable-release.html">Apache Beam publishes the first stable release</a></h3>
 <p><i>May 17, 2017 •  Davor Bonaci [<a href="https://twitter.com/BonaciDavor">@BonaciDavor</a>] &amp; Dan Halperin 
 </i></p>
diff --git a/content/feed.xml b/content/feed.xml
index 021ea40..c405891 100644
--- a/content/feed.xml
+++ b/content/feed.xml
@@ -9,6 +9,562 @@
     <generator>Jekyll v3.2.0</generator>
     
       <item>
+        <title>Powerful and modular IO connectors with Splittable DoFn in Apache Beam</title>
+        <description>&lt;p&gt;One of the most important parts of the Apache Beam ecosystem is its quickly
+growing set of connectors that allow Beam pipelines to read and write data to
+various data storage systems (“IOs”). Currently, Beam ships &lt;a href=&quot;/documentation/io/built-in/&quot;&gt;over 20 IO
+connectors&lt;/a&gt; with many more in
+active development. As user demands for IO connectors grew, our work on
+improving the related Beam APIs (in particular, the Source API) produced an
+unexpected result: a generalization of Beam’s most basic primitive, &lt;code class=&quot;highlighter-rouge&quot;&gt;DoFn&lt;/code&gt;.&lt;/p&gt;
+
+&lt;!--more--&gt;
+
+&lt;h2 id=&quot;connectors-as-mini-pipelines&quot;&gt;Connectors as mini-pipelines&lt;/h2&gt;
+
+&lt;p&gt;One of the main reasons for this vibrant IO connector ecosystem is that
+developing a basic IO is relatively straightforward: many connector
+implementations are simply mini-pipelines (composite &lt;code class=&quot;highlighter-rouge&quot;&gt;PTransform&lt;/code&gt;s) made of the
+basic Beam &lt;code class=&quot;highlighter-rouge&quot;&gt;ParDo&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;GroupByKey&lt;/code&gt; primitives. For example,
+&lt;code class=&quot;highlighter-rouge&quot;&gt;ElasticsearchIO.write()&lt;/code&gt;
+&lt;a href=&quot;https://github.com/apache/beam/blob/f7e8f886c91ea9d0b51e00331eeb4484e2f6e000/sdks/java/io/elasticsearch/src/main/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.java#L783&quot;&gt;expands&lt;/a&gt;
+into a single &lt;code class=&quot;highlighter-rouge&quot;&gt;ParDo&lt;/code&gt; with some batching for performance; &lt;code class=&quot;highlighter-rouge&quot;&gt;JdcbIO.read()&lt;/code&gt;
+&lt;a href=&quot;https://github.com/apache/beam/blob/f7e8f886c91ea9d0b51e00331eeb4484e2f6e000/sdks/java/io/jdbc/src/main/java/org/apache/beam/sdk/io/jdbc/JdbcIO.java#L329&quot;&gt;expands&lt;/a&gt;
+into &lt;code class=&quot;highlighter-rouge&quot;&gt;Create.of(query)&lt;/code&gt;, a reshuffle to &lt;a href=&quot;https://cloud.google.com/dataflow/service/dataflow-service-desc#preventing-fusion&quot;&gt;prevent
+fusion&lt;/a&gt;,
+and &lt;code class=&quot;highlighter-rouge&quot;&gt;ParDo(execute sub-query)&lt;/code&gt;.  Some IOs
+&lt;a href=&quot;https://github.com/apache/beam/blob/8503adbbc3a590cd0dc2939f6a45d335682a9442/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L1139&quot;&gt;construct&lt;/a&gt;
+considerably more complicated pipelines.&lt;/p&gt;
+
+&lt;p&gt;&lt;img class=&quot;center-block&quot; src=&quot;/images/blog/splittable-do-fn/jdbcio-expansion.png&quot; alt=&quot;Expansion of the JdbcIO.read() composite transform&quot; width=&quot;600&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;This “mini-pipeline” approach is flexible, modular, and generalizes to data
+sources that read from a dynamically computed &lt;code class=&quot;highlighter-rouge&quot;&gt;PCollection&lt;/code&gt; of locations, such
+as
+&lt;a href=&quot;https://github.com/apache/beam/blob/f7e8f886c91ea9d0b51e00331eeb4484e2f6e000/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/SpannerIO.java#L222&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;SpannerIO.readAll()&lt;/code&gt;&lt;/a&gt;
+which reads the results of a &lt;code class=&quot;highlighter-rouge&quot;&gt;PCollection&lt;/code&gt; of queries from Cloud Spanner,
+compared to
+&lt;a href=&quot;https://github.com/apache/beam/blob/f7e8f886c91ea9d0b51e00331eeb4484e2f6e000/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/SpannerIO.java#L318&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;SpannerIO.read()&lt;/code&gt;&lt;/a&gt;
+which executes a single query. We believe such dynamic data sources are a very
+useful capability, often overlooked by other data processing frameworks.&lt;/p&gt;
+
+&lt;h2 id=&quot;when-pardo-and-groupbykey-are-not-enough&quot;&gt;When ParDo and GroupByKey are not enough&lt;/h2&gt;
+
+&lt;p&gt;Despite the flexibility of &lt;code class=&quot;highlighter-rouge&quot;&gt;ParDo&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;GroupByKey&lt;/code&gt; and their derivatives, in some
+cases building an efficient IO connector requires extra capabilities.&lt;/p&gt;
+
+&lt;p&gt;For example, imagine reading files using the sequence &lt;code class=&quot;highlighter-rouge&quot;&gt;ParDo(filepattern →
+expand into files)&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;ParDo(filename → read records)&lt;/code&gt;, or reading a Kafka topic
+using &lt;code class=&quot;highlighter-rouge&quot;&gt;ParDo(topic → list partitions)&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;ParDo(topic, partition → read
+records)&lt;/code&gt;. This approach has two big issues:&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;
+    &lt;p&gt;In the file example, some files might be much larger than others, so the
+second &lt;code class=&quot;highlighter-rouge&quot;&gt;ParDo&lt;/code&gt; may have very long individual &lt;code class=&quot;highlighter-rouge&quot;&gt;@ProcessElement&lt;/code&gt; calls. As a
+result, the pipeline can suffer from poor performance due to stragglers.&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;In the Kafka example, implementing the second &lt;code class=&quot;highlighter-rouge&quot;&gt;ParDo&lt;/code&gt; is &lt;em&gt;simply impossible&lt;/em&gt;
+with a regular &lt;code class=&quot;highlighter-rouge&quot;&gt;DoFn&lt;/code&gt;, because it would need to output an infinite number of
+records per each input element &lt;code class=&quot;highlighter-rouge&quot;&gt;topic, partition&lt;/code&gt; &lt;em&gt;(&lt;a href=&quot;/blog/2017/02/13/stateful-processing.html&quot;&gt;stateful processing&lt;/a&gt; comes close, but it
+has other limitations that make it insufficient for this task&lt;/em&gt;).&lt;/p&gt;
+  &lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;h2 id=&quot;beam-source-api&quot;&gt;Beam Source API&lt;/h2&gt;
+
+&lt;p&gt;Apache Beam historically provides a Source API
+(&lt;a href=&quot;/documentation/sdks/javadoc/2.0.0/org/apache/beam/sdk/io/BoundedSource.html&quot;&gt;BoundedSource&lt;/a&gt;
+and
+&lt;a href=&quot;/documentation/sdks/javadoc/2.0.0/org/apache/beam/sdk/io/UnboundedSource.html&quot;&gt;UnboundedSource&lt;/a&gt;) which does
+not have these limitations and allows development of efficient data sources for
+batch and streaming systems. Pipelines use this API via the
+&lt;a href=&quot;/documentation/sdks/javadoc/2.0.0/org/apache/beam/sdk/io/Read.html&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;Read.from(Source)&lt;/code&gt;&lt;/a&gt; built-in &lt;code class=&quot;highlighter-rouge&quot;&gt;PTransform&lt;/code&gt;.&lt;/p&gt;
+
+&lt;p&gt;The Source API is largely similar to that of most other data processing
+frameworks, and allows the system to read data in parallel using multiple
+workers, as well as checkpoint and resume reading from an unbounded data source.
+Additionally, the Beam
+&lt;a href=&quot;/documentation/sdks/javadoc/2.0.0/org/apache/beam/sdk/io/BoundedSource.html&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;BoundedSource&lt;/code&gt;&lt;/a&gt;
+API provides advanced features such as progress reporting and &lt;a href=&quot;/blog/2016/05/18/splitAtFraction-method.html&quot;&gt;dynamic
+rebalancing&lt;/a&gt;
+(which together enable autoscaling), and
+&lt;a href=&quot;/documentation/sdks/javadoc/2.0.0/org/apache/beam/sdk/io/UnboundedSource.html&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;UnboundedSource&lt;/code&gt;&lt;/a&gt; supports
+reporting the source’s watermark and backlog &lt;em&gt;(until SDF, we believed that
+“batch” and “streaming” data sources are fundamentally different and thus
+require fundamentally different APIs)&lt;/em&gt;.&lt;/p&gt;
+
+&lt;p&gt;Unfortunately, these features come at a price. Coding against the Source API
+involves a lot of boilerplate and is error-prone, and it does not compose well
+with the rest of the Beam model because a &lt;code class=&quot;highlighter-rouge&quot;&gt;Source&lt;/code&gt; can appear only at the root
+of a pipeline. For example:&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;
+    &lt;p&gt;Using the Source API, it is not possible to read a &lt;code class=&quot;highlighter-rouge&quot;&gt;PCollection&lt;/code&gt; of
+filepatterns.&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;A &lt;code class=&quot;highlighter-rouge&quot;&gt;Source&lt;/code&gt; can not read a side input, or wait on another pipeline step to
+produce the data.&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;A &lt;code class=&quot;highlighter-rouge&quot;&gt;Source&lt;/code&gt; can not emit an additional output (for example, records that failed to
+parse) and so on.&lt;/p&gt;
+  &lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;p&gt;The Source API is not composable even with itself. For example, suppose Alice
+implements an unbounded &lt;code class=&quot;highlighter-rouge&quot;&gt;Source&lt;/code&gt; that watches a directory for new matching
+files, and Bob implements an unbounded &lt;code class=&quot;highlighter-rouge&quot;&gt;Source&lt;/code&gt; that tails a file. The Source
+API does not let them simply chain the sources together and obtain a &lt;code class=&quot;highlighter-rouge&quot;&gt;Source&lt;/code&gt;
+that returns new records in new log files in a directory (a very common user
+request). Instead, such a source would have to be developed mostly from
+scratch, and our experience shows that a full-featured monolithic
+implementation of such a &lt;code class=&quot;highlighter-rouge&quot;&gt;Source&lt;/code&gt; is incredibly difficult and error-prone.&lt;/p&gt;
+
+&lt;p&gt;Another class of issues with the &lt;code class=&quot;highlighter-rouge&quot;&gt;Source&lt;/code&gt; API comes from its strict
+bounded/unbounded dichotomy:&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;
+    &lt;p&gt;It is difficult or impossible to reuse code between seemingly very similar
+bounded and unbounded sources, for example, the &lt;code class=&quot;highlighter-rouge&quot;&gt;BoundedSource&lt;/code&gt; that generates
+a sequence &lt;code class=&quot;highlighter-rouge&quot;&gt;[a, b)&lt;/code&gt; and the &lt;code class=&quot;highlighter-rouge&quot;&gt;UnboundedSource&lt;/code&gt; that generates a sequence &lt;code class=&quot;highlighter-rouge&quot;&gt;[a,
+inf)&lt;/code&gt; &lt;a href=&quot;https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/CountingSource.java&quot;&gt;don’t share any
+code&lt;/a&gt;
+in the Beam Java SDK.&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;It is not clear how to classify the ingestion of a very large and
+continuously growing dataset. Ingesting its “already available” part seems to
+require a &lt;code class=&quot;highlighter-rouge&quot;&gt;BoundedSource&lt;/code&gt;: the runner could benefit from knowing its size, and
+could perform dynamic rebalancing. However, ingesting the continuously arriving
+new data seems to require an &lt;code class=&quot;highlighter-rouge&quot;&gt;UnboundedSource&lt;/code&gt; for providing watermarks. From
+this angle, the &lt;code class=&quot;highlighter-rouge&quot;&gt;Source&lt;/code&gt; API has &lt;a href=&quot;https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101&quot;&gt;the same issues as Lambda
+Architecture&lt;/a&gt;.&lt;/p&gt;
+  &lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;p&gt;About two years ago we began thinking about how to address the limitations of
+the Source API, and ended up, surprisingly, addressing the limitations of
+&lt;code class=&quot;highlighter-rouge&quot;&gt;DoFn&lt;/code&gt; instead.&lt;/p&gt;
+
+&lt;h2 id=&quot;enter-splittable-dofn&quot;&gt;Enter Splittable DoFn&lt;/h2&gt;
+
+&lt;p&gt;&lt;a href=&quot;http://s.apache.org/splittable-do-fn&quot;&gt;Splittable DoFn&lt;/a&gt; (SDF) is a
+generalization of &lt;code class=&quot;highlighter-rouge&quot;&gt;DoFn&lt;/code&gt; that gives it the core capabilities of &lt;code class=&quot;highlighter-rouge&quot;&gt;Source&lt;/code&gt; while
+retaining &lt;code class=&quot;highlighter-rouge&quot;&gt;DoFn&lt;/code&gt;’s syntax, flexibility, modularity, and ease of coding.  As a
+result, it becomes possible to develop more powerful IO connectors than before,
+with shorter, simpler, more reusable code.&lt;/p&gt;
+
+&lt;p&gt;Note that, unlike &lt;code class=&quot;highlighter-rouge&quot;&gt;Source&lt;/code&gt;, SDF &lt;em&gt;does not&lt;/em&gt; have distinct bounded/unbounded APIs,
+just as regular &lt;code class=&quot;highlighter-rouge&quot;&gt;DoFn&lt;/code&gt;s don’t: there is only one API, which covers both of these
+use cases and anything in between. Thus, SDF closes the final gap in the unified
+batch/streaming programming model of Apache Beam.&lt;/p&gt;
+
+&lt;p&gt;When reading the explanation of SDF below, keep in mind the running example of a
+&lt;code class=&quot;highlighter-rouge&quot;&gt;DoFn&lt;/code&gt; that takes a filename as input and outputs the records in that file.
+People familiar with the &lt;code class=&quot;highlighter-rouge&quot;&gt;Source&lt;/code&gt; API may find it useful to think of SDF as a
+way to read a &lt;code class=&quot;highlighter-rouge&quot;&gt;PCollection&lt;/code&gt; of sources, treating the source itself as just
+another piece of data in the pipeline &lt;em&gt;(this, in fact, was one of the early
+design iterations among the work that led to creation of SDF)&lt;/em&gt;.&lt;/p&gt;
+
+&lt;p&gt;The two aspects where &lt;code class=&quot;highlighter-rouge&quot;&gt;Source&lt;/code&gt; has an advantage over a regular &lt;code class=&quot;highlighter-rouge&quot;&gt;DoFn&lt;/code&gt; are:&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;
+    &lt;p&gt;&lt;strong&gt;Splittability:&lt;/strong&gt; applying a &lt;code class=&quot;highlighter-rouge&quot;&gt;DoFn&lt;/code&gt; to a single element is &lt;em&gt;monolithic&lt;/em&gt;, but
+reading from a &lt;code class=&quot;highlighter-rouge&quot;&gt;Source&lt;/code&gt; is &lt;em&gt;non-monolithic&lt;/em&gt;. The whole &lt;code class=&quot;highlighter-rouge&quot;&gt;Source&lt;/code&gt; doesn’t have to
+be read at once; rather, it is read in parts, called &lt;em&gt;bundles&lt;/em&gt;. For example, a
+large file is usually read in several bundles, each reading some sub-range of
+offsets within the file. Likewise, a Kafka topic (which, of course, can never
+be read “fully”) is read over an infinite number of bundles, each reading some
+finite number of elements.&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;&lt;strong&gt;Interaction with the runner:&lt;/strong&gt; runners apply a &lt;code class=&quot;highlighter-rouge&quot;&gt;DoFn&lt;/code&gt; to a single element as
+a “black box”, but interact quite richly with &lt;code class=&quot;highlighter-rouge&quot;&gt;Source&lt;/code&gt;. &lt;code class=&quot;highlighter-rouge&quot;&gt;Source&lt;/code&gt; provides the
+runner with information such as its estimated size (or its generalization,
+“backlog”), progress through reading the bundle, watermarks etc. The runner
+uses this information to tune the execution and control the breakdown of the
+&lt;code class=&quot;highlighter-rouge&quot;&gt;Source&lt;/code&gt; into bundles. For example, a slowly progressing large bundle of a file
+may be &lt;a href=&quot;https://cloud.google.com/blog/big-data/2016/05/no-shard-left-behind-dynamic-work-rebalancing-in-google-cloud-dataflow&quot;&gt;dynamically
+split&lt;/a&gt;
+by a batch-focused runner before it becomes a straggler, and a latency-focused
+streaming runner may control how many elements it reads from a source in each
+bundle to optimize for latency vs. per-bundle overhead.&lt;/p&gt;
+  &lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;h3 id=&quot;non-monolithic-element-processing-with-restrictions&quot;&gt;Non-monolithic element processing with restrictions&lt;/h3&gt;
+
+&lt;p&gt;Splittable &lt;code class=&quot;highlighter-rouge&quot;&gt;DoFn&lt;/code&gt; supports &lt;code class=&quot;highlighter-rouge&quot;&gt;Source&lt;/code&gt;-like features by allowing the processing of
+a single element to be non-monolithic.&lt;/p&gt;
+
+&lt;p&gt;The processing of one element by an SDF is decomposed into a (potentially
+infinite) number of &lt;em&gt;restrictions&lt;/em&gt;, each describing some part of the work to be
+done for the whole element. The input to an SDF’s &lt;code class=&quot;highlighter-rouge&quot;&gt;@ProcessElement&lt;/code&gt; call is a
+pair of an element and a restriction (compared to a regular &lt;code class=&quot;highlighter-rouge&quot;&gt;DoFn&lt;/code&gt;, which takes
+just the element).&lt;/p&gt;
+
+&lt;p&gt;Processing of every element starts by creating an &lt;em&gt;initial restriction&lt;/em&gt; that
+describes the entire work, and the initial restriction is then split further
+into sub-restrictions which must logically add up to the original. For example,
+for a splittable &lt;code class=&quot;highlighter-rouge&quot;&gt;DoFn&lt;/code&gt; called &lt;code class=&quot;highlighter-rouge&quot;&gt;ReadFn&lt;/code&gt; that takes a filename and outputs
+records in the file, the restriction may be a pair of starting and ending byte
+offset, and &lt;code class=&quot;highlighter-rouge&quot;&gt;ReadFn&lt;/code&gt; may interpret it as &lt;em&gt;read records whose starting offsets
+are in the given range&lt;/em&gt;.&lt;/p&gt;
+
+&lt;p&gt;&lt;img class=&quot;center-block&quot; src=&quot;/images/blog/splittable-do-fn/restrictions.png&quot; alt=&quot;Specifying parts of work for an element using restrictions&quot; width=&quot;600&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;The idea of restrictions provides non-monolithic execution - the first
+ingredient for parity with &lt;code class=&quot;highlighter-rouge&quot;&gt;Source&lt;/code&gt;. The other ingredient is &lt;em&gt;interaction with
+the runner&lt;/em&gt;: the runner has access to the restriction of each active
+&lt;code class=&quot;highlighter-rouge&quot;&gt;@ProcessElement&lt;/code&gt; call of an SDF, can inquire about the progress of the call,
+and most importantly, can &lt;em&gt;split&lt;/em&gt; the restriction while it is being processed
+(hence the name &lt;em&gt;Splittable DoFn&lt;/em&gt;).&lt;/p&gt;
+
+&lt;p&gt;Splitting produces a &lt;em&gt;primary&lt;/em&gt; and &lt;em&gt;residual&lt;/em&gt; restriction that add up to the
+original restriction being split: the current &lt;code class=&quot;highlighter-rouge&quot;&gt;@ProcessElement&lt;/code&gt; call keeps
+processing the primary, and the residual will be processed by another
+&lt;code class=&quot;highlighter-rouge&quot;&gt;@ProcessElement&lt;/code&gt; call. For example, a runner may schedule the residual to be
+processed in parallel on another worker.&lt;/p&gt;
+
+&lt;p&gt;Splitting of a running &lt;code class=&quot;highlighter-rouge&quot;&gt;@ProcessElement&lt;/code&gt; call has two critically important uses:&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;&lt;strong&gt;Supporting infinite work per element.&lt;/strong&gt; A restriction is, in general, not
+required to describe a finite amount of work. For example, reading from a Kafka
+topic starting from offset &lt;em&gt;100&lt;/em&gt; can be represented by the
+restriction &lt;em&gt;[100, inf)&lt;/em&gt;. A &lt;code class=&quot;highlighter-rouge&quot;&gt;@ProcessElement&lt;/code&gt; call processing this
+entire restriction would, of course, never complete. However, while such a call
+runs, a runner can split the restriction into a &lt;em&gt;finite&lt;/em&gt; primary &lt;em&gt;[100, 150)&lt;/em&gt;
+(letting the current call complete this part) and an &lt;em&gt;infinite&lt;/em&gt; residual &lt;em&gt;[150,
+inf)&lt;/em&gt; to be processed later, effectively checkpointing and resuming the call;
+this can be repeated forever.&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;p&gt;&lt;img class=&quot;center-block&quot; src=&quot;/images/blog/splittable-do-fn/kafka-splitting.png&quot; alt=&quot;Splitting an infinite restriction into a finite primary and infinite residual&quot; width=&quot;400&quot; /&gt;&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;&lt;strong&gt;Dynamic rebalancing.&lt;/strong&gt; When a (typically batch-focused) runner detects that
+a &lt;code class=&quot;highlighter-rouge&quot;&gt;@ProcessElement&lt;/code&gt; call is going to take too long and become a straggler, it
+can split the restriction in some proportion so that the primary is short enough
+to not be a straggler, and can schedule the residual in parallel on another
+worker. For details, see &lt;a href=&quot;https://cloud.google.com/blog/big-data/2016/05/no-shard-left-behind-dynamic-work-rebalancing-in-google-cloud-dataflow&quot;&gt;No Shard Left
+Behind&lt;/a&gt;.&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;p&gt;Logically, the execution of an SDF on an element works according to the
+following diagram, where “magic” stands for the runner-specific ability to split
+the restrictions and schedule processing of residuals.&lt;/p&gt;
+
+&lt;p&gt;&lt;img class=&quot;center-block&quot; src=&quot;/images/blog/splittable-do-fn/transform-expansion.png&quot; alt=&quot;Execution of an SDF - pairing with a restriction, splitting     restrictions, processing element/restriction pairs&quot; width=&quot;600&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;This diagram emphasizes that splittability is an implementation detail of the
+particular &lt;code class=&quot;highlighter-rouge&quot;&gt;DoFn&lt;/code&gt;: a splittable &lt;code class=&quot;highlighter-rouge&quot;&gt;DoFn&lt;/code&gt; still looks like a &lt;code class=&quot;highlighter-rouge&quot;&gt;DoFn&amp;lt;A, B&amp;gt;&lt;/code&gt; to its
+user, and can be applied via a &lt;code class=&quot;highlighter-rouge&quot;&gt;ParDo&lt;/code&gt; to a &lt;code class=&quot;highlighter-rouge&quot;&gt;PCollection&amp;lt;A&amp;gt;&lt;/code&gt; producing a
+&lt;code class=&quot;highlighter-rouge&quot;&gt;PCollection&amp;lt;B&amp;gt;&lt;/code&gt;.&lt;/p&gt;
+
+&lt;h3 id=&quot;which-dofns-need-to-be-splittable&quot;&gt;Which DoFns need to be splittable&lt;/h3&gt;
+
+&lt;p&gt;Note that decomposition of an element into element/restriction pairs is not
+automatic or “magical”: SDF is a new API for &lt;em&gt;authoring&lt;/em&gt; a &lt;code class=&quot;highlighter-rouge&quot;&gt;DoFn&lt;/code&gt;, rather than a
+new way to &lt;em&gt;execute&lt;/em&gt; an existing &lt;code class=&quot;highlighter-rouge&quot;&gt;DoFn&lt;/code&gt;. When making a &lt;code class=&quot;highlighter-rouge&quot;&gt;DoFn&lt;/code&gt; splittable, the
+author needs to:&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;
+    &lt;p&gt;Consider the structure of the work it does for every element.&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;Come up with a scheme for describing parts of this work using restrictions.&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;Write code for creating the initial restriction, splitting it, and executing
+an element/restriction pair.&lt;/p&gt;
+  &lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;p&gt;An overwhelming majority of &lt;code class=&quot;highlighter-rouge&quot;&gt;DoFn&lt;/code&gt;s found in user pipelines do not need to be
+made splittable: SDF is an advanced, powerful API, primarily targeting authors
+of new IO connectors &lt;em&gt;(though it has interesting non-IO applications as well:
+see &lt;a href=&quot;http://s.apache.org/splittable-do-fn#heading=h.5cep9s8k4fxv&quot;&gt;Non-IO examples&lt;/a&gt;)&lt;/em&gt;.&lt;/p&gt;
+
+&lt;h3 id=&quot;execution-of-a-restriction-and-data-consistency&quot;&gt;Execution of a restriction and data consistency&lt;/h3&gt;
+
+&lt;p&gt;One of the most important parts of the Splittable &lt;code class=&quot;highlighter-rouge&quot;&gt;DoFn&lt;/code&gt; design is related to
+how it achieves data consistency while splitting. For example, while the runner
+is preparing to split the restriction of an active &lt;code class=&quot;highlighter-rouge&quot;&gt;@ProcessElement&lt;/code&gt; call, how
+can it be sure that the call has not concurrently progressed past the point of
+splitting?&lt;/p&gt;
+
+&lt;p&gt;This is achieved by requiring the processing of a restriction to follow a
+certain pattern. We think of a restriction as a sequence of &lt;em&gt;blocks&lt;/em&gt; -
+elementary indivisible units of work, identified by a &lt;em&gt;position&lt;/em&gt;. A
+&lt;code class=&quot;highlighter-rouge&quot;&gt;@ProcessElement&lt;/code&gt; call processes the blocks one by one, first &lt;em&gt;claiming&lt;/em&gt; the
+block’s position to atomically check if it’s still within the range of the
+restriction, until the whole restriction is processed.&lt;/p&gt;
+
+&lt;p&gt;The diagram below illustrates this for &lt;code class=&quot;highlighter-rouge&quot;&gt;ReadFn&lt;/code&gt; (a splittable &lt;code class=&quot;highlighter-rouge&quot;&gt;DoFn&lt;/code&gt; that reads
+Avro files) processing the element &lt;code class=&quot;highlighter-rouge&quot;&gt;foo.avro&lt;/code&gt; with restriction &lt;code class=&quot;highlighter-rouge&quot;&gt;[30, 70)&lt;/code&gt;. This
+&lt;code class=&quot;highlighter-rouge&quot;&gt;@ProcessElement&lt;/code&gt; call scans the Avro file for &lt;a href=&quot;https://avro.apache.org/docs/current/spec.html#Object+Container+Files&quot;&gt;data
+blocks&lt;/a&gt;
+starting from offset &lt;code class=&quot;highlighter-rouge&quot;&gt;30&lt;/code&gt; and claims the position of each block in this range.
+If a block is claimed successfully, then the call outputs all records in this
+data block, otherwise, it terminates.&lt;/p&gt;
+
+&lt;p&gt;&lt;img class=&quot;center-block&quot; src=&quot;/images/blog/splittable-do-fn/blocks.png&quot; alt=&quot;Processing a restriction by claiming blocks inside it&quot; width=&quot;400&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;For more details, see &lt;a href=&quot;http://s.apache.org/splittable-do-fn#heading=h.vjs7pzbb7kw&quot;&gt;Restrictions, blocks and
+positions&lt;/a&gt; in the
+design proposal document.&lt;/p&gt;
+
+&lt;h3 id=&quot;code-example&quot;&gt;Code example&lt;/h3&gt;
+
+&lt;p&gt;Let us look at some examples of SDF code. The examples use the Beam Java SDK,
+which &lt;a href=&quot;https://github.com/apache/beam/blob/f7e8f886c91ea9d0b51e00331eeb4484e2f6e000/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/DoFn.java#L527&quot;&gt;represents splittable
+&lt;code class=&quot;highlighter-rouge&quot;&gt;DoFn&lt;/code&gt;s&lt;/a&gt;
+as part of the flexible &lt;a href=&quot;http://s.apache.org/a-new-dofn&quot;&gt;annotation-based
+&lt;code class=&quot;highlighter-rouge&quot;&gt;DoFn&lt;/code&gt;&lt;/a&gt; machinery, and the &lt;a href=&quot;https://s.apache.org/splittable-do-fn-python&quot;&gt;proposed SDF syntax
+for Python&lt;/a&gt;.&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;
+    &lt;p&gt;A splittable &lt;code class=&quot;highlighter-rouge&quot;&gt;DoFn&lt;/code&gt; is a &lt;code class=&quot;highlighter-rouge&quot;&gt;DoFn&lt;/code&gt; - no new base class needed. Any SDF derives
+from the &lt;code class=&quot;highlighter-rouge&quot;&gt;DoFn&lt;/code&gt; class and has a &lt;code class=&quot;highlighter-rouge&quot;&gt;@ProcessElement&lt;/code&gt; method.&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;The &lt;code class=&quot;highlighter-rouge&quot;&gt;@ProcessElement&lt;/code&gt; method takes an additional
+&lt;a href=&quot;https://github.com/apache/beam/blob/f7e8f886c91ea9d0b51e00331eeb4484e2f6e000/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/splittabledofn/RestrictionTracker.java&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;RestrictionTracker&lt;/code&gt;&lt;/a&gt;
+parameter that gives access to the current restriction in addition to the
+current element.&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;An SDF needs to define a &lt;code class=&quot;highlighter-rouge&quot;&gt;@GetInitialRestriction&lt;/code&gt; method that can create a
+restriction describing the complete work for a given element.&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;There are several less important optional methods, such as
+&lt;code class=&quot;highlighter-rouge&quot;&gt;@SplitRestriction&lt;/code&gt; for pre-splitting the initial restriction into several
+smaller restrictions, and a few others.&lt;/p&gt;
+  &lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;p&gt;The “Hello World” of SDF is a counter, which takes pairs &lt;em&gt;(x, N)&lt;/em&gt; as input and
+produces pairs &lt;em&gt;(x, 0), (x, 1), …, (x, N-1)&lt;/em&gt; as output.&lt;/p&gt;
+
+&lt;div class=&quot;language-java highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kd&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;CountFn&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;extends&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;DoFn&lt;/span&gt;&lt;span class=&quo [...]
+  &lt;span class=&quot;nd&quot;&gt;@ProcessElement&lt;/span&gt;
+  &lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;process&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ProcessContext&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;OffsetRangeTracker&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tracker&lt;/span&gt;&lt;span class=&qu [...]
+    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;long&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tracker&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;currentRestriction&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;getFr [...]
+      &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;output&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;KV&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;of&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&q [...]
+    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
+  &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
+
+  &lt;span class=&quot;nd&quot;&gt;@GetInitialRestriction&lt;/span&gt;
+  &lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;OffsetRange&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;getInitialRange&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;KV&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Long&lt;/span&gt;&lt;span class=&quot;o&quot;&gt [...]
+    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;OffsetRange&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0L&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;element&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;getValue&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;());& [...]
+  &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
+&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
+
+&lt;span class=&quot;n&quot;&gt;PCollection&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;KV&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Long&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;input&lt;/span&gt; &lt;span class=&quot;o& [...]
+&lt;span class=&quot;n&quot;&gt;PCollection&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;KV&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Long&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;output&lt;/span&gt; &lt;span class=&quot;o [...]
+    &lt;span class=&quot;n&quot;&gt;ParDo&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;of&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;CountFn&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;());&lt;/span&gt;
+&lt;/code&gt;&lt;/pre&gt;
+&lt;/div&gt;
+
+&lt;div class=&quot;language-py highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;CountFn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;DoFn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
+  &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;process&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;element&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tracker&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;DoFn&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;RestrictionTracker [...]
+    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;xrange&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tracker&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;current_restriction&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()):& [...]
+      &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;not&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tracker&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;try_claim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
+        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt;
+      &lt;span class=&quot;k&quot;&gt;yield&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;element&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;
+        
+  &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;get_initial_restriction&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;element&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
+    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;element&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
+&lt;/code&gt;&lt;/pre&gt;
+&lt;/div&gt;
+
+&lt;p&gt;This short &lt;code class=&quot;highlighter-rouge&quot;&gt;DoFn&lt;/code&gt; subsumes the functionality of
+&lt;a href=&quot;https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/CountingSource.java&quot;&gt;CountingSource&lt;/a&gt;,
+but is more flexible: &lt;code class=&quot;highlighter-rouge&quot;&gt;CountingSource&lt;/code&gt; generates only one sequence specified at
+pipeline construction time, while this &lt;code class=&quot;highlighter-rouge&quot;&gt;DoFn&lt;/code&gt; can generate a dynamic family of
+sequences, one per element in the input collection (it does not matter whether
+the input collection is bounded or unbounded).&lt;/p&gt;
+
+&lt;p&gt;However, the &lt;code class=&quot;highlighter-rouge&quot;&gt;Source&lt;/code&gt;-specific capabilities of &lt;code class=&quot;highlighter-rouge&quot;&gt;CountingSource&lt;/code&gt; are still
+available in &lt;code class=&quot;highlighter-rouge&quot;&gt;CountFn&lt;/code&gt;. For example, if a sequence has a lot of elements, a
+batch-focused runner can still apply dynamic rebalancing to it and generate
+different subranges of the sequence in parallel by splitting the &lt;code class=&quot;highlighter-rouge&quot;&gt;OffsetRange&lt;/code&gt;.
+Likewise, a streaming-focused runner can use the same splitting logic to
+checkpoint and resume the generation of the sequence even if it is, for
+practical purposes, infinite (for example, when applied to a &lt;code class=&quot;highlighter-rouge&quot;&gt;KV(...,
+Long.MAX_VALUE)&lt;/code&gt;).&lt;/p&gt;
+
+&lt;p&gt;A slightly more complex example is the &lt;code class=&quot;highlighter-rouge&quot;&gt;ReadFn&lt;/code&gt; considered above, which reads
+data from Avro files and illustrates the idea of &lt;em&gt;blocks&lt;/em&gt;: we provide pseudocode
+to illustrate the approach.&lt;/p&gt;
+
+&lt;div class=&quot;language-java highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kd&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;ReadFn&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;extends&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;DoFn&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot; [...]
+  &lt;span class=&quot;nd&quot;&gt;@ProcessElement&lt;/span&gt;
+  &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;process&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ProcessContext&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;OffsetRangeTracker&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tracker&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&q [...]
+    &lt;span class=&quot;k&quot;&gt;try&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AvroReader&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;reader&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Avro&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;open&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;filename&lt;/sp [...]
+      &lt;span class=&quot;c1&quot;&gt;// Seek to the first block starting at or after the start offset.&lt;/span&gt;
+      &lt;span class=&quot;n&quot;&gt;reader&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;seek&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tracker&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;currentRestriction&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;getFrom&lt;/span&gt;&lt;span class=&quot;o&quot;&g [...]
+      &lt;span class=&quot;k&quot;&gt;while&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;reader&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;readNextBlock&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;())&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
+        &lt;span class=&quot;c1&quot;&gt;// Claim the position of the current Avro block&lt;/span&gt;
+        &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tracker&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;tryClaim&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;reader&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;currentBlockOffset&lt;/span&gt;&lt;span class=&quot;o&quot;&g [...]
+          &lt;span class=&quot;c1&quot;&gt;// Out of range of the current restriction - we're done.&lt;/span&gt;
+          &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
+        &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
+        &lt;span class=&quot;c1&quot;&gt;// Emit all records in this block&lt;/span&gt;
+        &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AvroRecord&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;record&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;reader&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;currentBlock&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;())&lt;/span&gt; &lt;span class=&quot;o&quot;&g [...]
+          &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;output&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;record&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
+        &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
+      &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
+    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
+  &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
+
+  &lt;span class=&quot;nd&quot;&gt;@GetInitialRestriction&lt;/span&gt;
+  &lt;span class=&quot;n&quot;&gt;OffsetRange&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;getInitialRestriction&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;filename&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
+    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;OffsetRange&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;File&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;filename&lt;/s [...]
+  &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
+&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
+&lt;/code&gt;&lt;/pre&gt;
+&lt;/div&gt;
+
+&lt;div class=&quot;language-py highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;AvroReader&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;DoFn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
+  &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;process&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;filename&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tracker&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;DoFn&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;RestrictionTracke [...]
+    &lt;span class=&quot;k&quot;&gt;with&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;fileio&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ChannelFactory&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;open&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;filename&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/s [...]
+      &lt;span class=&quot;n&quot;&gt;start&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;stop&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tracker&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;current_restriction&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
+      &lt;span class=&quot;c&quot;&gt;# Seek to the first block starting at or after the start offset.&lt;/span&gt;
+      &lt;span class=&quot;nb&quot;&gt;file&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;seek&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;start&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
+      &lt;span class=&quot;n&quot;&gt;block&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AvroUtils&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get_next_block&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
+      &lt;span class=&quot;k&quot;&gt;while&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;block&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
+        &lt;span class=&quot;c&quot;&gt;# Claim the position of the current Avro block&lt;/span&gt;
+        &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;not&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tracker&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;try_claim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;block&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;start&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()):&lt;/span&gt;
+          &lt;span class=&quot;c&quot;&gt;# Out of range of the current restriction - we're done.&lt;/span&gt;
+          &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt;
+        &lt;span class=&quot;c&quot;&gt;# Emit all records in this block&lt;/span&gt;
+        &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;record&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;block&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;records&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;():&lt;/span&gt;
+          &lt;span class=&quot;k&quot;&gt;yield&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;record&lt;/span&gt;
+        &lt;span class=&quot;n&quot;&gt;block&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AvroUtils&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get_next_block&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
+        
+  &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;get_initial_restriction&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;filename&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
+    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;fileio&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ChannelFactory&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size_in_bytes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt [...]
+&lt;/code&gt;&lt;/pre&gt;
+&lt;/div&gt;
+
+&lt;p&gt;This hypothetical &lt;code class=&quot;highlighter-rouge&quot;&gt;DoFn&lt;/code&gt; reads records from a single Avro file. Notably missing
+is the code for expanding a filepattern: it no longer needs to be part of this
+&lt;code class=&quot;highlighter-rouge&quot;&gt;DoFn&lt;/code&gt;! Instead, the SDK includes a
+&lt;a href=&quot;https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Match.java&quot;&gt;Match.filepatterns()&lt;/a&gt;
+transform for expanding a filepattern into a &lt;code class=&quot;highlighter-rouge&quot;&gt;PCollection&lt;/code&gt; of filenames, and
+different file format IOs can reuse the same transform, reading the files with
+different &lt;code class=&quot;highlighter-rouge&quot;&gt;DoFn&lt;/code&gt;s.&lt;/p&gt;
+
+&lt;p&gt;This example demonstrates the benefits of increased modularity allowed by SDF:
+&lt;code class=&quot;highlighter-rouge&quot;&gt;Match&lt;/code&gt; supports continuous ingestion of new files in streaming pipelines using
+&lt;code class=&quot;highlighter-rouge&quot;&gt;.continuously()&lt;/code&gt;, and this functionality becomes automatically available to
+various file format IOs. For example, &lt;code class=&quot;highlighter-rouge&quot;&gt;TextIO.read().watchForNewFiles()&lt;/code&gt; &lt;a href=&quot;https://github.com/apache/beam/blob/f7e8f886c91ea9d0b51e00331eeb4484e2f6e000/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L480&quot;&gt;uses
+&lt;code class=&quot;highlighter-rouge&quot;&gt;Match&lt;/code&gt; under the
+hood)&lt;/a&gt;.&lt;/p&gt;
+
+&lt;h2 id=&quot;current-status&quot;&gt;Current status&lt;/h2&gt;
+
+&lt;p&gt;Splittable &lt;code class=&quot;highlighter-rouge&quot;&gt;DoFn&lt;/code&gt; is a major new API, and its delivery and widespread adoption
+involves a lot of work in different parts of the Apache Beam ecosystem.  Some
+of that work is already complete and provides direct benefit to users via new
+IO connectors. However, a large amount of work is in progress or planned.&lt;/p&gt;
+
+&lt;p&gt;As of August 2017, SDF is available for use in the Beam Java Direct runner and
+Dataflow Streaming runner, and implementation is in progress in the Flink and
+Apex runners; see &lt;a href=&quot;/documentation/runners/capability-matrix/&quot;&gt;capability matrix&lt;/a&gt; for the current status. Support
+for SDF in the Python SDK is &lt;a href=&quot;https://s.apache.org/splittable-do-fn-python&quot;&gt;in active
+development&lt;/a&gt;.&lt;/p&gt;
+
+&lt;p&gt;Several SDF-based transforms and IO connectors are available for Beam users at
+HEAD and will be included in Beam 2.2.0. &lt;code class=&quot;highlighter-rouge&quot;&gt;TextIO&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;AvroIO&lt;/code&gt; finally provide
+continuous ingestion of files (one of the most frequently requested features)
+via &lt;code class=&quot;highlighter-rouge&quot;&gt;.watchForNewFiles()&lt;/code&gt; which is backed by the utility transforms
+&lt;code class=&quot;highlighter-rouge&quot;&gt;Match.filepatterns().continuously()&lt;/code&gt; and the more general
+&lt;a href=&quot;https://github.com/apache/beam/blob/f7e8f886c91ea9d0b51e00331eeb4484e2f6e000/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Watch.java&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;Watch.growthOf()&lt;/code&gt;&lt;/a&gt;.
+These utility transforms are also independently useful for “power user” use cases.&lt;/p&gt;
+
+&lt;p&gt;To enable more flexible use cases for IOs currently based on the Source API, we
+will change them to use SDF. This transition is &lt;a href=&quot;http://s.apache.org/textio-sdf&quot;&gt;pioneered by
+TextIO&lt;/a&gt; and involves temporarily &lt;a href=&quot;http://s.apache.org/sdf-via-source&quot;&gt;executing SDF
+via the Source API&lt;/a&gt; to support runners
+lacking the ability to run SDF directly.&lt;/p&gt;
+
+&lt;p&gt;In addition to enabling new IOs, work on SDF has influenced our thinking about
+other parts of the Beam programming model:&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;
+    &lt;p&gt;SDF unified the final remaining part of the Beam programming model that was
+not batch/streaming agnostic (the &lt;code class=&quot;highlighter-rouge&quot;&gt;Source&lt;/code&gt; API). This led us to consider use
+cases that cannot be described as purely batch or streaming (for example,
+ingesting a large amount of historical data and carrying on with more data
+arriving in real time) and to develop a &lt;a href=&quot;http://s.apache.org/beam-fn-api-progress-reporting&quot;&gt;unified notion of “progress” and
+“backlog”&lt;/a&gt;.&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;The &lt;a href=&quot;http://s.apache.org/beam-fn-api&quot;&gt;Fn API&lt;/a&gt; - the foundation of Beam’s
+future support for cross-language pipelines - uses SDF as &lt;em&gt;the only&lt;/em&gt; concept
+representing data ingestion.&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;Implementation of SDF has lead to &lt;a href=&quot;https://lists.apache.org/thread.html/86831496a08fe148e3b982cdb904f828f262c0b571543a9fed7b915d@%3Cdev.beam.apache.org%3E&quot;&gt;formalizing pipeline termination
+semantics&lt;/a&gt;
+and making it consistent between runners.&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;SDF set a new standard for how modular IO connectors can be, inspiring
+creation of similar APIs for some non-SDF-based connectors (for example,
+&lt;code class=&quot;highlighter-rouge&quot;&gt;SpannerIO.readAll()&lt;/code&gt; and the
+&lt;a href=&quot;https://issues.apache.org/jira/browse/BEAM-2706&quot;&gt;planned&lt;/a&gt; &lt;code class=&quot;highlighter-rouge&quot;&gt;JdbcIO.readAll()&lt;/code&gt;).&lt;/p&gt;
+  &lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;h2 id=&quot;call-to-action&quot;&gt;Call to action&lt;/h2&gt;
+
+&lt;p&gt;Apache Beam thrives on having a large community of contributors. Here are some
+ways you can get involved in the SDF effort and help make the Beam IO connector
+ecosystem more modular:&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;
+    &lt;p&gt;Use the currently available SDF-based IO connectors, provide feedback, file
+bugs, and suggest or implement improvements.&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;Propose or develop a new IO connector based on SDF.&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;Implement or improve support for SDF in your favorite runner.&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;Subscribe and contribute to the occasional SDF-related discussions on
+&lt;a href=&quot;mailto:user@beam.apache.org&quot;&gt;user@beam.apache.org&lt;/a&gt; (mailing list for Beam
+users) and &lt;a href=&quot;mailto:dev@beam.apache.org&quot;&gt;dev@beam.apache.org&lt;/a&gt; (mailing list for
+Beam developers)!&lt;/p&gt;
+  &lt;/li&gt;
+&lt;/ul&gt;
+</description>
+        <pubDate>Wed, 16 Aug 2017 01:00:01 -0700</pubDate>
+        <link>https://beam.apache.org/blog/2017/08/16/splittable-do-fn.html</link>
+        <guid isPermaLink="true">https://beam.apache.org/blog/2017/08/16/splittable-do-fn.html</guid>
+        
+        
+        <category>blog</category>
+        
+      </item>
+    
+      <item>
         <title>Apache Beam publishes the first stable release</title>
         <description>&lt;p&gt;The Apache Beam community is pleased to &lt;a href=&quot;https://blogs.apache.org/foundation/entry/the-apache-software-foundation-announces12&quot;&gt;announce the availability of version 2.0.0&lt;/a&gt;. This is the first stable release of Apache Beam, signifying a statement from the community that it intends to maintain API stability with all releases for the foreseeable future, and making Beam suitable for enterprise deployment.&lt;/p&gt;
 
@@ -1342,48 +1898,5 @@ Java SDK. If you have questions or comments, we’d love to hear them on the
         
       </item>
     
-      <item>
-        <title>The first release of Apache Beam!</title>
-        <description>&lt;p&gt;I’m happy to announce that Apache Beam has officially released its first
-version – 0.1.0-incubating. This is an exciting milestone for the project,
-which joined the Apache Software Foundation and the Apache Incubator earlier
-this year.&lt;/p&gt;
-
-&lt;!--more--&gt;
-
-&lt;p&gt;This release publishes the first set of Apache Beam binaries and source code,
-making them readily available for our users. The initial release includes the
-SDK for Java, along with three runners: Apache Flink, Apache Spark and Google
-Cloud Dataflow, a fully-managed cloud service. The release is available both
-in the &lt;a href=&quot;http://search.maven.org/#search%7Cga%7C1%7Cg%3A%22org.apache.beam%22&quot;&gt;Maven Central Repository&lt;/a&gt;,
-as well as a download from the &lt;a href=&quot;/get-started/downloads/&quot;&gt;project’s website&lt;/a&gt;.&lt;/p&gt;
-
-&lt;p&gt;The goal of this release was process-oriented. In particular, the Beam
-community wanted to release existing functionality to our users, build and
-validate the release processes, and obtain validation from the Apache Software
-Foundation and the Apache Incubator.&lt;/p&gt;
-
-&lt;p&gt;I’d like to encourage everyone to try out this release. Please keep in mind
-that this is the first incubating release – significant changes are to be
-expected. As we march toward stability, a rapid cadence of future releases is
-anticipated, perhaps one every 1-2 months.&lt;/p&gt;
-
-&lt;p&gt;As always, the Beam community welcomes feedback. Stabilization, usability and
-the developer experience will be our focus for the next several months. If you
-have any comments or discover any issues, I’d like to invite you to reach out
-to us via &lt;a href=&quot;/get-started/support/&quot;&gt;user’s mailing list&lt;/a&gt; or the
-&lt;a href=&quot;https://issues.apache.org/jira/browse/BEAM/&quot;&gt;Apache JIRA issue tracker&lt;/a&gt;.&lt;/p&gt;
-</description>
-        <pubDate>Wed, 15 Jun 2016 00:00:01 -0700</pubDate>
-        <link>https://beam.apache.org/beam/release/2016/06/15/first-release.html</link>
-        <guid isPermaLink="true">https://beam.apache.org/beam/release/2016/06/15/first-release.html</guid>
-        
-        
-        <category>beam</category>
-        
-        <category>release</category>
-        
-      </item>
-    
   </channel>
 </rss>
diff --git a/content/index.html b/content/index.html
index 3773e8d..3d64303 100644
--- a/content/index.html
+++ b/content/index.html
@@ -164,6 +164,11 @@
           </div>
           <div class="hero__blog__cards">
             
+            <a class="hero__blog__cards__card" href="/blog/2017/08/16/splittable-do-fn.html">
+              <div class="hero__blog__cards__card__title">Powerful and modular IO connectors with Splittable DoFn in Apache Beam</div>
+              <div class="hero__blog__cards__card__date">Aug 16, 2017</div>
+            </a>
+            
             <a class="hero__blog__cards__card" href="/blog/2017/05/17/beam-first-stable-release.html">
               <div class="hero__blog__cards__card__title">Apache Beam publishes the first stable release</div>
               <div class="hero__blog__cards__card__date">May 17, 2017</div>
@@ -174,11 +179,6 @@
               <div class="hero__blog__cards__card__date">Mar 16, 2017</div>
             </a>
             
-            <a class="hero__blog__cards__card" href="/blog/2017/02/13/stateful-processing.html">
-              <div class="hero__blog__cards__card__title">Stateful processing with Apache Beam</div>
-              <div class="hero__blog__cards__card__date">Feb 13, 2017</div>
-            </a>
-            
           </div>
         </div>
       </div>

-- 
To stop receiving notification emails like this one, please contact
"commits@beam.apache.org" <co...@beam.apache.org>.