Posted to commits@beam.apache.org by gi...@apache.org on 2021/10/26 00:03:32 UTC

[beam] branch asf-site updated: Publishing website 2021/10/26 00:02:53 at commit fd4eb15

This is an automated email from the ASF dual-hosted git repository.

git-site-role pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/beam.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new 0f45b27  Publishing website 2021/10/26 00:02:53 at commit fd4eb15
0f45b27 is described below

commit 0f45b27f189caa801c02b0865af71b944b1425d5
Author: jenkins <bu...@apache.org>
AuthorDate: Tue Oct 26 00:02:54 2021 +0000

    Publishing website 2021/10/26 00:02:53 at commit fd4eb15
---
 .../documentation/basics/index.html                | 113 ++++++++----
 website/generated-content/documentation/index.xml  | 203 ++++++++++++++++-----
 .../documentation/programming-guide/index.html     |  16 +-
 website/generated-content/images/aggregation.png   | Bin 0 -> 14065 bytes
 website/generated-content/sitemap.xml              |   2 +-
 5 files changed, 253 insertions(+), 81 deletions(-)

diff --git a/website/generated-content/documentation/basics/index.html b/website/generated-content/documentation/basics/index.html
index 3de8287..14a033f 100644
--- a/website/generated-content/documentation/basics/index.html
+++ b/website/generated-content/documentation/basics/index.html
@@ -18,21 +18,23 @@
 function addPlaceholder(){$('input:text').attr('placeholder',"What are you looking for?");}
 function endSearch(){var search=document.querySelector(".searchBar");search.classList.add("disappear");var icons=document.querySelector("#iconsBar");icons.classList.remove("disappear");}
 function blockScroll(){$("body").toggleClass("fixedPosition");}
-function openMenu(){addPlaceholder();blockScroll();}</script><div class="clearfix container-main-content"><div class="section-nav closed" data-offset-top=90 data-offset-bottom=500><span class="section-nav-back glyphicon glyphicon-menu-left"></span><nav><ul class=section-nav-list data-section-nav><li><span class=section-nav-list-main-title>Documentation</span></li><li><a href=/documentation>Using the Documentation</a></li><li class=section-nav-item--collapsible><span class=section-nav-lis [...]
-of operations. You want to integrate it with the Beam ecosystem to get access
-to other languages, great event time processing, and a library of connectors.
-You need to know the core vocabulary:</p><ul><li><a href=#pipeline><em>Pipeline</em></a> - A pipeline is a user-constructed graph of
+function openMenu(){addPlaceholder();blockScroll();}</script><div class="clearfix container-main-content"><div class="section-nav closed" data-offset-top=90 data-offset-bottom=500><span class="section-nav-back glyphicon glyphicon-menu-left"></span><nav><ul class=section-nav-list data-section-nav><li><span class=section-nav-list-main-title>Documentation</span></li><li><a href=/documentation>Using the Documentation</a></li><li class=section-nav-item--collapsible><span class=section-nav-lis [...]
+data-parallel processing pipelines. To get started with Beam, you&rsquo;ll need to
+understand an important set of core concepts:</p><ul><li><a href=#pipeline><em>Pipeline</em></a> - A pipeline is a user-constructed graph of
 transformations that defines the desired data processing operations.</li><li><a href=#pcollection><em>PCollection</em></a> - A <code>PCollection</code> is a data set or data
 stream. The data that a pipeline processes is part of a PCollection.</li><li><a href=#ptransform><em>PTransform</em></a> - A <code>PTransform</code> (or <em>transform</em>) represents a
 data processing operation, or a step, in your pipeline. A transform is
 applied to zero or more <code>PCollection</code> objects, and produces zero or more
-<code>PCollection</code> objects.</li><li><em>SDK</em> - A language-specific library for pipeline authors (we often call them
-&ldquo;users&rdquo; even though we have many kinds of users) to build transforms,
-construct their pipelines and submit them to a runner</li><li><em>Runner</em> - You are going to write a piece of software called a runner that
-takes a Beam pipeline and executes it using the capabilities of your data
-processing engine.</li></ul><p>These concepts may be very similar to your processing engine&rsquo;s concepts. Since
-Beam&rsquo;s design is for cross-language operation and reusable libraries of
-transforms, there are some special features worth highlighting.</p><h3 id=pipeline>Pipeline</h3><p>A Beam pipeline is a graph (specifically, a
+<code>PCollection</code> objects.</li><li><a href=#aggregation><em>Aggregation</em></a> - Aggregation is computing a value from
+multiple (1 or more) input elements.</li><li><a href=#user-defined-function-udf><em>User-defined function (UDF)</em></a> - Some Beam
+operations allow you to run user-defined code as a way to configure the
+transform.</li><li><a href=#schema><em>Schema</em></a> - A schema is a language-independent type definition for
+a <code>PCollection</code>. The schema for a <code>PCollection</code> defines elements of that
+<code>PCollection</code> as an ordered list of named fields.</li><li><a href=/documentation/sdks/java/><em>SDK</em></a> - A language-specific library that lets
+pipeline authors build transforms, construct their pipelines, and submit
+them to a runner.</li><li><a href=#runner><em>Runner</em></a> - A runner runs a Beam pipeline using the capabilities of
+your chosen data processing engine.</li></ul><p>The following sections cover these concepts in more detail and provide links to
+additional documentation.</p><h3 id=pipeline>Pipeline</h3><p>A Beam pipeline is a graph (specifically, a
 <a href=https://en.wikipedia.org/wiki/Directed_acyclic_graph>directed acyclic graph</a>)
 of all the data and computations in your data processing task. This includes
 reading input data, transforming that data, and writing output data. A pipeline
@@ -112,26 +114,75 @@ frequently used, but there are a few common key formats (such as key-value pairs
 and timestamps) so the runner can understand them.</p><h4 id=windowing-strategy>Windowing strategy</h4><p>Every <code>PCollection</code> has a windowing strategy, which is a specification of
 essential information for grouping and triggering operations. The <code>Window</code>
 transform sets up the windowing strategy, and the <code>GroupByKey</code> transform has
-behavior that is governed by the windowing strategy.</p><br><p>For more information about PCollections, see the following page:</p><ul><li><a href=/documentation/programming-guide/#pcollections>Beam Programming Guide: PCollections</a></li></ul><h3 id=user-defined-functions-udfs>User-Defined Functions (UDFs)</h3><p>Beam has seven varieties of user-defined function (UDF). A Beam pipeline
-may contain UDFs written in a language other than your runner, or even multiple
-languages in the same pipeline (see the <a href=#the-runner-api>Runner API</a>) so the
-definitions are language-independent (see the <a href=#the-fn-api>Fn API</a>).</p><p>The UDFs of Beam are:</p><ul><li><em>DoFn</em> - per-element processing function (used in ParDo)</li><li><em>WindowFn</em> - places elements in windows and merges windows (used in Window
-and GroupByKey)</li><li><em>Source</em> - emits data read from external sources, including initial and
-dynamic splitting for parallelism (used in Read)</li><li><em>ViewFn</em> - adapts a materialized PCollection to a particular interface (used
-in side inputs)</li><li><em>WindowMappingFn</em> - maps one element&rsquo;s window to another, and specifies
-bounds on how far in the past the result window will be (used in side
-inputs)</li><li><em>CombineFn</em> - associative and commutative aggregation (used in Combine and
-state)</li><li><em>Coder</em> - encodes user data; some coders have standard formats and are not really UDFs</li></ul><p>The various types of user-defined functions will be described further alongside
-the <a href=#ptransforms><em>PTransforms</em></a> that use them.</p><h3 id=runner>Runner</h3><p>The term &ldquo;runner&rdquo; is used for a couple of things. It generally refers to the
-software that takes a Beam pipeline and executes it somehow. Often, this is the
-translation code that you write. It usually also includes some customized
-operators for your data processing engine, and is sometimes used to refer to
-the full stack.</p><p>A runner has just a single method <code>run(Pipeline)</code>. From here on, I will often
-use code font for proper nouns in our APIs, whether or not the identifiers
-match across all SDKs.</p><p>The <code>run(Pipeline)</code> method should be asynchronous and results in a
-PipelineResult which generally will be a job descriptor for your data
-processing engine, providing methods for checking its status, canceling it, and
-waiting for it to terminate.</p><div class=feedback><p class=update>Last updated on 2021/10/21</p><h3>Have you found everything you were looking for?</h3><p class=description>Was it all useful and clear? Is there anything that you would like to change? Let us know!</p><button class=load-button><a href="mailto:dev@beam.apache.org?subject=Beam Website Feedback">SEND FEEDBACK</a></button></div></div></div><footer class=footer><div class=footer__contained><div class=footer__cols><div class=" [...]
+behavior that is governed by the windowing strategy.</p><br><p>For more information about PCollections, see the following page:</p><ul><li><a href=/documentation/programming-guide/#pcollections>Beam Programming Guide: PCollections</a></li></ul><h3 id=aggregation>Aggregation</h3><p>Aggregation is computing a value from multiple (1 or more) input elements. In
+Beam, the primary computational pattern for aggregation is to group all elements
+with a common key and window, then combine each group of elements using an
+associative and commutative operation. This is similar to the &ldquo;Reduce&rdquo; operation
+in the <a href=https://en.wikipedia.org/wiki/MapReduce>MapReduce</a> model, though it is
+enhanced to work with unbounded input streams as well as bounded data sets.</p><img src=/images/aggregation.png alt="Aggregation of elements." width=120px><p><em>Figure 1: Aggregation of elements. Elements with the same color represent those
+with a common key and window.</em></p><p>Some simple aggregation transforms include <code>Count</code> (computes the count of all
+elements in the aggregation), <code>Max</code> (computes the maximum element in the
+aggregation), and <code>Sum</code> (computes the sum of all elements in the aggregation).</p><p>When elements are grouped and emitted as a bag, the aggregation is known as
+<code>GroupByKey</code> (the associative/commutative operation is bag union). In this case,
+the output is no smaller than the input. Often, you will apply an operation such
+as summation, called a <code>CombineFn</code>, in which the output is significantly smaller
+than the input. In this case the aggregation is called <code>CombinePerKey</code>.</p><p>In a real application, you might have millions of keys and/or windows; that is
+why this is still an &ldquo;embarrassingly parallel&rdquo; computational pattern. In those
+cases where you have fewer keys, you can add parallelism by adding a
+supplementary key, splitting each of your problem&rsquo;s natural keys into many
+sub-keys. After these sub-keys are aggregated, the results can be further
+combined into a result for the original natural key for your problem. The
+associativity of your aggregation function ensures that this yields the same
+answer, but with more parallelism.</p><p>When your input is unbounded, the computational pattern of grouping elements by
+key and window is roughly the same, but governing when and how to emit the
+results of aggregation involves three concepts:</p><ul><li>Windowing, which partitions your input into bounded subsets that can be
+complete.</li><li>Watermarks, which estimate the completeness of your input.</li><li>Triggers, which govern when and how to emit aggregated results.</li></ul><p>For more information about available aggregation transforms, see the following
+pages:</p><ul><li><a href=/documentation/programming-guide/#core-beam-transforms>Beam Programming Guide: Core Beam transforms</a></li><li>Beam Transform catalog
+(<a href=/documentation/transforms/java/overview/#aggregation>Java</a>,
+<a href=/documentation/transforms/python/overview/#aggregation>Python</a>)</li></ul><h3 id=user-defined-function-udf>User-defined function (UDF)</h3><p>Some Beam operations allow you to run user-defined code as a way to configure
+the transform. For example, when using <code>ParDo</code>, user-defined code specifies what
+operation to apply to every element. For <code>Combine</code>, it specifies how values
+should be combined. By using <a href=/documentation/patterns/cross-language/>cross-language transforms</a>,
+a Beam pipeline can contain UDFs written in a different language, or even
+multiple languages in the same pipeline.</p><p>Beam has several varieties of UDFs:</p><ul><li><a href=/programming-guide/#pardo><em>DoFn</em></a> - per-element processing function (used
+in <code>ParDo</code>)</li><li><a href=/programming-guide/#setting-your-pcollections-windowing-function><em>WindowFn</em></a> -
+places elements in windows and merges windows (used in <code>Window</code> and
+<code>GroupByKey</code>)</li><li><a href=/documentation/programming-guide/#side-inputs><em>ViewFn</em></a> - adapts a
+materialized <code>PCollection</code> to a particular interface (used in side inputs)</li><li><a href=/documentation/programming-guide/#side-inputs-windowing><em>WindowMappingFn</em></a> -
+maps one element&rsquo;s window to another, and specifies bounds on how far in the
+past the result window will be (used in side inputs)</li><li><a href=/documentation/programming-guide/#combine><em>CombineFn</em></a> - associative and
+commutative aggregation (used in <code>Combine</code> and state)</li><li><a href=/documentation/programming-guide/#data-encoding-and-type-safety><em>Coder</em></a> -
+encodes user data; some coders have standard formats and are not really UDFs</li></ul><p>Each language SDK has its own idiomatic way of expressing the user-defined
+functions in Beam, but there are common requirements. When you build user code
+for a Beam transform, you should keep in mind the distributed nature of
+execution. For example, there might be many copies of your function running on a
+lot of different machines in parallel, and those copies function independently,
+without communicating or sharing state with any of the other copies. Each copy
+of your user code function might be retried or run multiple times, depending on
+the pipeline runner and the processing backend that you choose for your
+pipeline. Beam also supports stateful processing through the
+<a href=/blog/stateful-processing/>stateful processing API</a>.</p><p>For more information about user-defined functions, see the following pages:</p><ul><li><a href=/documentation/programming-guide/#requirements-for-writing-user-code-for-beam-transforms>Requirements for writing user code for Beam transforms</a></li><li><a href=/documentation/programming-guide/#pardo>Beam Programming Guide: ParDo</a></li><li><a href=/programming-guide/#setting-your-pcollections-windowing-function>Beam Pro [...]
+schema for a <code>PCollection</code> defines elements of that <code>PCollection</code> as an ordered
+list of named fields. Each field has a name, a type, and possibly a set of user
+options.</p><p>In many cases, the element type in a <code>PCollection</code> has a structure that can be
+introspected. Some examples are JSON, Protocol Buffer, Avro, and database row
+objects. All of these formats can be converted to Beam Schemas. Even within an
+SDK pipeline, simple Java POJOs (or equivalent structures in other languages)
+are often used as intermediate types, and these also have a clear structure that
+can be inferred by inspecting the class. By understanding the structure of a
+pipeline’s records, we can provide much more concise APIs for data processing.</p><p>Beam provides a collection of transforms that operate natively on schemas. For
+example, <a href=/documentation/dsls/sql/overview/>Beam SQL</a> is a common transform
+that operates on schemas. These transforms allow selections and aggregations in
+terms of named schema fields. Another advantage of schemas is that they allow
+referencing of element fields by name. Beam provides a selection syntax for
+referencing fields, including nested and repeated fields.</p><p>For more information about schemas, see the following pages:</p><ul><li><a href=/documentation/programming-guide/#schemas>Beam Programming Guide: Schemas</a></li><li><a href=/documentation/patterns/schema/>Schema Patterns</a></li></ul><h3 id=runner>Runner</h3><p>A Beam runner runs a Beam pipeline on a specific platform. Most runners are
+translators or adapters to massively parallel big data processing systems, such
+as Apache Flink, Apache Spark, Google Cloud Dataflow, and more. For example, the
+Flink runner translates a Beam pipeline into a Flink job. The Direct Runner runs
+pipelines locally so you can test, debug, and validate that your pipeline
+adheres to the Apache Beam model as closely as possible.</p><p>For an up-to-date list of Beam runners and which features of the Apache Beam
+model they support, see the runner
+<a href=/documentation/runners/capability-matrix/>capability matrix</a>.</p><p>For more information about runners, see the following pages:</p><ul><li><a href=/documentation/#choosing-a-runner>Choosing a Runner</a></li><li><a href=/documentation/runners/capability-matrix/>Beam Capability Matrix</a></li></ul><div class=feedback><p class=update>Last updated on 2021/10/25</p><h3>Have you found everything you were looking for?</h3><p class=description>Was it all useful and clear? Is there an [...]
 <a href=http://www.apache.org>The Apache Software Foundation</a>
 | <a href=/privacy_policy>Privacy Policy</a>
 | <a href=/feed.xml>RSS Feed</a><br><br>Apache Beam, Apache, Beam, the Beam logo, and the Apache feather logo are either registered trademarks or trademarks of The Apache Software Foundation. All other products or name brands are trademarks of their respective holders, including The Apache Software Foundation.</div></div></div></div></footer></body></html>
\ No newline at end of file
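[Editor's note: the basics page added in this commit distinguishes `GroupByKey` (emit each key's elements as a bag, so the output is no smaller than the input) from `CombinePerKey` (reduce each bag with an associative, commutative `CombineFn`). A stdlib-only Python sketch of those semantics — a toy emulation, not the Beam SDK API — is:]

```python
from collections import defaultdict

def group_by_key(pairs):
    """Toy GroupByKey: collect all values for each key into a bag (list).
    The associative/commutative operation here is bag union, so the
    output is no smaller than the input."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

def combine_per_key(pairs, combine_fn):
    """Toy CombinePerKey: reduce each key's bag with an associative,
    commutative combine_fn, yielding one value per key (an output
    significantly smaller than the input)."""
    return {key: combine_fn(values)
            for key, values in group_by_key(pairs).items()}

pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]
print(group_by_key(pairs))          # {'a': [1, 3], 'b': [2, 4]}
print(combine_per_key(pairs, sum))  # {'a': 4, 'b': 6}
```

In the real SDKs the grouping is also scoped per window and the runner parallelizes it across workers; this sketch only shows the per-key shape of the two aggregations.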
diff --git a/website/generated-content/documentation/index.xml b/website/generated-content/documentation/index.xml
index 48ba7f5..69e5ee9 100644
--- a/website/generated-content/documentation/index.xml
+++ b/website/generated-content/documentation/index.xml
@@ -3180,10 +3180,9 @@ See the License for the specific language governing permissions and
 limitations under the License.
 -->
 &lt;h1 id="basics-of-the-beam-model">Basics of the Beam model&lt;/h1>
-&lt;p>Suppose you have a data processing engine that can pretty easily process graphs
-of operations. You want to integrate it with the Beam ecosystem to get access
-to other languages, great event time processing, and a library of connectors.
-You need to know the core vocabulary:&lt;/p>
+&lt;p>Apache Beam is a unified model for defining both batch and streaming
+data-parallel processing pipelines. To get started with Beam, you&amp;rsquo;ll need to
+understand an important set of core concepts:&lt;/p>
 &lt;ul>
 &lt;li>&lt;a href="#pipeline">&lt;em>Pipeline&lt;/em>&lt;/a> - A pipeline is a user-constructed graph of
 transformations that defines the desired data processing operations.&lt;/li>
@@ -3193,16 +3192,22 @@ stream. The data that a pipeline processes is part of a PCollection.&lt;/li>
 data processing operation, or a step, in your pipeline. A transform is
 applied to zero or more &lt;code>PCollection&lt;/code> objects, and produces zero or more
 &lt;code>PCollection&lt;/code> objects.&lt;/li>
-&lt;li>&lt;em>SDK&lt;/em> - A language-specific library for pipeline authors (we often call them
-&amp;ldquo;users&amp;rdquo; even though we have many kinds of users) to build transforms,
-construct their pipelines and submit them to a runner&lt;/li>
-&lt;li>&lt;em>Runner&lt;/em> - You are going to write a piece of software called a runner that
-takes a Beam pipeline and executes it using the capabilities of your data
-processing engine.&lt;/li>
+&lt;li>&lt;a href="#aggregation">&lt;em>Aggregation&lt;/em>&lt;/a> - Aggregation is computing a value from
+multiple (1 or more) input elements.&lt;/li>
+&lt;li>&lt;a href="#user-defined-function-udf">&lt;em>User-defined function (UDF)&lt;/em>&lt;/a> - Some Beam
+operations allow you to run user-defined code as a way to configure the
+transform.&lt;/li>
+&lt;li>&lt;a href="#schema">&lt;em>Schema&lt;/em>&lt;/a> - A schema is a language-independent type definition for
+a &lt;code>PCollection&lt;/code>. The schema for a &lt;code>PCollection&lt;/code> defines elements of that
+&lt;code>PCollection&lt;/code> as an ordered list of named fields.&lt;/li>
+&lt;li>&lt;a href="/documentation/sdks/java/">&lt;em>SDK&lt;/em>&lt;/a> - A language-specific library that lets
+pipeline authors build transforms, construct their pipelines, and submit
+them to a runner.&lt;/li>
+&lt;li>&lt;a href="#runner">&lt;em>Runner&lt;/em>&lt;/a> - A runner runs a Beam pipeline using the capabilities of
+your chosen data processing engine.&lt;/li>
 &lt;/ul>
-&lt;p>These concepts may be very similar to your processing engine&amp;rsquo;s concepts. Since
-Beam&amp;rsquo;s design is for cross-language operation and reusable libraries of
-transforms, there are some special features worth highlighting.&lt;/p>
+&lt;p>The following sections cover these concepts in more detail and provide links to
+additional documentation.&lt;/p>
 &lt;h3 id="pipeline">Pipeline&lt;/h3>
 &lt;p>A Beam pipeline is a graph (specifically, a
 &lt;a href="https://en.wikipedia.org/wiki/Directed_acyclic_graph">directed acyclic graph&lt;/a>)
@@ -3344,42 +3349,130 @@ behavior that is governed by the windowing strategy.&lt;/p>
 &lt;ul>
 &lt;li>&lt;a href="/documentation/programming-guide/#pcollections">Beam Programming Guide: PCollections&lt;/a>&lt;/li>
 &lt;/ul>
-&lt;h3 id="user-defined-functions-udfs">User-Defined Functions (UDFs)&lt;/h3>
-&lt;p>Beam has seven varieties of user-defined function (UDF). A Beam pipeline
-may contain UDFs written in a language other than your runner, or even multiple
-languages in the same pipeline (see the &lt;a href="#the-runner-api">Runner API&lt;/a>) so the
-definitions are language-independent (see the &lt;a href="#the-fn-api">Fn API&lt;/a>).&lt;/p>
-&lt;p>The UDFs of Beam are:&lt;/p>
+&lt;h3 id="aggregation">Aggregation&lt;/h3>
+&lt;p>Aggregation is computing a value from multiple (1 or more) input elements. In
+Beam, the primary computational pattern for aggregation is to group all elements
+with a common key and window, then combine each group of elements using an
+associative and commutative operation. This is similar to the &amp;ldquo;Reduce&amp;rdquo; operation
+in the &lt;a href="https://en.wikipedia.org/wiki/MapReduce">MapReduce&lt;/a> model, though it is
+enhanced to work with unbounded input streams as well as bounded data sets.&lt;/p>
+&lt;img src="/images/aggregation.png" alt="Aggregation of elements." width="120px">
+&lt;p>&lt;em>Figure 1: Aggregation of elements. Elements with the same color represent those
+with a common key and window.&lt;/em>&lt;/p>
+&lt;p>Some simple aggregation transforms include &lt;code>Count&lt;/code> (computes the count of all
+elements in the aggregation), &lt;code>Max&lt;/code> (computes the maximum element in the
+aggregation), and &lt;code>Sum&lt;/code> (computes the sum of all elements in the aggregation).&lt;/p>
+&lt;p>When elements are grouped and emitted as a bag, the aggregation is known as
+&lt;code>GroupByKey&lt;/code> (the associative/commutative operation is bag union). In this case,
+the output is no smaller than the input. Often, you will apply an operation such
+as summation, called a &lt;code>CombineFn&lt;/code>, in which the output is significantly smaller
+than the input. In this case the aggregation is called &lt;code>CombinePerKey&lt;/code>.&lt;/p>
+&lt;p>In a real application, you might have millions of keys and/or windows; that is
+why this is still an &amp;ldquo;embarrassingly parallel&amp;rdquo; computational pattern. In those
+cases where you have fewer keys, you can add parallelism by adding a
+supplementary key, splitting each of your problem&amp;rsquo;s natural keys into many
+sub-keys. After these sub-keys are aggregated, the results can be further
+combined into a result for the original natural key for your problem. The
+associativity of your aggregation function ensures that this yields the same
+answer, but with more parallelism.&lt;/p>
+&lt;p>When your input is unbounded, the computational pattern of grouping elements by
+key and window is roughly the same, but governing when and how to emit the
+results of aggregation involves three concepts:&lt;/p>
 &lt;ul>
-&lt;li>&lt;em>DoFn&lt;/em> - per-element processing function (used in ParDo)&lt;/li>
-&lt;li>&lt;em>WindowFn&lt;/em> - places elements in windows and merges windows (used in Window
-and GroupByKey)&lt;/li>
-&lt;li>&lt;em>Source&lt;/em> - emits data read from external sources, including initial and
-dynamic splitting for parallelism (used in Read)&lt;/li>
-&lt;li>&lt;em>ViewFn&lt;/em> - adapts a materialized PCollection to a particular interface (used
-in side inputs)&lt;/li>
-&lt;li>&lt;em>WindowMappingFn&lt;/em> - maps one element&amp;rsquo;s window to another, and specifies
-bounds on how far in the past the result window will be (used in side
-inputs)&lt;/li>
-&lt;li>&lt;em>CombineFn&lt;/em> - associative and commutative aggregation (used in Combine and
-state)&lt;/li>
-&lt;li>&lt;em>Coder&lt;/em> - encodes user data; some coders have standard formats and are not really UDFs&lt;/li>
+&lt;li>Windowing, which partitions your input into bounded subsets that can be
+complete.&lt;/li>
+&lt;li>Watermarks, which estimate the completeness of your input.&lt;/li>
+&lt;li>Triggers, which govern when and how to emit aggregated results.&lt;/li>
+&lt;/ul>
+&lt;p>For more information about available aggregation transforms, see the following
+pages:&lt;/p>
+&lt;ul>
+&lt;li>&lt;a href="/documentation/programming-guide/#core-beam-transforms">Beam Programming Guide: Core Beam transforms&lt;/a>&lt;/li>
+&lt;li>Beam Transform catalog
+(&lt;a href="/documentation/transforms/java/overview/#aggregation">Java&lt;/a>,
+&lt;a href="/documentation/transforms/python/overview/#aggregation">Python&lt;/a>)&lt;/li>
+&lt;/ul>
+&lt;h3 id="user-defined-function-udf">User-defined function (UDF)&lt;/h3>
+&lt;p>Some Beam operations allow you to run user-defined code as a way to configure
+the transform. For example, when using &lt;code>ParDo&lt;/code>, user-defined code specifies what
+operation to apply to every element. For &lt;code>Combine&lt;/code>, it specifies how values
+should be combined. By using &lt;a href="/documentation/patterns/cross-language/">cross-language transforms&lt;/a>,
+a Beam pipeline can contain UDFs written in a different language, or even
+multiple languages in the same pipeline.&lt;/p>
+&lt;p>Beam has several varieties of UDFs:&lt;/p>
+&lt;ul>
+&lt;li>&lt;a href="/programming-guide/#pardo">&lt;em>DoFn&lt;/em>&lt;/a> - per-element processing function (used
+in &lt;code>ParDo&lt;/code>)&lt;/li>
+&lt;li>&lt;a href="/programming-guide/#setting-your-pcollections-windowing-function">&lt;em>WindowFn&lt;/em>&lt;/a> -
+places elements in windows and merges windows (used in &lt;code>Window&lt;/code> and
+&lt;code>GroupByKey&lt;/code>)&lt;/li>
+&lt;li>&lt;a href="/documentation/programming-guide/#side-inputs">&lt;em>ViewFn&lt;/em>&lt;/a> - adapts a
+materialized &lt;code>PCollection&lt;/code> to a particular interface (used in side inputs)&lt;/li>
+&lt;li>&lt;a href="/documentation/programming-guide/#side-inputs-windowing">&lt;em>WindowMappingFn&lt;/em>&lt;/a> -
+maps one element&amp;rsquo;s window to another, and specifies bounds on how far in the
+past the result window will be (used in side inputs)&lt;/li>
+&lt;li>&lt;a href="/documentation/programming-guide/#combine">&lt;em>CombineFn&lt;/em>&lt;/a> - associative and
+commutative aggregation (used in &lt;code>Combine&lt;/code> and state)&lt;/li>
+&lt;li>&lt;a href="/documentation/programming-guide/#data-encoding-and-type-safety">&lt;em>Coder&lt;/em>&lt;/a> -
+encodes user data; some coders have standard formats and are not really UDFs&lt;/li>
+&lt;/ul>
+&lt;p>Each language SDK has its own idiomatic way of expressing the user-defined
+functions in Beam, but there are common requirements. When you build user code
+for a Beam transform, you should keep in mind the distributed nature of
+execution. For example, there might be many copies of your function running on a
+lot of different machines in parallel, and those copies function independently,
+without communicating or sharing state with any of the other copies. Each copy
+of your user code function might be retried or run multiple times, depending on
+the pipeline runner and the processing backend that you choose for your
+pipeline. Beam also supports stateful processing through the
+&lt;a href="/blog/stateful-processing/">stateful processing API&lt;/a>.&lt;/p>
+&lt;p>For more information about user-defined functions, see the following pages:&lt;/p>
+&lt;ul>
+&lt;li>&lt;a href="/documentation/programming-guide/#requirements-for-writing-user-code-for-beam-transforms">Requirements for writing user code for Beam transforms&lt;/a>&lt;/li>
+&lt;li>&lt;a href="/documentation/programming-guide/#pardo">Beam Programming Guide: ParDo&lt;/a>&lt;/li>
+&lt;li>&lt;a href="/programming-guide/#setting-your-pcollections-windowing-function">Beam Programming Guide: WindowFn&lt;/a>&lt;/li>
+&lt;li>&lt;a href="/documentation/programming-guide/#combine">Beam Programming Guide: CombineFn&lt;/a>&lt;/li>
+&lt;li>&lt;a href="/documentation/programming-guide/#data-encoding-and-type-safety">Beam Programming Guide: Coder&lt;/a>&lt;/li>
+&lt;li>&lt;a href="/documentation/programming-guide/#side-inputs">Beam Programming Guide: Side inputs&lt;/a>&lt;/li>
+&lt;/ul>
+&lt;h3 id="schema">Schema&lt;/h3>
+&lt;p>A schema is a language-independent type definition for a &lt;code>PCollection&lt;/code>. The
+schema for a &lt;code>PCollection&lt;/code> defines elements of that &lt;code>PCollection&lt;/code> as an ordered
+list of named fields. Each field has a name, a type, and possibly a set of user
+options.&lt;/p>
+&lt;p>In many cases, the element type in a &lt;code>PCollection&lt;/code> has a structure that can be
+introspected. Some examples are JSON, Protocol Buffer, Avro, and database row
+objects. All of these formats can be converted to Beam Schemas. Even within an
+SDK pipeline, simple Java POJOs (or equivalent structures in other languages)
+are often used as intermediate types, and these also have a clear structure that
+can be inferred by inspecting the class. By understanding the structure of a
+pipeline’s records, we can provide much more concise APIs for data processing.&lt;/p>
+&lt;p>Beam provides a collection of transforms that operate natively on schemas. For
+example, &lt;a href="/documentation/dsls/sql/overview/">Beam SQL&lt;/a> is a common transform
+that operates on schemas. These transforms allow selections and aggregations in
+terms of named schema fields. Another advantage of schemas is that they allow
+referencing of element fields by name. Beam provides a selection syntax for
+referencing fields, including nested and repeated fields.&lt;/p>
+&lt;p>For more information about schemas, see the following pages:&lt;/p>
+&lt;ul>
+&lt;li>&lt;a href="/documentation/programming-guide/#schemas">Beam Programming Guide: Schemas&lt;/a>&lt;/li>
+&lt;li>&lt;a href="/documentation/patterns/schema/">Schema Patterns&lt;/a>&lt;/li>
 &lt;/ul>
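The schema text above says a pipeline record's structure "can be inferred by inspecting the class." A minimal, Beam-independent Python sketch of that idea follows; the `Purchase` type and `infer_schema` helper are purely illustrative (the Beam Python SDK performs comparable inference for `typing.NamedTuple` row types, but this is not its API):

```python
# Hypothetical sketch: deriving an ordered list of named, typed fields
# (a schema-like description) by introspecting a structured element type.
from typing import NamedTuple, get_type_hints


class Purchase(NamedTuple):
    user_id: str
    item_id: str
    cost_cents: int


def infer_schema(row_type):
    """Return the element type's fields as ordered (name, type) pairs."""
    return [(name, typ) for name, typ in get_type_hints(row_type).items()]


print(infer_schema(Purchase))
# [('user_id', <class 'str'>), ('item_id', <class 'str'>), ('cost_cents', <class 'int'>)]
```

Because every element of the `PCollection` shares this structure, transforms can refer to `user_id` or `cost_cents` by name rather than by position.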
-&lt;p>The various types of user-defined functions will be described further alongside
-the &lt;a href="#ptransforms">&lt;em>PTransforms&lt;/em>&lt;/a> that use them.&lt;/p>
 &lt;h3 id="runner">Runner&lt;/h3>
-&lt;p>The term &amp;ldquo;runner&amp;rdquo; is used for a couple of things. It generally refers to the
-software that takes a Beam pipeline and executes it somehow. Often, this is the
-translation code that you write. It usually also includes some customized
-operators for your data processing engine, and is sometimes used to refer to
-the full stack.&lt;/p>
-&lt;p>A runner has just a single method &lt;code>run(Pipeline)&lt;/code>. From here on, I will often
-use code font for proper nouns in our APIs, whether or not the identifiers
-match across all SDKs.&lt;/p>
-&lt;p>The &lt;code>run(Pipeline)&lt;/code> method should be asynchronous and results in a
-PipelineResult which generally will be a job descriptor for your data
-processing engine, providing methods for checking its status, canceling it, and
-waiting for it to terminate.&lt;/p></description></item><item><title>Documentation: Beam glossary</title><link>/documentation/glossary/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/documentation/glossary/</guid><description>
+&lt;p>A Beam runner runs a Beam pipeline on a specific platform. Most runners are
+translators or adapters to massively parallel big data processing systems, such
+as Apache Flink, Apache Spark, Google Cloud Dataflow, and more. For example, the
+Flink runner translates a Beam pipeline into a Flink job. The Direct Runner runs
+pipelines locally so you can test, debug, and validate that your pipeline
+adheres to the Apache Beam model as closely as possible.&lt;/p>
+&lt;p>For an up-to-date list of Beam runners and which features of the Apache Beam
+model they support, see the runner
+&lt;a href="/documentation/runners/capability-matrix/">capability matrix&lt;/a>.&lt;/p>
+&lt;p>For more information about runners, see the following pages:&lt;/p>
+&lt;ul>
+&lt;li>&lt;a href="/documentation/#choosing-a-runner">Choosing a Runner&lt;/a>&lt;/li>
+&lt;li>&lt;a href="/documentation/runners/capability-matrix/">Beam Capability Matrix&lt;/a>&lt;/li>
+&lt;/ul></description></item><item><title>Documentation: Beam glossary</title><link>/documentation/glossary/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/documentation/glossary/</guid><description>
 &lt;!--
 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
@@ -4304,6 +4397,11 @@ Depending on the pipeline runner and back-end that you choose, many different
 workers across a cluster may execute instances of your user code in parallel.
 The user code running on each worker generates the output elements that are
 ultimately added to the final output &lt;code>PCollection&lt;/code> that the transform produces.&lt;/p>
+&lt;blockquote>
+&lt;p>Aggregation is an important concept to understand when learning about Beam&amp;rsquo;s
+transforms. For an introduction to aggregation, see the Basics of the Beam
+model &lt;a href="/documentation/basics/#aggregation">Aggregation section&lt;/a>.&lt;/p>
+&lt;/blockquote>
 &lt;p>The Beam SDKs contain a number of different transforms that you can apply to
 your pipeline&amp;rsquo;s &lt;code>PCollection&lt;/code>s. These include general-purpose core transforms,
 such as &lt;a href="#pardo">ParDo&lt;/a> or &lt;a href="#combine">Combine&lt;/a>. There are also pre-written
@@ -5285,6 +5383,19 @@ and max.&lt;/p>
 function. More complex combination operations might require you to create a
 &lt;span class="language-java language-py">subclass of&lt;/span> &lt;code>CombineFn&lt;/code>
 that has an accumulation type distinct from the input/output type.&lt;/p>
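The paragraph above mentions a `CombineFn` whose accumulation type differs from its input/output type. The classic example is a mean: inputs and output are numbers, but the accumulator is a (sum, count) pair. The sketch below is plain Python modeled on the Beam Python `CombineFn` method names, not the actual Beam class:

```python
# Hedged sketch (not the real Beam API): a mean combiner whose accumulator
# type, a (sum, count) tuple, is distinct from its float input/output type.
class MeanCombineFn:
    def create_accumulator(self):
        return (0.0, 0)  # (running sum, element count)

    def add_input(self, acc, value):
        total, count = acc
        return (total + value, count + 1)

    def merge_accumulators(self, accs):
        totals, counts = zip(*accs)
        return (sum(totals), sum(counts))

    def extract_output(self, acc):
        total, count = acc
        return total / count if count else float("nan")


fn = MeanCombineFn()
acc = fn.create_accumulator()
for v in [1.0, 2.0, 6.0]:
    acc = fn.add_input(acc, v)
print(fn.extract_output(acc))  # 3.0
```

Splitting the work into `add_input`, `merge_accumulators`, and `extract_output` is what lets a runner compute partial results on many workers and merge them later.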
+&lt;p>The associativity and commutativity of a &lt;code>CombineFn&lt;/code> allow runners to
+automatically apply some optimizations:&lt;/p>
+&lt;ul>
+&lt;li>&lt;strong>Combiner lifting&lt;/strong>: This is the most significant optimization. Input
+elements are combined per key and window before they are shuffled, so the
+volume of data shuffled might be reduced by many orders of magnitude. Another
+term for this optimization is &amp;ldquo;mapper-side combine.&amp;rdquo;&lt;/li>
+&lt;li>&lt;strong>Incremental combining&lt;/strong>: When you have a &lt;code>CombineFn&lt;/code> that reduces the data
+size significantly, it is useful to combine elements as they emerge from a
+streaming shuffle. This spreads the cost of combining over the time that your
+streaming computation might otherwise be idle. Incremental combining also
+reduces the storage of intermediate accumulators.&lt;/li>
+&lt;/ul>
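The combiner-lifting optimization listed above can be shown with a small, Beam-independent Python sketch (the worker data and helper names are illustrative): each worker folds its elements into one accumulator per key before the shuffle, and associativity plus commutativity guarantee the merged result matches combining everything in one place.

```python
# Illustrative sketch of combiner lifting ("mapper-side combine"), not Beam code.
from collections import defaultdict


def local_precombine(elements):
    """Mapper side: fold one worker's (key, value) pairs into per-key sums."""
    acc = defaultdict(int)
    for key, value in elements:
        acc[key] += value  # the add_input step of a sum-like CombineFn
    return dict(acc)


def merge(accumulators):
    """Reducer side: merging partial sums in any order gives the same result,
    because integer addition is associative and commutative."""
    out = defaultdict(int)
    for acc in accumulators:
        for key, partial in acc.items():
            out[key] += partial
    return dict(out)


worker_1 = [("a", 1), ("b", 2), ("a", 3)]
worker_2 = [("a", 4), ("b", 5)]

# Without lifting, 5 raw elements cross the shuffle; with lifting, only
# 4 per-key accumulators do. Both paths yield the same totals.
lifted = merge([local_precombine(worker_1), local_precombine(worker_2)])
unlifted = merge([local_precombine(worker_1 + worker_2)])
assert lifted == unlifted == {"a": 8, "b": 7}
```

On real workloads with many repeated keys per worker, this pre-combining is what can shrink shuffled data volume by orders of magnitude.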
 &lt;h5 id="simple-combines">4.2.4.1. Simple combinations using simple functions&lt;/h5>
 &lt;p>The following example code shows a simple combine function.&lt;/p>
 &lt;div class='language-java snippet'>
diff --git a/website/generated-content/documentation/programming-guide/index.html b/website/generated-content/documentation/programming-guide/index.html
index 022fc90..1756efa 100644
--- a/website/generated-content/documentation/programming-guide/index.html
+++ b/website/generated-content/documentation/programming-guide/index.html
@@ -317,7 +317,9 @@ to each element of an input <code>PCollection</code> (or more than one <code>PCo
 Depending on the pipeline runner and back-end that you choose, many different
 workers across a cluster may execute instances of your user code in parallel.
 The user code running on each worker generates the output elements that are
-ultimately added to the final output <code>PCollection</code> that the transform produces.</p><p>The Beam SDKs contain a number of different transforms that you can apply to
+ultimately added to the final output <code>PCollection</code> that the transform produces.</p><blockquote><p>Aggregation is an important concept to understand when learning about Beam&rsquo;s
+transforms. For an introduction to aggregation, see the Basics of the Beam
+model <a href=/documentation/basics/#aggregation>Aggregation section</a>.</p></blockquote><p>The Beam SDKs contain a number of different transforms that you can apply to
 your pipeline&rsquo;s <code>PCollection</code>s. These include general-purpose core transforms,
 such as <a href=#pardo>ParDo</a> or <a href=#combine>Combine</a>. There are also pre-written
 <a href=#composite-transforms>composite transforms</a> included in the SDKs, which
@@ -914,7 +916,15 @@ combine functions for common numeric combination operations such as sum, min,
 and max.</p><p>Simple combine operations, such as sums, can usually be implemented as a simple
 function. More complex combination operations might require you to create a
 <span class="language-java language-py">subclass of</span> <code>CombineFn</code>
-that has an accumulation type distinct from the input/output type.</p><h5 id=simple-combines>4.2.4.1. Simple combinations using simple functions</h5><p>The following example code shows a simple combine function.</p><div class="language-java snippet"><div class="notebook-skip code-snippet"><a class=copy type=button data-bs-toggle=tooltip data-bs-placement=bottom title="Copy to clipboard"><img src=/images/copy-icon.svg></a><div class=highlight><pre class=chroma><code class=language-java da [...]
+that has an accumulation type distinct from the input/output type.</p><p>The associativity and commutativity of a <code>CombineFn</code> allow runners to
+automatically apply some optimizations:</p><ul><li><strong>Combiner lifting</strong>: This is the most significant optimization. Input
+elements are combined per key and window before they are shuffled, so the
+volume of data shuffled might be reduced by many orders of magnitude. Another
+term for this optimization is &ldquo;mapper-side combine.&rdquo;</li><li><strong>Incremental combining</strong>: When you have a <code>CombineFn</code> that reduces the data
size significantly, it is useful to combine elements as they emerge from a
streaming shuffle. This spreads the cost of combining over the time that your
streaming computation might otherwise be idle. Incremental combining also
+reduces the storage of intermediate accumulators.</li></ul><h5 id=simple-combines>4.2.4.1. Simple combinations using simple functions</h5><p>The following example code shows a simple combine function.</p><div class="language-java snippet"><div class="notebook-skip code-snippet"><a class=copy type=button data-bs-toggle=tooltip data-bs-placement=bottom title="Copy to clipboard"><img src=/images/copy-icon.svg></a><div class=highlight><pre class=chroma><code class=language-java data-lang=jav [...]
 </span><span class=c1></span><span class=kd>public</span> <span class=kd>static</span> <span class=kd>class</span> <span class=nc>SumInts</span> <span class=kd>implements</span> <span class=n>SerializableFunction</span><span class=o>&lt;</span><span class=n>Iterable</span><span class=o>&lt;</span><span class=n>Integer</span><span class=o>&gt;,</span> <span class=n>Integer</span><span class=o>&gt;</span> <span class=o>{</span>
   <span class=nd>@Override</span>
   <span class=kd>public</span> <span class=n>Integer</span> <span class=nf>apply</span><span class=o>(</span><span class=n>Iterable</span><span class=o>&lt;</span><span class=n>Integer</span><span class=o>&gt;</span> <span class=n>input</span><span class=o>)</span> <span class=o>{</span>
@@ -4245,7 +4255,7 @@ expansionAddr := &#34;localhost:8097&#34;
 outT := beam.UnnamedOutput(typex.New(reflectx.String))
 res := beam.CrossLanguage(s, urn, payload, expansionAddr, beam.UnnamedInput(inputPCol), outT)
    </code></pre></div></div></li><li><p>After the job has been submitted to the Beam runner, shutdown the expansion service by
-terminating the expansion service process.</p></li></ol><h3 id=x-lang-transform-runner-support>13.3. Runner Support</h3><p>Currently, portable runners such as Flink, Spark, and the Direct runner can be used with multi-language pipelines.</p><p>Google Cloud Dataflow supports multi-language pipelines through the Dataflow Runner v2 backend architecture.</p><div class=feedback><p class=update>Last updated on 2021/10/12</p><h3>Have you found everything you were looking for?</h3><p class=descr [...]
+terminating the expansion service process.</p></li></ol><h3 id=x-lang-transform-runner-support>13.3. Runner Support</h3><p>Currently, portable runners such as Flink, Spark, and the Direct runner can be used with multi-language pipelines.</p><p>Google Cloud Dataflow supports multi-language pipelines through the Dataflow Runner v2 backend architecture.</p><div class=feedback><p class=update>Last updated on 2021/10/25</p><h3>Have you found everything you were looking for?</h3><p class=descr [...]
 <a href=http://www.apache.org>The Apache Software Foundation</a>
 | <a href=/privacy_policy>Privacy Policy</a>
 | <a href=/feed.xml>RSS Feed</a><br><br>Apache Beam, Apache, Beam, the Beam logo, and the Apache feather logo are either registered trademarks or trademarks of The Apache Software Foundation. All other products or name brands are trademarks of their respective holders, including The Apache Software Foundation.</div></div></div></div></footer></body></html>
\ No newline at end of file
diff --git a/website/generated-content/images/aggregation.png b/website/generated-content/images/aggregation.png
new file mode 100755
index 0000000..c26cc9f
Binary files /dev/null and b/website/generated-content/images/aggregation.png differ
diff --git a/website/generated-content/sitemap.xml b/website/generated-content/sitemap.xml
index 5fed307..32613c6 100644
--- a/website/generated-content/sitemap.xml
+++ b/website/generated-content/sitemap.xml
@@ -1 +1 @@
-<?xml version="1.0" encoding="utf-8" standalone="yes"?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xhtml="http://www.w3.org/1999/xhtml"><url><loc>/blog/beam-2.33.0/</loc><lastmod>2021-10-11T18:22:03-07:00</lastmod></url><url><loc>/categories/blog/</loc><lastmod>2021-10-11T18:22:03-07:00</lastmod></url><url><loc>/blog/</loc><lastmod>2021-10-11T18:22:03-07:00</lastmod></url><url><loc>/categories/</loc><lastmod>2021-10-11T18:22:03-07:00</lastmod></url><url><loc>/blog/b [...]
\ No newline at end of file
+<?xml version="1.0" encoding="utf-8" standalone="yes"?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xhtml="http://www.w3.org/1999/xhtml"><url><loc>/blog/beam-2.33.0/</loc><lastmod>2021-10-11T18:22:03-07:00</lastmod></url><url><loc>/categories/blog/</loc><lastmod>2021-10-11T18:22:03-07:00</lastmod></url><url><loc>/blog/</loc><lastmod>2021-10-11T18:22:03-07:00</lastmod></url><url><loc>/categories/</loc><lastmod>2021-10-11T18:22:03-07:00</lastmod></url><url><loc>/blog/b [...]
\ No newline at end of file