Posted to commits@beam.apache.org by mb...@apache.org on 2021/12/16 19:09:54 UTC

[beam] branch master updated: [BEAM-11545] State & timer for batched RPC calls pattern (#13643)

This is an automated email from the ASF dual-hosted git repository.

mbae pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/beam.git


The following commit(s) were added to refs/heads/master by this push:
     new 251dd0c  [BEAM-11545] State & timer for batched RPC calls pattern (#13643)
251dd0c is described below

commit 251dd0c0f0eccf83ed7346e385981180d562e1b1
Author: Matthias Baetens <ba...@gmail.com>
AuthorDate: Thu Dec 16 20:08:53 2021 +0100

    [BEAM-11545] State & timer for batched RPC calls pattern (#13643)
    
    * [BEAM-11545] State & timer for batched RPC calls pattern
---
 ...lements-for-efficient-external-service-calls.md | 56 ++++++++++++++++++++++
 .../content/en/documentation/patterns/overview.md  |  3 ++
 .../partials/section-menu/en/documentation.html    |  1 +
 3 files changed, 60 insertions(+)

diff --git a/website/www/site/content/en/documentation/patterns/grouping-elements-for-efficient-external-service-calls.md b/website/www/site/content/en/documentation/patterns/grouping-elements-for-efficient-external-service-calls.md
new file mode 100644
index 0000000..9409e83
--- /dev/null
+++ b/website/www/site/content/en/documentation/patterns/grouping-elements-for-efficient-external-service-calls.md
@@ -0,0 +1,56 @@
+---
+title: "Pattern for grouping elements for efficient external service calls"
+---
+
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Grouping elements for efficient external service calls using the `GroupIntoBatches`-transform
+
+{{< language-switcher java py >}}
+
+Usually, you can author an Apache Beam pipeline with out-of-the-box tools and transforms such as _ParDo_, _Window_ and _GroupByKey_. However, when you want tighter control, you can keep state in an otherwise stateless _DoFn_.
+
+State is kept on a per-key and per-window basis, so the input to your stateful DoFn needs to be keyed (e.g. by the customer identifier if you're tracking clicks from an e-commerce website).
+
+Example use cases include assigning a unique ID to each element, joining streams of data in 'more exotic' ways, and batching up API calls to external services. This section covers the last of these.
+
+Make sure to check the [docs](https://beam.apache.org/documentation/programming-guide/#state-and-timers) for a deeper understanding of state and timers.
+
+The `GroupIntoBatches` transform uses state and timers under the hood to give the user tight control over the following parameters:
+
+- `maxBufferingDuration`: limits how long a batch waits before it is emitted.
+- `batchSize`: limits the number of elements in one batch.
+- `batchSizeBytes`: (in Java only) limits the byte size of one batch (using the input coder to determine element size).
+- `elementByteSize`: (in Java only) limits the byte size of one batch (using a user-defined function to determine element size).
+
+while abstracting away the implementation details from users.
+
+The `withShardedKey()` functionality increases parallelism by spreading one key over multiple threads.
+
+The transforms are used as follows in Java and Python:
+
+{{< highlight java >}}
+input.apply(
+          "Batch Contents",
+          GroupIntoBatches.<String, GenericJson>ofSize(batchSize)
+              .withMaxBufferingDuration(maxBufferingDuration)
+              .withShardedKey())
+{{< /highlight >}}
+
+{{< highlight py >}}
+input | GroupIntoBatches.WithShardedKey(batchSize, maxBufferingDuration)
+{{< /highlight >}}
+
+Applying these transforms outputs batches of elements on a per-key basis, which you can then use to call an external API in bulk rather than once per element, lowering the overhead in your pipeline.
diff --git a/website/www/site/content/en/documentation/patterns/overview.md b/website/www/site/content/en/documentation/patterns/overview.md
index b13c5d4..c5e6084 100644
--- a/website/www/site/content/en/documentation/patterns/overview.md
+++ b/website/www/site/content/en/documentation/patterns/overview.md
@@ -51,6 +51,9 @@ Pipeline patterns demonstrate common Beam use cases. Pipeline patterns are based
 **Cross-language patterns** - Patterns for creating cross-language pipelines
 * [Cross-language patterns](/documentation/patterns/cross-language/#cross-language-transforms)
 
+**State & timers patterns** - Patterns for using state & timers
+* [Grouping elements for efficient external service calls](/documentation/patterns/grouping-elements-for-efficient-external-service-calls/#grouping-elements-for-efficient-external-service-calls-using-the-`GroupIntoBatches`-transform)
+
 ## Contributing a pattern
 
 To contribute a new pipeline pattern, create an issue with the [`pipeline-patterns` label](https://issues.apache.org/jira/browse/BEAM-7449?jql=labels%20%3D%20pipeline-patterns) and add details to the issue description. See [Get started contributing](/contribute/) for more information.
diff --git a/website/www/site/layouts/partials/section-menu/en/documentation.html b/website/www/site/layouts/partials/section-menu/en/documentation.html
index 6d1e664..3f09b0e 100644
--- a/website/www/site/layouts/partials/section-menu/en/documentation.html
+++ b/website/www/site/layouts/partials/section-menu/en/documentation.html
@@ -206,6 +206,7 @@
     <li><a href="/documentation/patterns/schema/">Schema</a></li>
     <li><a href="/documentation/patterns/bqml/">BigQuery ML</a></li>
     <li><a href="/documentation/patterns/cross-language/">Cross-language transforms</a></li>
+    <li><a href="/documentation/patterns/grouping-elements-for-efficient-external-service-calls/">Grouping elements for efficient external service calls</a></li>
   </ul>
 </li>
 <li class="section-nav-item--collapsible">