Posted to commits@beam.apache.org by me...@apache.org on 2018/07/18 21:52:37 UTC

[beam-site] branch asf-site updated (d153419 -> 10148e1)

This is an automated email from the ASF dual-hosted git repository.

mergebot-role pushed a change to branch asf-site
in repository https://gitbox.apache.org/repos/asf/beam-site.git.


    from d153419  Prepare repository for deployment.
     add f570391  [BEAM-2977] Improve unbounded prose in wordcount example
     add a49ee1c  This closes #377
     new 10148e1  Prepare repository for deployment.

The 1 revision listed above as "new" is entirely new to this
repository and will be described in a separate email.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 content/get-started/wordcount-example/index.html | 40 ++++++++++++++----------
 src/get-started/wordcount-example.md             | 32 ++++++++++++-------
 2 files changed, 44 insertions(+), 28 deletions(-)


[beam-site] 01/01: Prepare repository for deployment.

Posted by me...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

mergebot-role pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/beam-site.git

commit 10148e1d402e4c8c31e20f89f9ae1ed72b782387
Author: Mergebot <me...@apache.org>
AuthorDate: Wed Jul 18 21:52:35 2018 +0000

    Prepare repository for deployment.
---
 content/get-started/wordcount-example/index.html | 40 ++++++++++++++----------
 1 file changed, 24 insertions(+), 16 deletions(-)

diff --git a/content/get-started/wordcount-example/index.html b/content/get-started/wordcount-example/index.html
index 7844c32..57b5597 100644
--- a/content/get-started/wordcount-example/index.html
+++ b/content/get-started/wordcount-example/index.html
@@ -199,7 +199,7 @@
   </li>
   <li><a href="#windowedwordcount-example">WindowedWordCount example</a>
     <ul>
-      <li><a href="#unbounded-and-bounded-pipeline-input-modes">Unbounded and bounded pipeline input modes</a></li>
+      <li><a href="#unbounded-and-bounded-datasets">Unbounded and bounded datasets</a></li>
       <li><a href="#adding-timestamps-to-data">Adding timestamps to data</a></li>
       <li><a href="#windowing">Windowing</a></li>
       <li><a href="#reusing-ptransforms-over-windowed-pcollections">Reusing PTransforms over windowed PCollections</a></li>
@@ -207,7 +207,7 @@
   </li>
   <li><a href="#streamingwordcount-example">StreamingWordCount example</a>
     <ul>
-      <li><a href="#reading-an-unbounded-data-set">Reading an unbounded data set</a></li>
+      <li><a href="#reading-an-unbounded-dataset">Reading an unbounded dataset</a></li>
       <li><a href="#writing-unbounded-results">Writing unbounded results</a></li>
     </ul>
   </li>
@@ -259,14 +259,14 @@ limitations under the License.
     </ul>
   </li>
   <li><a href="#windowedwordcount-example" id="markdown-toc-windowedwordcount-example">WindowedWordCount example</a>    <ul>
-      <li><a href="#unbounded-and-bounded-pipeline-input-modes" id="markdown-toc-unbounded-and-bounded-pipeline-input-modes">Unbounded and bounded pipeline input modes</a></li>
+      <li><a href="#unbounded-and-bounded-datasets" id="markdown-toc-unbounded-and-bounded-datasets">Unbounded and bounded datasets</a></li>
       <li><a href="#adding-timestamps-to-data" id="markdown-toc-adding-timestamps-to-data">Adding timestamps to data</a></li>
       <li><a href="#windowing" id="markdown-toc-windowing">Windowing</a></li>
       <li><a href="#reusing-ptransforms-over-windowed-pcollections" id="markdown-toc-reusing-ptransforms-over-windowed-pcollections">Reusing PTransforms over windowed PCollections</a></li>
     </ul>
   </li>
   <li><a href="#streamingwordcount-example" id="markdown-toc-streamingwordcount-example">StreamingWordCount example</a>    <ul>
-      <li><a href="#reading-an-unbounded-data-set" id="markdown-toc-reading-an-unbounded-data-set">Reading an unbounded data set</a></li>
+      <li><a href="#reading-an-unbounded-dataset" id="markdown-toc-reading-an-unbounded-dataset">Reading an unbounded dataset</a></li>
       <li><a href="#writing-unbounded-results" id="markdown-toc-writing-unbounded-results">Writing unbounded results</a></li>
     </ul>
   </li>
@@ -414,7 +414,7 @@ nested transforms (which is a <a href="/documentation/programming-guide#composit
 <p>Each transform takes some kind of input data and produces some output data. The
 input and output data is often represented by the SDK class <code class="highlighter-rouge">PCollection</code>.
 <code class="highlighter-rouge">PCollection</code> is a special class, provided by the Beam SDK, that you can use to
-represent a data set of virtually any size, including unbounded data sets.</p>
+represent a dataset of virtually any size, including unbounded datasets.</p>
 
 <p><img src="/images/wordcount-pipeline.png" alt="The MinimalWordCount pipeline data flow." width="800px" /></p>
 
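As a minimal sketch of the PCollection prose above (Python SDK assumed;
the transform labels and sample lines are illustrative, not from the
example itself):

    # Sketch: a PCollection can represent a small in-memory (bounded)
    # dataset just as well as a large or unbounded one.
    import apache_beam as beam

    with beam.Pipeline() as p:
        lines = p | 'Create' >> beam.Create(['to be', 'or not to be'])
        words = lines | 'Split' >> beam.FlatMap(str.split)
        words | 'Print' >> beam.Map(print)
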
@@ -1173,12 +1173,11 @@ or DEBUG significantly increases the amount of logs output.</p>
 <p class="language-java language-py"><span class="language-java"><code class="highlighter-rouge">PAssert</code></span><span class="language-py"><code class="highlighter-rouge">assert_that</code></span>
 is a set of convenient PTransforms in the style of Hamcrest’s collection
 matchers that can be used when writing pipeline level tests to validate the
-contents of PCollections. Asserts are best used in unit tests with small data
-sets.</p>
+contents of PCollections. Asserts are best used in unit tests with small datasets.</p>
 
 <p class="language-go">The <code class="highlighter-rouge">passert</code> package contains convenient PTransforms that can be used when
 writing pipeline level tests to validate the contents of PCollections. Asserts
-are best used in unit tests with small data sets.</p>
+are best used in unit tests with small datasets.</p>
 
 <p class="language-java">The following example verifies that the set of filtered words matches our
 expected counts. The assert does not produce any output, and the pipeline only
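As a minimal sketch of the assert_that pattern described above (Python
SDK assumed; the sample elements and expected counts are illustrative):

    # Sketch: validate PCollection contents in a unit test over a small
    # dataset; assert_that fails the pipeline if the contents differ.
    import apache_beam as beam
    from apache_beam.testing.test_pipeline import TestPipeline
    from apache_beam.testing.util import assert_that, equal_to

    with TestPipeline() as p:
        counts = (p
                  | beam.Create(['hi', 'hi', 'sue'])
                  | beam.combiners.Count.PerElement())
        assert_that(counts, equal_to([('hi', 2), ('sue', 1)]))
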
@@ -1223,7 +1222,7 @@ examples did, but introduces several advanced concepts.</p>
 <p><strong>New Concepts:</strong></p>
 
 <ul>
-  <li>Unbounded and bounded pipeline input modes</li>
+  <li>Unbounded and bounded datasets</li>
   <li>Adding timestamps to data</li>
   <li>Windowing</li>
   <li>Reusing PTransforms over windowed PCollections</li>
@@ -1360,12 +1359,21 @@ $ windowed_wordcount --input gs://dataflow-samples/shakespeare/kinglear.txt \
 <p>To view the full code in Go, see
 <strong><a href="https://github.com/apache/beam/blob/master/sdks/go/examples/windowed_wordcount/windowed_wordcount.go">windowed_wordcount.go</a>.</strong></p>
 
-<h3 id="unbounded-and-bounded-pipeline-input-modes">Unbounded and bounded pipeline input modes</h3>
+<h3 id="unbounded-and-bounded-datasets">Unbounded and bounded datasets</h3>
 
 <p>Beam allows you to create a single pipeline that can handle both bounded and
-unbounded types of input. If your input has a fixed number of elements, it’s
-considered a ‘bounded’ data set. If your input is continuously updating, then
-it’s considered ‘unbounded’ and you must use a runner that supports streaming.</p>
+unbounded datasets. If your dataset has a fixed number of elements, it is a bounded
+dataset and all of the data can be processed together. For bounded datasets,
+the question to ask is “Do I have all of the data?” If data continuously
+arrives (such as an endless stream of game scores in the
+<a href="https://beam.apache.org/get-started/mobile-gaming-example/">Mobile gaming example</a>,
+it is an unbounded dataset. An unbounded dataset is never fully available
+for processing at any one time, so the data must be processed using a streaming
+pipeline that runs continuously. The dataset will only be complete up to a
+certain point, so the question to ask is “Up until what point do I have all of
+the data?” Beam uses <a href="/documentation/programming-guide/#windowing">windowing</a>
+to divide a continuously updating dataset into logical windows of finite size.
+If your input is unbounded, you must use a runner that supports streaming.</p>
 
 <p>If your pipeline’s input is bounded, then all downstream PCollections will also be
 bounded. Similarly, if the input is unbounded, then all downstream PCollections
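As a minimal sketch of windowing a dataset before counting, per the
prose above (Python SDK assumed; the 15-second window size and sample
lines are illustrative):

    # Sketch: divide a dataset into fixed-size logical windows, then
    # reuse an ordinary counting transform per window. In a real
    # pipeline, elements carry timestamps assigned by the source.
    import apache_beam as beam
    from apache_beam.transforms import window

    with beam.Pipeline() as p:
        counts = (p
                  | beam.Create(['hello world', 'hello beam'])
                  | 'Window' >> beam.WindowInto(window.FixedWindows(15))
                  | beam.FlatMap(str.split)
                  | beam.combiners.Count.PerElement())
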
@@ -1532,7 +1540,7 @@ frequency count of the words seen in each 15 second window.</p>
 <p><strong>New Concepts:</strong></p>
 
 <ul>
-  <li>Reading an unbounded data set</li>
+  <li>Reading an unbounded dataset</li>
   <li>Writing unbounded results</li>
 </ul>
 
@@ -1593,9 +1601,9 @@ python -m apache_beam.examples.streaming_wordcount \
 (<a href="https://issues.apache.org/jira/browse/BEAM-4292">BEAM-4292</a>).</p>
 </blockquote>
 
-<h3 id="reading-an-unbounded-data-set">Reading an unbounded data set</h3>
+<h3 id="reading-an-unbounded-dataset">Reading an unbounded dataset</h3>
 
-<p>This example uses an unbounded data set as input. The code reads Pub/Sub
+<p>This example uses an unbounded dataset as input. The code reads Pub/Sub
 messages from a Pub/Sub subscription or topic using
 <a href="/documentation/sdks/pydoc/2.5.0/apache_beam.io.gcp.pubsub.html#apache_beam.io.gcp.pubsub.ReadStringsFromPubSub"><code class="highlighter-rouge">beam.io.ReadStringsFromPubSub</code></a>.</p>