Posted to commits@beam.apache.org by gi...@apache.org on 2019/09/19 23:25:11 UTC

[beam] branch asf-site updated: Publishing website 2019/09/19 23:24:57 at commit 6a81cdc

This is an automated email from the ASF dual-hosted git repository.

git-site-role pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/beam.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new e783406  Publishing website 2019/09/19 23:24:57 at commit 6a81cdc
e783406 is described below

commit e783406ddab6743cd7c060251032248e325b15b0
Author: jenkins <bu...@apache.org>
AuthorDate: Thu Sep 19 23:24:57 2019 +0000

    Publishing website 2019/09/19 23:24:57 at commit 6a81cdc
---
 .../documentation/runners/flink/index.html         | 24 ++++++----
 .../documentation/runners/spark/index.html         | 25 ++++++----
 .../roadmap/portability/index.html                 | 53 ++++++++++++++--------
 3 files changed, 65 insertions(+), 37 deletions(-)

diff --git a/website/generated-content/documentation/runners/flink/index.html b/website/generated-content/documentation/runners/flink/index.html
index 4363dd1..886409c 100644
--- a/website/generated-content/documentation/runners/flink/index.html
+++ b/website/generated-content/documentation/runners/flink/index.html
@@ -461,13 +461,10 @@ If you have a Flink <code class="highlighter-rouge">JobManager</code> running on
 <p><span class="language-py">
 As of now you will need a copy of Apache Beam’s source code. You can
 download it on the <a href="/get-started/downloads/">Downloads page</a>. In the future there will be pre-built Docker images
-available.
+available. To run a pipeline on an embedded Flink cluster:
 </span></p>
 
-<p><span class="language-py">1. <em>Only required once:</em> Build the SDK harness container (optionally replace py35 with the Python version of your choice): <code class="highlighter-rouge">./gradlew :sdks:python:container:py35:docker</code>
-</span></p>
-
-<p><span class="language-py">2. Start the JobService endpoint: <code class="highlighter-rouge">./gradlew :runners:flink:1.5:job-server:runShadow</code>
+<p><span class="language-py">1. Start the JobService endpoint: <code class="highlighter-rouge">./gradlew :runners:flink:1.5:job-server:runShadow</code>
 </span></p>
 
 <p><span class="language-py">
@@ -477,13 +474,17 @@ To execute the job on a Flink cluster, the Beam JobService needs to be
 provided with the Flink JobManager address.
 </span></p>
 
-<p><span class="language-py">3. Submit the Python pipeline to the above endpoint by using the <code class="highlighter-rouge">PortableRunner</code> and <code class="highlighter-rouge">job_endpoint</code> set to <code class="highlighter-rouge">localhost:8099</code> (this is the default address of the JobService). For example:
+<p><span class="language-py">2. Submit the Python pipeline to the above endpoint by using the <code class="highlighter-rouge">PortableRunner</code>, <code class="highlighter-rouge">job_endpoint</code> set to <code class="highlighter-rouge">localhost:8099</code> (this is the default address of the JobService), and <code class="highlighter-rouge">environment_type</code> set to <code class="highlighter-rouge">LOOPBACK</code>. For example:
 </span></p>
 
 <div class="language-py highlighter-rouge"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">apache_beam</span> <span class="kn">as</span> <span class="nn">beam</span>
 <span class="kn">from</span> <span class="nn">apache_beam.options.pipeline_options</span> <span class="kn">import</span> <span class="n">PipelineOptions</span>
 
-<span class="n">options</span> <span class="o">=</span> <span class="n">PipelineOptions</span><span class="p">([</span><span class="s">"--runner=PortableRunner"</span><span class="p">,</span> <span class="s">"--job_endpoint=localhost:8099"</span><span class="p">])</span>
+<span class="n">options</span> <span class="o">=</span> <span class="n">PipelineOptions</span><span class="p">([</span>
+    <span class="s">"--runner=PortableRunner"</span><span class="p">,</span>
+    <span class="s">"--job_endpoint=localhost:8099"</span><span class="p">,</span>
+    <span class="s">"--environment_type=LOOPBACK"</span>
+<span class="p">])</span>
 <span class="k">with</span> <span class="n">beam</span><span class="o">.</span><span class="n">Pipeline</span><span class="p">(</span><span class="n">options</span><span class="p">)</span> <span class="k">as</span> <span class="n">p</span><span class="p">:</span>
     <span class="o">...</span>
 </code></pre>
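For readers who want something runnable end to end, here is a minimal,
self-contained sketch that fills in the "..." above with an illustrative
wordcount body (the transforms are examples, not part of the docs page):

    # A hedged, complete sketch of the PortableRunner submission above.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions([
        "--runner=PortableRunner",
        "--job_endpoint=localhost:8099",  # default JobService address
        "--environment_type=LOOPBACK",    # local testing only
    ])
    with beam.Pipeline(options=options) as p:
        (p
         | "Create" >> beam.Create(["to be or not to be"])
         | "Split" >> beam.FlatMap(str.split)
         | "Pair" >> beam.Map(lambda word: (word, 1))
         | "Count" >> beam.CombinePerKey(sum)
         | "Print" >> beam.Map(print))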
@@ -500,6 +501,8 @@ To run on a separate <a href="https://ci.apache.org/projects/flink/flink-docs-re
 </span></p>
 
 <p><span class="language-py">3. Submit the pipeline as above.
+Note, however, that <code class="highlighter-rouge">environment_type=LOOPBACK</code> is only intended for local testing.
+See <a href="/roadmap/portability/#sdk-harness-config">here</a> for details.
 </span></p>
 
 <p><span class="language-py">As of Beam 2.15.0, steps 2 and 3 can be automated in Python by using the <code class="highlighter-rouge">FlinkRunner</code>,
@@ -509,7 +512,12 @@ plus the optional <code class="highlighter-rouge">flink_version</code> and <code
 <div class="language-py highlighter-rouge"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">apache_beam</span> <span class="kn">as</span> <span class="nn">beam</span>
 <span class="kn">from</span> <span class="nn">apache_beam.options.pipeline_options</span> <span class="kn">import</span> <span class="n">PipelineOptions</span>
 
-<span class="n">options</span> <span class="o">=</span> <span class="n">PipelineOptions</span><span class="p">([</span><span class="s">"--runner=FlinkRunner"</span><span class="p">,</span> <span class="s">"--flink_version=1.8"</span><span class="p">,</span> <span class="s">"--flink_master_url=localhost:8081"</span><span class="p">])</span>
+<span class="n">options</span> <span class="o">=</span> <span class="n">PipelineOptions</span><span class="p">([</span>
+    <span class="s">"--runner=FlinkRunner"</span><span class="p">,</span>
+    <span class="s">"--flink_version=1.8"</span><span class="p">,</span>
+    <span class="s">"--flink_master_url=localhost:8081"</span><span class="p">,</span>
+    <span class="s">"--environment_type=LOOPBACK"</span>
+<span class="p">])</span>
 <span class="k">with</span> <span class="n">beam</span><span class="o">.</span><span class="n">Pipeline</span><span class="p">(</span><span class="n">options</span><span class="p">)</span> <span class="k">as</span> <span class="n">p</span><span class="p">:</span>
     <span class="o">...</span>
 </code></pre>
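One variation worth noting, stated as an assumption about the runner's
defaults rather than documented behavior: if "--flink_master_url" is
omitted, the FlinkRunner falls back to an embedded local Flink cluster,
which can be handy for smoke tests:

    # Sketch: same pipeline on an embedded local Flink cluster
    # (assumes FlinkRunner defaults to a local cluster when no
    # --flink_master_url is supplied).
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions([
        "--runner=FlinkRunner",
        "--flink_version=1.8",
        "--environment_type=LOOPBACK",  # local testing only
    ])
    with beam.Pipeline(options=options) as p:
        ...  # pipeline body as in the sketch further above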
diff --git a/website/generated-content/documentation/runners/spark/index.html b/website/generated-content/documentation/runners/spark/index.html
index 9171417..f90ea91 100644
--- a/website/generated-content/documentation/runners/spark/index.html
+++ b/website/generated-content/documentation/runners/spark/index.html
@@ -382,10 +382,7 @@ download it on the <a href="/get-started/downloads/">Downloads page</a>. In the
 available.
 </span></p>
 
-<p><span class="language-py">1. <em>Only required once:</em> Build the SDK harness container (optionally replace py35 with the Python version of your choice): <code class="highlighter-rouge">./gradlew :sdks:python:container:py35:docker</code>
-</span></p>
-
-<p><span class="language-py">2. Start the JobService endpoint: <code class="highlighter-rouge">./gradlew :runners:spark:job-server:runShadow</code>
+<p><span class="language-py">1. Start the JobService endpoint: <code class="highlighter-rouge">./gradlew :runners:spark:job-server:runShadow</code>
 </span></p>
 
 <p><span class="language-py">
@@ -395,16 +392,19 @@ job. To execute the job on a Spark cluster, the Beam JobService needs to be
 provided with the Spark master address.
 </span></p>
 
-<p><span class="language-py">3. Submit the Python pipeline to the above endpoint by using the <code class="highlighter-rouge">PortableRunner</code> and <code class="highlighter-rouge">job_endpoint</code> set to <code class="highlighter-rouge">localhost:8099</code> (this is the default address of the JobService). For example:
+<p><span class="language-py">2. Submit the Python pipeline to the above endpoint by using the <code class="highlighter-rouge">PortableRunner</code>, <code class="highlighter-rouge">job_endpoint</code> set to <code class="highlighter-rouge">localhost:8099</code> (this is the default address of the JobService), and <code class="highlighter-rouge">environment_type</code> set to <code class="highlighter-rouge">LOOPBACK</code>. For example:
 </span></p>
 
 <div class="language-py highlighter-rouge"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">apache_beam</span> <span class="kn">as</span> <span class="nn">beam</span>
 <span class="kn">from</span> <span class="nn">apache_beam.options.pipeline_options</span> <span class="kn">import</span> <span class="n">PipelineOptions</span>
 
-<span class="n">options</span> <span class="o">=</span> <span class="n">PipelineOptions</span><span class="p">([</span><span class="s">"--runner=PortableRunner"</span><span class="p">,</span> <span class="s">"--job_endpoint=localhost:8099"</span><span class="p">])</span>
-<span class="n">p</span> <span class="o">=</span> <span class="n">beam</span><span class="o">.</span><span class="n">Pipeline</span><span class="p">(</span><span class="n">options</span><span class="p">)</span>
-<span class="o">..</span>
-<span class="n">p</span><span class="o">.</span><span class="n">run</span><span class="p">()</span>
+<span class="n">options</span> <span class="o">=</span> <span class="n">PipelineOptions</span><span class="p">([</span>
+    <span class="s">"--runner=PortableRunner"</span><span class="p">,</span>
+    <span class="s">"--job_endpoint=localhost:8099"</span><span class="p">,</span>
+    <span class="s">"--environment_type=LOOPBACK"</span>
+<span class="p">])</span>
+<span class="k">with</span> <span class="n">beam</span><span class="o">.</span><span class="n">Pipeline</span><span class="p">(</span><span class="n">options</span><span class="p">)</span> <span class="k">as</span> <span class="n">p</span><span class="p">:</span>
+    <span class="o">...</span>
 </code></pre>
 </div>
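The diff above also replaces the older explicit "p.run()" style with a
context manager. For anyone updating old snippets, the two forms are
roughly equivalent:

    # The with-block runs the pipeline and waits for completion on exit.
    # Explicit equivalent of the context-manager form:
    p = beam.Pipeline(options=options)
    # ... build the pipeline here ...
    result = p.run()
    result.wait_until_finish()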
 
@@ -420,6 +420,13 @@ For more details on the different deployment modes see: <a href="http://spark.ap
 </span></p>
 
 <p><span class="language-py">3. Submit the pipeline as above.
+Note, however, that <code class="highlighter-rouge">environment_type=LOOPBACK</code> is only intended for local testing.
+See <a href="/roadmap/portability/#sdk-harness-config">here</a> for details.
+</span></p>
+
+<p><span class="language-py">
+(Note that, depending on your cluster setup, you may need to change the <code class="highlighter-rouge">environment_type</code> option.
+See <a href="/roadmap/portability/#sdk-harness-config">here</a> for details.)
 </span></p>
 
 <h2 id="pipeline-options-for-the-spark-runner">Pipeline options for the Spark Runner</h2>
diff --git a/website/generated-content/roadmap/portability/index.html b/website/generated-content/roadmap/portability/index.html
index b8a4b53..5ec2aed 100644
--- a/website/generated-content/roadmap/portability/index.html
+++ b/website/generated-content/roadmap/portability/index.html
@@ -233,6 +233,7 @@
       <li><a href="#python-on-spark">Running Python wordcount on Spark</a></li>
     </ul>
   </li>
+  <li><a href="#sdk-harness-config">SDK Harness Configuration</a></li>
 </ul>
 
 
@@ -397,34 +398,46 @@ for details.</p>
 
 <h3 id="python-on-flink">Running Python wordcount on Flink</h3>
 
-<p>To run a basic Python wordcount (in batch mode) with embedded Flink:</p>
-
-<ol>
-  <li>Run once to build the SDK harness container (optionally replace py35 with the Python version of your choice): <code class="highlighter-rouge">./gradlew :sdks:python:container:py35:docker</code></li>
-  <li>Start the Flink portable JobService endpoint: <code class="highlighter-rouge">./gradlew :runners:flink:1.5:job-server:runShadow</code></li>
-  <li>In a new terminal, submit the wordcount pipeline to above endpoint: <code class="highlighter-rouge">./gradlew portableWordCount -PjobEndpoint=localhost:8099 -PenvironmentType=LOOPBACK</code></li>
-</ol>
-
-<p>To run the pipeline in streaming mode: <code class="highlighter-rouge">./gradlew portableWordCount -PjobEndpoint=localhost:8099 -Pstreaming</code></p>
-
-<p>Please see the <a href="/documentation/runners/flink/">Flink Runner page</a> for more information on
+<p>The Beam Flink runner can run Python pipelines in batch and streaming modes.
+Please see the <a href="/documentation/runners/flink/">Flink Runner page</a> for more information on
 how to run portable pipelines on top of Flink.</p>
 
 <h3 id="python-on-spark">Running Python wordcount on Spark</h3>
 
-<p>To run a basic Python wordcount (in batch mode) with embedded Spark:</p>
-
-<ol>
-  <li>Run once to build the SDK harness container: <code class="highlighter-rouge">./gradlew :sdks:python:container:docker</code></li>
-  <li>Start the Spark portable JobService endpoint: <code class="highlighter-rouge">./gradlew :runners:spark:job-server:runShadow</code></li>
-  <li>In a new terminal, submit the wordcount pipeline to above endpoint: <code class="highlighter-rouge">./gradlew portableWordCount -PjobEndpoint=localhost:8099 -PenvironmentType=LOOPBACK</code></li>
-</ol>
+<p>The Beam Spark runner can run Python pipelines in batch mode.
+Please see the <a href="/documentation/runners/spark/">Spark Runner page</a> for more information on
+how to run portable pipelines on top of Spark.</p>
 
 <p>Python streaming mode is not yet supported on Spark.</p>
 
-<p>Please see the <a href="/documentation/runners/spark/">Spark Runner page</a> for more information on
-how to run portable pipelines on top of Spark.</p>
+<h2 id="sdk-harness-config">SDK Harness Configuration</h2>
 
+<p>The Beam Python SDK allows configuration of the SDK harness to accommodate varying cluster setups.</p>
+
+<ul>
+  <li><code class="highlighter-rouge">environment_type</code> determines where user code will be executed.
+    <ul>
+      <li><code class="highlighter-rouge">LOOPBACK</code>: User code is executed within the same process that submitted the pipeline. This
+option is useful for local testing. However, it is not suitable for a production environment,
+as it requires a connection between the original Python process and the worker nodes, and
+performs work on the machine the job originated from, not the worker nodes.</li>
+      <li><code class="highlighter-rouge">PROCESS</code>: User code is executed by processes that are automatically started by the runner on
+each worker node.</li>
+      <li><code class="highlighter-rouge">DOCKER</code> (default): User code is executed within a container started on each worker node.
+This requires Docker to be installed on worker nodes. For more information, see
+<a href="https://github.com/apache/beam/blob/master/sdks/CONTAINERS.md">here</a>.</li>
+    </ul>
+  </li>
+  <li><code class="highlighter-rouge">environment_config</code> configures the environment depending on the value of <code class="highlighter-rouge">environment_type</code>.
+    <ul>
+      <li>When <code class="highlighter-rouge">environment_type=DOCKER</code>: URL for the Docker container image.</li>
+      <li>When <code class="highlighter-rouge">environment_type=PROCESS</code>: JSON of the form <code class="highlighter-rouge"><span class="p">{</span><span class="nt">"os"</span><span class="p">:</span><span class="w"> </span><span class="s2">"&lt;OS&gt;"</span><span class="p">,</span><span class="w"> </span><span class="nt">"arch"</span><span class="p">:</span><span class="w"> </span><span class="s2">"&lt;ARCHITECTURE&gt;"</span><span class="p">,</span><span class="w">
+</span><span class="nt">"command"</span><span class="p">:</span><span class="w"> </span><span class="s2">"&lt;process to execute&gt;"</span><span class="p">,</span><span class="w"> </span><span class="nt">"env"</span><span class="p">:{</span><span class="nt">"&lt;Environment variables 1&gt;"</span><span class="p">:</span><span class="w"> </span><span class="s2">"&lt;ENV_VAL&gt;"</span><span class="p">}</span><span class="w"> </span><span class="p">}</span></code>. All
+fields in the JSON are optional except <code class="highlighter-rouge">command</code>.</li>
+    </ul>
+  </li>
+  <li><code class="highlighter-rouge">sdk_worker_parallelism</code> sets the number of SDK workers that will run on each worker node.</li>
+</ul>
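To make the interplay of these options concrete, here is a sketch that
combines them for a PROCESS environment. The boot command path below is
hypothetical; whatever command you configure must exist on every worker
node:

    # Sketch: PROCESS environment with an explicit environment_config.
    import json
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Hypothetical path to an SDK worker boot binary on the workers.
    env_config = json.dumps({"command": "/opt/beam/boot"})

    options = PipelineOptions([
        "--runner=PortableRunner",
        "--job_endpoint=localhost:8099",
        "--environment_type=PROCESS",
        "--environment_config=" + env_config,
        "--sdk_worker_parallelism=2",  # SDK workers per worker node
    ])
    with beam.Pipeline(options=options) as p:
        ...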
 
       </div>
     </div>