Posted to commits@beam.apache.org by dh...@apache.org on 2016/12/01 22:31:33 UTC

[1/3] incubator-beam-site git commit: Closes #97

Repository: incubator-beam-site
Updated Branches:
  refs/heads/asf-site 1b458f102 -> f439af099


Closes #97


Project: http://git-wip-us.apache.org/repos/asf/incubator-beam-site/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-beam-site/commit/7e96f7b9
Tree: http://git-wip-us.apache.org/repos/asf/incubator-beam-site/tree/7e96f7b9
Diff: http://git-wip-us.apache.org/repos/asf/incubator-beam-site/diff/7e96f7b9

Branch: refs/heads/asf-site
Commit: 7e96f7b90d569e57e3c9711a51014b5e072d7188
Parents: 1b458f1 ac0c4e0
Author: Dan Halperin <dh...@google.com>
Authored: Thu Dec 1 14:30:22 2016 -0800
Committer: Dan Halperin <dh...@google.com>
Committed: Thu Dec 1 14:30:22 2016 -0800

----------------------------------------------------------------------
 src/documentation/runners/flink.md | 136 +++++++++++++++++++++++++++++++-
 1 file changed, 135 insertions(+), 1 deletion(-)
----------------------------------------------------------------------



[3/3] incubator-beam-site git commit: Regenerate website

Posted by dh...@apache.org.
Regenerate website


Project: http://git-wip-us.apache.org/repos/asf/incubator-beam-site/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-beam-site/commit/f439af09
Tree: http://git-wip-us.apache.org/repos/asf/incubator-beam-site/tree/f439af09
Diff: http://git-wip-us.apache.org/repos/asf/incubator-beam-site/diff/f439af09

Branch: refs/heads/asf-site
Commit: f439af099412e73da73a288cd212ff8e93221e35
Parents: 7e96f7b
Author: Dan Halperin <dh...@google.com>
Authored: Thu Dec 1 14:31:10 2016 -0800
Committer: Dan Halperin <dh...@google.com>
Committed: Thu Dec 1 14:31:10 2016 -0800

----------------------------------------------------------------------
 content/documentation/runners/flink/index.html | 138 +++++++++++++++++++-
 1 file changed, 137 insertions(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-beam-site/blob/f439af09/content/documentation/runners/flink/index.html
----------------------------------------------------------------------
diff --git a/content/documentation/runners/flink/index.html b/content/documentation/runners/flink/index.html
index 6ccaff7..edd5bcd 100644
--- a/content/documentation/runners/flink/index.html
+++ b/content/documentation/runners/flink/index.html
@@ -146,7 +146,143 @@
       <div class="row">
         <h1 id="using-the-apache-flink-runner">Using the Apache Flink Runner</h1>
 
-<p>This page is under construction (<a href="https://issues.apache.org/jira/browse/BEAM-506">BEAM-506</a>).</p>
+<p>The Apache Flink Runner can be used to execute Beam pipelines using <a href="https://flink.apache.org">Apache Flink</a>. When using the Flink Runner you will create a jar file containing your job that can be executed on a regular Flink cluster. It's also possible to execute a Beam pipeline using Flink's local execution mode without setting up a cluster. This is helpful for development and debugging of your pipeline.</p>
+
+<p>The Flink Runner and Flink are suitable for large scale, continuous jobs, and provide:</p>
+
+<ul>
+  <li>A streaming-first runtime that supports both batch processing and data streaming programs</li>
+  <li>A runtime that supports very high throughput and low event latency at the same time</li>
+  <li>Fault-tolerance with <em>exactly-once</em> processing guarantees</li>
+  <li>Natural back-pressure in streaming programs</li>
+  <li>Custom memory management for efficient and robust switching between in-memory and out-of-core data processing algorithms</li>
+  <li>Integration with YARN and other components of the Apache Hadoop ecosystem</li>
+</ul>
+
+<p>The <a href="/documentation/runners/capability-matrix/">Beam Capability Matrix</a> documents the supported capabilities of the Flink Runner.</p>
+
+<h2 id="flink-runner-prerequisites-and-setup">Flink Runner prerequisites and setup</h2>
+
+<p>If you want to use the local execution mode with the Flink Runner, you don't have to complete any setup.</p>
+
+<p>To use the Flink Runner for executing on a cluster, you have to set up a Flink cluster by following the Flink <a href="https://ci.apache.org/projects/flink/flink-docs-release-1.1/quickstart/setup_quickstart.html">setup quickstart</a>.</p>
+
+<p>To find out which version of Flink you need, run this command to check the version of the Flink dependency that your project is using:</p>
+<div class="highlighter-rouge"><pre class="highlight"><code>$ mvn dependency:tree -Pflink-runner |grep flink
+...
+[INFO] |  +- org.apache.flink:flink-streaming-java_2.10:jar:1.1.2:runtime
+...
+</code></pre>
+</div>
+<p>Here, we would need Flink 1.1.2.</p>
+
+<p>For more information, the <a href="https://ci.apache.org/projects/flink/flink-docs-release-1.1/">Flink Documentation</a> can be helpful.</p>
+
+<h3 id="specify-your-dependency">Specify your dependency</h3>
+
+<p>You must specify your dependency on the Flink Runner.</p>
+
+<div class="language-java highlighter-rouge"><pre class="highlight"><code><span class="o">&lt;</span><span class="n">dependency</span><span class="o">&gt;</span>
+  <span class="o">&lt;</span><span class="n">groupId</span><span class="o">&gt;</span><span class="n">org</span><span class="o">.</span><span class="na">apache</span><span class="o">.</span><span class="na">beam</span><span class="o">&lt;/</span><span class="n">groupId</span><span class="o">&gt;</span>
+  <span class="o">&lt;</span><span class="n">artifactId</span><span class="o">&gt;</span><span class="n">beam</span><span class="o">-</span><span class="n">runners</span><span class="o">-</span><span class="n">flink_2</span><span class="o">.</span><span class="mi">10</span><span class="o">&lt;/</span><span class="n">artifactId</span><span class="o">&gt;</span>
+  <span class="o">&lt;</span><span class="n">version</span><span class="o">&gt;</span><span class="mf">0.3</span><span class="o">.</span><span class="mi">0</span><span class="o">-</span><span class="n">incubating</span><span class="o">&lt;/</span><span class="n">version</span><span class="o">&gt;</span>
+  <span class="o">&lt;</span><span class="n">scope</span><span class="o">&gt;</span><span class="n">runtime</span><span class="o">&lt;/</span><span class="n">scope</span><span class="o">&gt;</span>
+<span class="o">&lt;/</span><span class="n">dependency</span><span class="o">&gt;</span>
+</code></pre>
+</div>
+
+<h2 id="executing-a-pipeline-on-a-flink-cluster">Executing a pipeline on a Flink cluster</h2>
+
+<p>To execute a pipeline on a Flink cluster, you need to package your program along with all dependencies in a so-called fat jar. How you do this depends on your build system, but if you follow the <a href="/get-started/quickstart/">Beam Quickstart</a>, this is the command that you have to run:</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>$ mvn package -Pflink-runner
+</code></pre>
+</div>
+<p>The Beam Quickstart Maven project is set up to use the Maven Shade plugin to create a fat jar, and the <code class="highlighter-rouge">-Pflink-runner</code> argument makes sure to include the dependency on the Flink Runner.</p>
+
+<p>To actually run the pipeline, use this command:</p>
+<div class="highlighter-rouge"><pre class="highlight"><code>$ mvn exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
+    -Pflink-runner \
+    -Dexec.args="--runner=FlinkRunner \
+      --inputFile=/path/to/pom.xml \
+      --output=/path/to/counts \
+      --flinkMaster=&lt;flink master url&gt; \
+      --filesToStage=target/word-count-beam--bundled-0.1.jar"
+</code></pre>
+</div>
+<p>If you have a Flink <code class="highlighter-rouge">JobManager</code> running on your local machine you can give <code class="highlighter-rouge">localhost:6123</code> for
+<code class="highlighter-rouge">flinkMaster</code>.</p>
+
+<h2 id="pipeline-options-for-the-flink-runner">Pipeline options for the Flink Runner</h2>
+
+<p>When executing your pipeline with the Flink Runner, you can set these pipeline options.</p>
+
+<table class="table table-bordered">
+<tr>
+  <th>Field</th>
+  <th>Description</th>
+  <th>Default Value</th>
+</tr>
+<tr>
+  <td><code>runner</code></td>
+  <td>The pipeline runner to use. This option allows you to determine the pipeline runner at runtime.</td>
+  <td>Set to <code>FlinkRunner</code> to run using Flink.</td>
+</tr>
+<tr>
+  <td><code>streaming</code></td>
+  <td>Whether streaming mode is enabled or disabled; <code>true</code> if enabled. Set to <code>true</code> if running pipelines with unbounded <code>PCollection</code>s.</td>
+  <td><code>false</code></td>
+</tr>
+<tr>
+  <td><code>flinkMaster</code></td>
+  <td>The URL of the Flink JobManager on which to execute pipelines. This can either be the address of a cluster JobManager, in the form <code>"host:port"</code>, or one of the special strings <code>"[local]"</code> or <code>"[auto]"</code>. <code>"[local]"</code> will start a local Flink cluster in the JVM, while <code>"[auto]"</code> will let the system decide where to execute the pipeline based on the environment.</td>
+  <td><code>[auto]</code></td>
+</tr>
+<tr>
+  <td><code>filesToStage</code></td>
+  <td>Jar Files to send to all workers and put on the classpath. Here you have to put the fat jar that contains your program along with all dependencies.</td>
+  <td>empty</td>
+</tr>
+
+<tr>
+  <td><code>parallelism</code></td>
+  <td>The degree of parallelism to be used when distributing operations onto workers.</td>
+  <td><code>1</code></td>
+</tr>
+<tr>
+  <td><code>checkpointingInterval</code></td>
+  <td>The interval between consecutive checkpoints (i.e. snapshots of the current pipeline state used for fault tolerance).</td>
+  <td><code>-1L</code>, i.e. disabled</td>
+</tr>
+<tr>
+  <td><code>numberOfExecutionRetries</code></td>
+  <td>Sets the number of times that failed tasks are re-executed. A value of <code>0</code> effectively disables fault tolerance. A value of <code>-1</code> indicates that the system default value (as defined in the configuration) should be used.</td>
+  <td><code>-1</code></td>
+</tr>
+<tr>
+  <td><code>executionRetryDelay</code></td>
+  <td>Sets the delay between executions. A value of <code>-1</code> indicates that the default value should be used.</td>
+  <td><code>-1</code></td>
+</tr>
+<tr>
+  <td><code>stateBackend</code></td>
+  <td>Sets the state backend to use in streaming mode. The default is to read this setting from the Flink config.</td>
+  <td><code>empty</code>, i.e. read from Flink config</td>
+</tr>
+</table>
+
+<p>See the reference documentation for the <span class="language-java"><a href="/documentation/sdks/javadoc/0.3.0-incubating/index.html?org/apache/beam/runners/flink/FlinkPipelineOptions.html">FlinkPipelineOptions</a></span><span class="language-python"><a href="https://github.com/apache/incubator-beam/blob/python-sdk/sdks/python/apache_beam/utils/options.py">PipelineOptions</a></span> interface (and its subinterfaces) for the complete list of pipeline configuration options.</p>
+
+<h2 id="additional-information-and-caveats">Additional information and caveats</h2>
+
+<h3 id="monitoring-your-job">Monitoring your job</h3>
+
+<p>You can monitor a running Flink job using the Flink JobManager Dashboard. By default, this is available at port <code class="highlighter-rouge">8081</code> of the JobManager node. If you have a Flink installation on your local machine, that would be <code class="highlighter-rouge">http://localhost:8081</code>.</p>
+
+<h3 id="streaming-execution">Streaming Execution</h3>
+
+<p>If your pipeline uses an unbounded data source or sink, the Flink Runner will automatically switch to streaming mode. You can enforce streaming mode by using the <code class="highlighter-rouge">streaming</code> setting mentioned above.</p>
+
 
       </div>
 


[2/3] incubator-beam-site git commit: [BEAM-506] Fill in the documentation/runners/flink portion of the website

Posted by dh...@apache.org.
[BEAM-506] Fill in the documentation/runners/flink portion of the website


Project: http://git-wip-us.apache.org/repos/asf/incubator-beam-site/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-beam-site/commit/ac0c4e06
Tree: http://git-wip-us.apache.org/repos/asf/incubator-beam-site/tree/ac0c4e06
Diff: http://git-wip-us.apache.org/repos/asf/incubator-beam-site/diff/ac0c4e06

Branch: refs/heads/asf-site
Commit: ac0c4e063459ca251354b94eed866c0934548fec
Parents: 1b458f1
Author: Aljoscha Krettek <al...@gmail.com>
Authored: Tue Nov 29 16:23:03 2016 +0100
Committer: Dan Halperin <dh...@google.com>
Committed: Thu Dec 1 14:30:22 2016 -0800

----------------------------------------------------------------------
 src/documentation/runners/flink.md | 136 +++++++++++++++++++++++++++++++-
 1 file changed, 135 insertions(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-beam-site/blob/ac0c4e06/src/documentation/runners/flink.md
----------------------------------------------------------------------
diff --git a/src/documentation/runners/flink.md b/src/documentation/runners/flink.md
index 4145be6..a984bb4 100644
--- a/src/documentation/runners/flink.md
+++ b/src/documentation/runners/flink.md
@@ -6,4 +6,138 @@ redirect_from: /learn/runners/flink/
 ---
 # Using the Apache Flink Runner
 
-This page is under construction ([BEAM-506](https://issues.apache.org/jira/browse/BEAM-506)).
+The Apache Flink Runner can be used to execute Beam pipelines using [Apache Flink](https://flink.apache.org). When using the Flink Runner you will create a jar file containing your job that can be executed on a regular Flink cluster. It's also possible to execute a Beam pipeline using Flink's local execution mode without setting up a cluster. This is helpful for development and debugging of your pipeline.
+
+The Flink Runner and Flink are suitable for large scale, continuous jobs, and provide:
+
+* A streaming-first runtime that supports both batch processing and data streaming programs
+* A runtime that supports very high throughput and low event latency at the same time
+* Fault-tolerance with *exactly-once* processing guarantees
+* Natural back-pressure in streaming programs
+* Custom memory management for efficient and robust switching between in-memory and out-of-core data processing algorithms
+* Integration with YARN and other components of the Apache Hadoop ecosystem
+
+The [Beam Capability Matrix]({{ site.baseurl }}/documentation/runners/capability-matrix/) documents the supported capabilities of the Flink Runner.
+
+## Flink Runner prerequisites and setup
+
+If you want to use the local execution mode with the Flink Runner, you don't have to complete any setup.
+
+To use the Flink Runner for executing on a cluster, you have to set up a Flink cluster by following the Flink [setup quickstart](https://ci.apache.org/projects/flink/flink-docs-release-1.1/quickstart/setup_quickstart.html).
+
+To find out which version of Flink you need, run this command to check the version of the Flink dependency that your project is using:
+```
+$ mvn dependency:tree -Pflink-runner |grep flink
+...
+[INFO] |  +- org.apache.flink:flink-streaming-java_2.10:jar:1.1.2:runtime
+...
+```
+Here, we would need Flink 1.1.2.
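As a sketch (not part of the original page), the version can also be extracted from that output with standard shell tools; the sample line below is copied from the `dependency:tree` output above, and the `sed` pattern assumes Maven's usual `group:artifact:jar:version:scope` coordinate format:

```shell
# Extract the Flink version from a Maven dependency:tree line.
# The sample line mirrors the dependency:tree output shown above.
line='[INFO] |  +- org.apache.flink:flink-streaming-java_2.10:jar:1.1.2:runtime'
flink_version=$(printf '%s' "$line" | sed -n 's/.*:jar:\([0-9][0-9.]*\):.*/\1/p')
echo "$flink_version"
```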
+
+For more information, the [Flink Documentation](https://ci.apache.org/projects/flink/flink-docs-release-1.1/) can be helpful.
+
+### Specify your dependency
+
+You must specify your dependency on the Flink Runner.
+
+```java
+<dependency>
+  <groupId>org.apache.beam</groupId>
+  <artifactId>beam-runners-flink_2.10</artifactId>
+  <version>{{ site.release_latest }}</version>
+  <scope>runtime</scope>
+</dependency>
+```
+
+## Executing a pipeline on a Flink cluster
+
+To execute a pipeline on a Flink cluster, you need to package your program along with all dependencies in a so-called fat jar. How you do this depends on your build system, but if you follow the [Beam Quickstart]({{ site.baseurl }}/get-started/quickstart/), this is the command that you have to run:
+
+```
+$ mvn package -Pflink-runner
+```
+The Beam Quickstart Maven project is set up to use the Maven Shade plugin to create a fat jar, and the `-Pflink-runner` argument makes sure to include the dependency on the Flink Runner.
+
+To actually run the pipeline, use this command:
+```
+$ mvn exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
+    -Pflink-runner \
+    -Dexec.args="--runner=FlinkRunner \
+      --inputFile=/path/to/pom.xml \
+      --output=/path/to/counts \
+      --flinkMaster=<flink master url> \
+      --filesToStage=target/word-count-beam--bundled-0.1.jar"
+```
+If you have a Flink `JobManager` running on your local machine you can give `localhost:6123` for
+`flinkMaster`.
+
+## Pipeline options for the Flink Runner
+
+When executing your pipeline with the Flink Runner, you can set these pipeline options.
+
+<table class="table table-bordered">
+<tr>
+  <th>Field</th>
+  <th>Description</th>
+  <th>Default Value</th>
+</tr>
+<tr>
+  <td><code>runner</code></td>
+  <td>The pipeline runner to use. This option allows you to determine the pipeline runner at runtime.</td>
+  <td>Set to <code>FlinkRunner</code> to run using Flink.</td>
+</tr>
+<tr>
+  <td><code>streaming</code></td>
+  <td>Whether streaming mode is enabled or disabled; <code>true</code> if enabled. Set to <code>true</code> if running pipelines with unbounded <code>PCollection</code>s.</td>
+  <td><code>false</code></td>
+</tr>
+<tr>
+  <td><code>flinkMaster</code></td>
+  <td>The URL of the Flink JobManager on which to execute pipelines. This can either be the address of a cluster JobManager, in the form <code>"host:port"</code>, or one of the special strings <code>"[local]"</code> or <code>"[auto]"</code>. <code>"[local]"</code> will start a local Flink cluster in the JVM, while <code>"[auto]"</code> will let the system decide where to execute the pipeline based on the environment.</td>
+  <td><code>[auto]</code></td>
+</tr>
+<tr>
+  <td><code>filesToStage</code></td>
+  <td>Jar Files to send to all workers and put on the classpath. Here you have to put the fat jar that contains your program along with all dependencies.</td>
+  <td>empty</td>
+</tr>
+
+<tr>
+  <td><code>parallelism</code></td>
+  <td>The degree of parallelism to be used when distributing operations onto workers.</td>
+  <td><code>1</code></td>
+</tr>
+<tr>
+  <td><code>checkpointingInterval</code></td>
+  <td>The interval between consecutive checkpoints (i.e. snapshots of the current pipeline state used for fault tolerance).</td>
+  <td><code>-1L</code>, i.e. disabled</td>
+</tr>
+<tr>
+  <td><code>numberOfExecutionRetries</code></td>
+  <td>Sets the number of times that failed tasks are re-executed. A value of <code>0</code> effectively disables fault tolerance. A value of <code>-1</code> indicates that the system default value (as defined in the configuration) should be used.</td>
+  <td><code>-1</code></td>
+</tr>
+<tr>
+  <td><code>executionRetryDelay</code></td>
+  <td>Sets the delay between executions. A value of <code>-1</code> indicates that the default value should be used.</td>
+  <td><code>-1</code></td>
+</tr>
+<tr>
+  <td><code>stateBackend</code></td>
+  <td>Sets the state backend to use in streaming mode. The default is to read this setting from the Flink config.</td>
+  <td><code>empty</code>, i.e. read from Flink config</td>
+</tr>
+</table>
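The options in the table above can be combined in the same `exec.args` string shown earlier. As an illustrative sketch (not from the original page; the main class and paths are placeholders reused from the Quickstart example):

```
$ mvn exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
    -Pflink-runner \
    -Dexec.args="--runner=FlinkRunner \
      --flinkMaster=[local] \
      --parallelism=4 \
      --checkpointingInterval=10000 \
      --inputFile=/path/to/pom.xml \
      --output=/path/to/counts"
```

Here `--parallelism=4` and `--checkpointingInterval=10000` correspond to the `parallelism` and `checkpointingInterval` rows in the table above.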
+
+See the reference documentation for the <span class="language-java">[FlinkPipelineOptions]({{ site.baseurl }}/documentation/sdks/javadoc/{{ site.release_latest }}/index.html?org/apache/beam/runners/flink/FlinkPipelineOptions.html)</span><span class="language-python">[PipelineOptions](https://github.com/apache/incubator-beam/blob/python-sdk/sdks/python/apache_beam/utils/options.py)</span> interface (and its subinterfaces) for the complete list of pipeline configuration options.
+
+## Additional information and caveats
+
+### Monitoring your job
+
+You can monitor a running Flink job using the Flink JobManager Dashboard. By default, this is available at port `8081` of the JobManager node. If you have a Flink installation on your local machine, that would be `http://localhost:8081`.
+
+### Streaming Execution
+
+If your pipeline uses an unbounded data source or sink, the Flink Runner will automatically switch to streaming mode. You can enforce streaming mode by using the `streaming` setting mentioned above.
+