Posted to commits@beam.apache.org by da...@apache.org on 2017/04/21 18:11:45 UTC

[2/4] beam-site git commit: Fix Apex runner instructions (pending review comments and other changes) closes #98

Fix Apex runner instructions (pending review comments and other changes)
closes #98


Project: http://git-wip-us.apache.org/repos/asf/beam-site/repo
Commit: http://git-wip-us.apache.org/repos/asf/beam-site/commit/0da810c9
Tree: http://git-wip-us.apache.org/repos/asf/beam-site/tree/0da810c9
Diff: http://git-wip-us.apache.org/repos/asf/beam-site/diff/0da810c9

Branch: refs/heads/asf-site
Commit: 0da810c9ca062bbe85fd17581aeacac04b5d2ffe
Parents: 01641c2
Author: Thomas Weise <th...@apache.org>
Authored: Fri Apr 21 00:55:52 2017 -0700
Committer: Davor Bonaci <da...@google.com>
Committed: Fri Apr 21 11:10:46 2017 -0700

----------------------------------------------------------------------
 src/documentation/runners/apex.md | 44 ++++++++++++++++++++++------------
 1 file changed, 29 insertions(+), 15 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/beam-site/blob/0da810c9/src/documentation/runners/apex.md
----------------------------------------------------------------------
diff --git a/src/documentation/runners/apex.md b/src/documentation/runners/apex.md
index a74c310..9e97d77 100644
--- a/src/documentation/runners/apex.md
+++ b/src/documentation/runners/apex.md
@@ -7,44 +7,58 @@ permalink: /documentation/runners/apex/
 
 The Apex Runner executes Apache Beam pipelines using [Apache Apex](http://apex.apache.org/) as an underlying engine. The runner has broad support for the [Beam model and supports streaming and batch pipelines]({{ site.baseurl }}/documentation/runners/capability-matrix/).
 
-[Apache Apex](http://apex.apache.org/) is a stream processing platform and framework for low-latency, high-throughput and fault-tolerant analytics applications on Apache Hadoop. Apex has a unified streaming architecture and can be used for real-time and batch processing. With its stateful stream processing architecture, Apex can support all of the concepts in the Beam model (event time, triggers, watermarks etc.).
+[Apache Apex](http://apex.apache.org/) is a stream processing platform and framework for low-latency, high-throughput and fault-tolerant analytics applications on Apache Hadoop. Apex has a unified streaming architecture and can be used for real-time and batch processing.
 
-## Apex-Runner prerequisites and setup
+## Apex Runner prerequisites
 
-You may set up your own Hadoop cluster,  and [setup Apache Apex on top of it](http://apex.apache.org/docs/apex/apex_development_setup/) or choose any vendor-specific distribution that includes Hadoop and Apex pre-installed. Please see the [distribution information on the Apache Apex website](http://apex.apache.org/downloads.html).
+You may set up your own Hadoop cluster. Beam does not require anything extra to launch the pipelines on YARN.
+An optional Apex installation may be useful for monitoring and troubleshooting.
+The Apex CLI can be [built](http://apex.apache.org/docs/apex/apex_development_setup/) or
+obtained as a [binary build](http://www.atrato.io/blog/2017/04/08/apache-apex-cli/).
+For more download options, see the [distribution information on the Apache Apex website](http://apex.apache.org/downloads.html).
 
-## Running wordcount using Apex-Runner
+## Running wordcount using Apex Runner
 
-Download some data for processing and put it on HDFS
+Put data for processing into HDFS:
 ```
-curl http://www.gutenberg.org/cache/epub/1128/pg1128.txt > /tmp/kinglear.txt
 hdfs dfs -mkdir -p /tmp/input/
-hdfs dfs -put /tmp/kinglear.txt /tmp/input/
+hdfs dfs -put pom.xml /tmp/input/
 ```
 
-The output directory should not exist on HDFS. Delete it if it exists.
+The output directory should not exist on HDFS:
 ```
 hdfs dfs -rm -r -f /tmp/output/
 ```
 
-Run the wordcount example
+Run the wordcount example (*the example project needs to be modified to include the HDFS file provider*):
 ```
-mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount -Dexec.args="--inputFile=/tmp/input/ --output=/tmp/output/ --runner=ApexRunner --embeddedExecution=false --configFile=beam-runners-apex.properties" -Papex-runner
+mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount -Dexec.args="--inputFile=/tmp/input/pom.xml --output=/tmp/output/ --runner=ApexRunner --embeddedExecution=false --configFile=beam-runners-apex.properties" -Papex-runner
+```
+
+The application will run asynchronously. Check its status with `yarn application -list -appStates ALL`.
+
+The configuration file is optional; it can be used to influence how Apex operators are deployed into YARN containers.
+The following example reduces the number of required containers by collocating the operators into the same container
+and lowers the heap memory per operator, which is suitable for execution in a single-node Hadoop sandbox.
+
+```
+dt.application.*.operator.*.attr.MEMORY_MB=64
+dt.stream.*.prop.locality=CONTAINER_LOCAL
+dt.application.*.operator.*.attr.TIMEOUT_WINDOW_COUNT=1200
 ```
 
-This will launch an Apex application.
 
 ## Checking output
 
-The sample program which is processing small amount of data would finish quickly. You can check contents on /tmp/output/ on HDFS
+Check the output of the pipeline in the HDFS location.
 ```
 hdfs dfs -ls /tmp/output/
 ```
 
 ## Monitoring progress of your job
 
-Depending on your installation, you may be able to monitor the progress of your job on the Hadoop cluster. Alternately, you have folloing optoins:
+Depending on your installation, you may be able to monitor the progress of your job on the Hadoop cluster. Alternatively, you have the following options:
 
-* YARN : Using YARN web UI generally running on 8088 on the node running resource manager
-* Apex cli: [Using apex cli to get running application information](http://apex.apache.org/docs/apex/apex_cli/#apex-cli-commands)
+* YARN: Using the YARN web UI, which generally runs on port 8088 on the node running the resource manager.
+* Apex command-line interface: [Using the Apex CLI to get running application information](http://apex.apache.org/docs/apex/apex_cli/#apex-cli-commands).