Posted to commits@beam.apache.org by fr...@apache.org on 2016/11/06 05:30:31 UTC

[2/3] incubator-beam-site git commit: Initial pass on a real Quickstart

Initial pass on a real Quickstart

* Uses the maven archetypes to create and run a simple WordCount.
* Has a basic structure for runner-specific commands, though details (BEAM-899, BEAM-900, BEAM-904) and better formatting (BEAM-902) are left for future work.
* Adds some runner basics to the documentation landing page.


Project: http://git-wip-us.apache.org/repos/asf/incubator-beam-site/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-beam-site/commit/73e5d3f1
Tree: http://git-wip-us.apache.org/repos/asf/incubator-beam-site/tree/73e5d3f1
Diff: http://git-wip-us.apache.org/repos/asf/incubator-beam-site/diff/73e5d3f1

Branch: refs/heads/asf-site
Commit: 73e5d3f146182eaaa888575f7e6805d9cbf61ba3
Parents: bb5cccf
Author: Frances Perry <fj...@google.com>
Authored: Thu Nov 3 13:09:52 2016 -0700
Committer: Frances Perry <fj...@google.com>
Committed: Sat Nov 5 22:20:28 2016 -0700

----------------------------------------------------------------------
 src/documentation/index.md         |  37 +++++--
 src/documentation/runners/index.md |  15 ---
 src/documentation/sdks/index.md    |   9 --
 src/get-started/quickstart.md      | 181 ++++++++++++++++++++++++++++----
 4 files changed, 188 insertions(+), 54 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-beam-site/blob/73e5d3f1/src/documentation/index.md
----------------------------------------------------------------------
diff --git a/src/documentation/index.md b/src/documentation/index.md
index 6bd6be8..2ed18f3 100644
--- a/src/documentation/index.md
+++ b/src/documentation/index.md
@@ -7,24 +7,39 @@ redirect_from:
   - /docs/learn/
 ---
 
-# Learn about the Apache Beam Model
+# Apache Beam Documentation
 
 Get in-depth conceptual information and reference material for the Beam Model, SDKs and Runners:
 
-#### [Beam Programming Guide]({{ site.baseurl }}/learn/programming-guide/) 
+## Concepts 
+
 Learn about the Beam Programming Model and the concepts common to all Beam SDKs and Runners.
 
-#### Beam SDKs
+* The [Programming Guide]({{ site.baseurl }}/documentation/programming-guide/) introduces all the key Beam concepts.
+* Visit [Additional Resources]({{ site.baseurl }}/documentation/resources/) for some of our favorite articles and talks about Beam. 
+
+## SDKs
+
 Find status and reference information on all of the available Beam SDKs.
 
-* [Java SDK]({{ site.baseurl }}/learn/sdks/java/) 
+* [Java SDK]({{ site.baseurl }}/documentation/sdks/java/) 
+* _[Under Development]_ [Python SDK]({{ site.baseurl }}/contribute/work-in-progress/#feature-branches)
+
+## Runners
+
+A Beam Runner runs a Beam pipeline on a specific (often distributed) data processing system.
+
+### Available Runners
+
+* [DirectRunner]({{ site.baseurl }}/documentation/runners/direct/): Runs locally on your machine -- great for developing, testing, and debugging.
+* [FlinkRunner]({{ site.baseurl }}/documentation/runners/flink/): Runs on [Apache Flink](http://flink.apache.org).
+* [SparkRunner]({{ site.baseurl }}/documentation/runners/spark/): Runs on [Apache Spark](http://spark.apache.org).
+* [DataflowRunner]({{ site.baseurl }}/documentation/runners/dataflow/): Runs on [Google Cloud Dataflow](https://cloud.google.com/dataflow), a fully managed service within [Google Cloud Platform](https://cloud.google.com/).
+* _[Under Development]_ [ApexRunner]({{ site.baseurl }}/contribute/work-in-progress/#feature-branches): Runs on [Apache Apex](http://apex.apache.org).
+* _[Under Development]_ [GearpumpRunner]({{ site.baseurl }}/contribute/work-in-progress/#feature-branches): Runs on [Apache Gearpump (incubating)](http://gearpump.apache.org). 
 
-####  Runners
-Learn about the [Capability Matrix]({{ site.baseurl }}/learn/runners/capability-matrix/) and find status and reference information on all of the available Beam Runners:
+### Choosing a Runner
 
-* [Direct Runner]({{ site.baseurl }}/learn/runners/direct/)
-* [Apache Flink]({{ site.baseurl }}/learn/runners/flink/)
-* [Apache Spark]({{ site.baseurl }}/learn/runners/spark/)
-* [Cloud Dataflow]({{ site.baseurl }}/learn/runners/dataflow/)
+Beam is designed so that pipelines are portable across different runners. However, because every runner has different capabilities, runners differ in how fully they can implement the core concepts in the Beam model. The [Capability Matrix]({{ site.baseurl }}/documentation/runners/capability-matrix) provides a detailed comparison of runner functionality.
 
-#### [Additional Resources]({{ site.baseurl }}/learn/resources/)
+Once you have chosen which runner to use, see that runner's page for more information about any initial runner-specific setup as well as any required or optional `PipelineOptions` for configuring its execution. You may also want to refer back to the [Quickstart]({{ site.baseurl }}/get-started/quickstart) for instructions on executing the sample WordCount pipeline.

http://git-wip-us.apache.org/repos/asf/incubator-beam-site/blob/73e5d3f1/src/documentation/runners/index.md
----------------------------------------------------------------------
diff --git a/src/documentation/runners/index.md b/src/documentation/runners/index.md
deleted file mode 100644
index d6d2211..0000000
--- a/src/documentation/runners/index.md
+++ /dev/null
@@ -1,15 +0,0 @@
----
-layout: default
-title: "Beam Runners"
-permalink: /documentation/runners/
-redirect_from: /learn/runners/
----
-# Apache Beam Runners
-
-#### [Direct Runner]({{ site.baseurl }}/learn/runners/direct/) 
-
-#### [Apache Flink Runner]({{ site.baseurl }}/learn/runners/flink/) 
-
-#### [Apache Spark Runner]({{ site.baseurl }}/learn/runners/spark/) 
-
-#### [Cloud Dataflow Runner]({{ site.baseurl }}/learn/runners/dataflow/) 

http://git-wip-us.apache.org/repos/asf/incubator-beam-site/blob/73e5d3f1/src/documentation/sdks/index.md
----------------------------------------------------------------------
diff --git a/src/documentation/sdks/index.md b/src/documentation/sdks/index.md
deleted file mode 100644
index 1a5a08e..0000000
--- a/src/documentation/sdks/index.md
+++ /dev/null
@@ -1,9 +0,0 @@
----
-layout: default
-title: "Beam SDKs"
-permalink: /documentation/sdks/
-redirect_from: /learn/sdks/
----
-# Apache Beam SDKs
-
-#### [Java SDK]({{ site.baseurl }}/learn/sdks/java/) 

http://git-wip-us.apache.org/repos/asf/incubator-beam-site/blob/73e5d3f1/src/get-started/quickstart.md
----------------------------------------------------------------------
diff --git a/src/get-started/quickstart.md b/src/get-started/quickstart.md
index f0cd5c3..8402c55 100644
--- a/src/get-started/quickstart.md
+++ b/src/get-started/quickstart.md
@@ -7,28 +7,171 @@ redirect_from:
   - /getting-started/
 ---
 
-# Apache Beam Quickstart
+# Apache Beam Java SDK Quickstart 
 
-The Apache Beam project is in the process of bootstrapping. This includes the creation of project resources, the refactoring of the initial code submission, and the formulation of project documentation, planning, and design documents. Until the project is fully initialized, this page contains useful resources to learn more about the model and tools which comprise Apache Beam.
+This Quickstart will walk you through executing your first Beam pipeline to run [WordCount]({{ site.baseurl }}/get-started/wordcount-example), written using Beam's [Java SDK]({{ site.baseurl }}/documentation/sdks/java), on a [runner]({{ site.baseurl }}/documentation#runners) of your choice.
 
-## Articles & slides
-* [The world beyond batch: Streaming 101](https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101)
-* [The world beyong batch: Streaming 102](https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102)
-* [Dataflow/Beam & Spark: A Programming Model Comparison](https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison)
-* [Dataflow and open source - proposal to join the Apache Incubator](http://googlecloudplatform.blogspot.com/2016/01/Dataflow-and-open-source-proposal-to-join-the-Apache-Incubator.html)
+* TOC
+{:toc}
 
-## Current code
-The following GitHub repositories contain code which will be incorporated into Apache Beam.
 
-* [Dataflow Java SDK](https://github.com/GoogleCloudPlatform/DataflowJavaSDK)
-* [Flink Dataflow runner](https://github.com/dataArtisans/flink-dataflow)
-* [Spark Dataflow runner](https://github.com/cloudera/spark-dataflow)
+## Set up your Development Environment
+ 
+1. Download and install the [Java Development Kit (JDK)](http://www.oracle.com/technetwork/java/javase/downloads/index.html) version 1.7 or later. Verify that the [JAVA_HOME](https://docs.oracle.com/javase/8/docs/technotes/guides/troubleshoot/envvars001.html) environment variable is set and points to your JDK installation.
 
-These code repositories will be refactored and managed together (along with other code and new contributions) into a single repository.
+1. Download and install [Apache Maven](http://maven.apache.org/download.cgi) by following Maven's [installation guide](http://maven.apache.org/install.html) for your specific operating system.
 
-## Documentation
-* [Apache Beam incubation proposal](https://goo.gl/KJrEl7)
-* *Apache Beam technical vision*
-    * [Detailed](https://goo.gl/5qZt3d)
-    * [Summary](https://goo.gl/nk5OM0)
-* [Apache Beam technical documentation](https://goo.gl/ps8twC)
+
+## Get the WordCount Code
+
+The easiest way to get a copy of the WordCount pipeline is to use the following command to generate a simple Maven project that contains Beam's WordCount examples and builds against the most recent Beam release: 
+
+```
+$ mvn archetype:generate \
+      -DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples \
+      -DarchetypeVersion=LATEST \
+      -DarchetypeGroupId=org.apache.beam \
+      -DgroupId=org.example \
+      -DartifactId=word-count-beam \
+      -Dversion="0.1" \
+      -DinteractiveMode=false \
+      -Dpackage=org.apache.beam.examples
+```
+
+This will create a directory `word-count-beam` that contains a simple `pom.xml` and a series of example pipelines that count words in text files. 
+
+```
+$ cd word-count-beam/
+
+$ ls
+pom.xml	src
+
+$ ls src/main/java/org/apache/beam/examples/
+DebuggingWordCount.java	WindowedWordCount.java	common
+MinimalWordCount.java	WordCount.java
+```
+
+For a detailed introduction to the Beam concepts used in these examples, see the [WordCount Example Walkthrough]({{ site.baseurl }}/get-started/wordcount-example). Here, we'll just focus on executing `WordCount.java`.
+
+
+## Run WordCount
+
+A single Beam pipeline can run on multiple Beam [runners]({{ site.baseurl }}/documentation#runners), including the [SparkRunner]({{ site.baseurl }}/documentation/runners/spark), [FlinkRunner]({{ site.baseurl }}/documentation/runners/flink), or [DataflowRunner]({{ site.baseurl }}/documentation/runners/dataflow). The [DirectRunner]({{ site.baseurl }}/documentation/runners/direct) is a common runner for getting started, as it runs locally on your machine and requires no specific setup.
+
+After you've chosen which runner you'd like to use:
+
+1.  Ensure you've done any runner-specific setup.
+1.  Build your command line by:
+    1. Specifying a runner with `--runner=<runner>` (defaults to the [DirectRunner]({{ site.baseurl }}/documentation/runners/direct))
+    1. Adding any runner-specific required options 
+    1. Choosing input files and an output location that are accessible on the chosen runner. (For example, you can't access a local file if you are running the pipeline on an external cluster.)
+1.  Run your first WordCount pipeline.
+
+	1.  [DirectRunner]({{ site.baseurl }}/documentation/runners/direct)
+	
+		```
+		$ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
+		     -Dexec.args="--inputFile=pom.xml --output=counts"
+		```
+	
+	1.  [FlinkRunner]({{ site.baseurl }}/documentation/runners/flink)
+	
+		``` 
+		TODO BEAM-899
+		```
+	
+	1.  [SparkRunner]({{ site.baseurl }}/documentation/runners/spark)
+	
+		```
+		TODO BEAM-900
+		```
+	
+	1.  [DataflowRunner]({{ site.baseurl }}/documentation/runners/dataflow)
+	
+		```
+		$ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
+			 -Dexec.args="--runner=DataflowRunner --gcpTempLocation=gs://<your-gcs-bucket>/tmp \
+			              --inputFile=gs://apache-beam-samples/shakespeare/* --output=gs://<your-gcs-bucket>/counts"
+		```
+
+
+## Inspect the results
+
+Once the pipeline has completed, you can view the output. You'll notice that there may be multiple output files prefixed by `counts`. The exact number of these files is decided by the runner, giving it the flexibility to do efficient, distributed execution.
+
+1.  [DirectRunner]({{ site.baseurl }}/documentation/runners/direct)
+
+	```
+	$ ls counts*
+	```
+
+1.  [FlinkRunner]({{ site.baseurl }}/documentation/runners/flink)
+
+	``` 
+	TODO BEAM-899
+	```
+
+1.  [SparkRunner]({{ site.baseurl }}/documentation/runners/spark)
+
+	```
+	TODO BEAM-900
+	```
+
+
+1.  [DataflowRunner]({{ site.baseurl }}/documentation/runners/dataflow)
+
+	```
+	$ gsutil ls gs://<your-gcs-bucket>/counts*
+	```
+	
+When you look into the contents of the files, you'll see that they contain unique words and the number of occurrences of each word. The order of elements within each file may differ because the Beam model does not generally guarantee ordering, again to allow runners to optimize for efficiency.
+	
+1.  [DirectRunner]({{ site.baseurl }}/documentation/runners/direct)
+ 
+	```
+	$ more counts*
+	api: 9
+	bundled: 1
+	old: 4
+	Apache: 2
+	The: 1
+	limitations: 1
+	Foundation: 1
+	...
+	```
+
+1.  [FlinkRunner]({{ site.baseurl }}/documentation/runners/flink)
+
+	``` 
+	TODO BEAM-899
+	```
+
+1.  [SparkRunner]({{ site.baseurl }}/documentation/runners/spark)
+
+	```
+	TODO BEAM-900
+	```
+
+1.  [DataflowRunner]({{ site.baseurl }}/documentation/runners/dataflow)
+
+	```
+	$ gsutil cat gs://<your-gcs-bucket>/counts*
+	feature: 15
+	smother'st: 1
+	revelry: 1
+	bashfulness: 1
+	Bashful: 1
+	Below: 2
+	deserves: 32
+	barrenly: 1
+	...
+	```
+	
+## Next Steps
+
+* Learn more about these WordCount examples in the [WordCount Example Walkthrough]({{ site.baseurl }}/get-started/wordcount-example).
+* Dive into some of our favorite [articles and presentations]({{ site.baseurl }}/documentation/resources).
+* Join the Beam [users@]({{ site.baseurl }}/get-started/support#mailing-lists) mailing list.
+
+Please don't hesitate to [reach out]({{ site.baseurl }}/get-started/support) if you encounter any issues!
+	
\ No newline at end of file
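
A note on the "Inspect the results" step in the quickstart diff above: because a runner may write several output files, it can be handy to merge them and look at the most frequent words. The snippet below is a minimal sketch and is not part of this commit. `top_words` is a hypothetical helper name, and the sketch assumes the `word: count` line format shown in the sample output and a local output prefix such as `counts`.

```shell
# Hypothetical helper (not part of Beam): merge every output file whose name
# starts with the given prefix and print the N most frequent words.
# Assumes each line looks like "word: count", as in the quickstart sample output.
top_words() {
  prefix="$1"
  limit="${2:-10}"   # default to the top 10 when no limit is given
  # -t: splits fields on the colon; -k2,2nr sorts on the count, numerically, descending.
  cat "$prefix"* | sort -t: -k2,2nr | head -n "$limit"
}
```

For example, `top_words counts 5` would print the five most frequent words from the DirectRunner output. For output written to Cloud Storage, the same `sort | head` pipeline could be fed from `gsutil cat` instead of `cat`.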