Posted to commits@beam.apache.org by am...@apache.org on 2016/03/17 18:08:38 UTC

[1/2] incubator-beam git commit: [BEAM-113] Update Spark runner README

Repository: incubator-beam
Updated Branches:
  refs/heads/master ef1e32dee -> 659f0b877


[BEAM-113] Update Spark runner README

Just a couple more changes


Project: http://git-wip-us.apache.org/repos/asf/incubator-beam/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-beam/commit/b0db3131
Tree: http://git-wip-us.apache.org/repos/asf/incubator-beam/tree/b0db3131
Diff: http://git-wip-us.apache.org/repos/asf/incubator-beam/diff/b0db3131

Branch: refs/heads/master
Commit: b0db313199fcb01eec04457c2a04103b7a218a1a
Parents: ef1e32d
Author: Sela <an...@paypal.com>
Authored: Wed Mar 16 22:22:35 2016 +0200
Committer: Sela <an...@paypal.com>
Committed: Thu Mar 17 18:54:45 2016 +0200

----------------------------------------------------------------------
 runners/spark/README.md | 112 ++++++++++++++++++++++++-------------------
 1 file changed, 63 insertions(+), 49 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-beam/blob/b0db3131/runners/spark/README.md
----------------------------------------------------------------------
diff --git a/runners/spark/README.md b/runners/spark/README.md
index ccf8516..1d75b35 100644
--- a/runners/spark/README.md
+++ b/runners/spark/README.md
@@ -1,55 +1,78 @@
-spark-dataflow
-==============
+Spark Beam Runner (Spark-Runner)
+================================
 
 ## Intro
 
-Spark-dataflow allows users to execute data pipelines written against the Google Cloud Dataflow API
-with Apache Spark. Spark-dataflow is an early prototype, and we'll be working on it continuously.
-If this project interests you, we welcome issues, comments, and (especially!) pull requests.
-To get an idea of what we have already identified as
-areas that need improvement, checkout the issues listed in the github repo.
+The Spark-Runner allows users to execute data pipelines written against the Apache Beam API
+with Apache Spark. This runner allows the execution of both batch and streaming pipelines on top of the Spark engine.
 
-## Motivation
+## Overview
 
-We had two primary goals when we started working on Spark-dataflow:
+### Features
 
-1. *Provide portability for data pipelines written for Google Cloud Dataflow.* Google makes
-it really easy to get started writing pipelines against the Dataflow API, but they wanted
-to be sure that creating a pipeline using their tools would not lock developers in to their
-platform. A Spark-based implementation of Dataflow means that you can take your pipeline
-logic with you wherever you go. This also means that any new machine learning and anomaly
-detection algorithms that are developed against the Dataflow API are available to everyone,
-regardless of their underlying execution platform.
+- ParDo
+- GroupByKey
+- Combine
+- Windowing
+- Flatten
+- View
+- Side inputs/outputs
+- Encoding
+
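+As a rough illustration, here is a minimal sketch chaining a few of these
+transforms (assuming the usual SDK imports; the element values and the
+splitting logic are arbitrary examples):
+
+    PCollection<String> words = p
+        .apply(Create.of("hello world", "hello beam"))
+        .apply(ParDo.of(new DoFn<String, String>() {
+          @Override
+          public void processElement(ProcessContext c) {
+            // Emit one output element per whitespace-separated token.
+            for (String word : c.element().split("\\s+")) {
+              c.output(word);
+            }
+          }
+        }));
+
+    // Count.perElement() is a composite transform built on GroupByKey and Combine.
+    PCollection<KV<String, Long>> counts = words.apply(Count.<String>perElement());
+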
+### Sources and Sinks
+
+- Text
+- Hadoop
+- Avro
+- Kafka
+
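+As a sketch, a bounded pipeline might read and write text files like so (the
+paths are placeholders):
+
+    PCollection<String> lines = p.apply(TextIO.Read.from("/tmp/kinglear.txt"));
+    lines.apply(TextIO.Write.to("/tmp/output"));
+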
+### Fault-Tolerance
+
+The Spark runner provides the same fault-tolerance guarantees as [Apache Spark](http://spark.apache.org/).
+
+### Monitoring
+
+The Spark runner supports monitoring via Beam Aggregators implemented on top of Spark's [Accumulators](http://spark.apache.org/docs/latest/programming-guide.html#accumulators).  
+Spark also provides a web UI for monitoring; more details are available [here](http://spark.apache.org/docs/latest/monitoring.html).
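+
+As a sketch, an Aggregator is declared inside a `DoFn` (this follows the SDK's
+WordCount example; the aggregator name is an arbitrary choice):
+
+    DoFn<String, String> fn = new DoFn<String, String>() {
+      // Backed by a Spark Accumulator when the pipeline runs on the Spark runner.
+      private final Aggregator<Long, Long> emptyLines =
+          createAggregator("emptyLines", new Sum.SumLongFn());
+
+      @Override
+      public void processElement(ProcessContext c) {
+        if (c.element().trim().isEmpty()) {
+          emptyLines.addValue(1L);
+        }
+        c.output(c.element());
+      }
+    };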
+
+## Beam Model Support
+
+### Batch
+
+The Spark runner provides support for batch processing of bounded Beam PCollections as Spark [RDD](http://spark.apache.org/docs/latest/programming-guide.html#resilient-distributed-datasets-rdds)s.
+
+### Streaming
+
+The Spark runner currently provides partial support for stream processing of unbounded Beam PCollections as Spark [DStream](http://spark.apache.org/docs/latest/streaming-programming-guide.html#discretized-streams-dstreams)s.  
+The current implementation of *Windowing* takes the first window size in the pipeline and treats it as the Spark "batch interval", while subsequent windows are treated as *Processing Time* windows.  
+Further work is required to fully support the Beam Model's *event-time* and *out-of-order* stream processing.
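+
+For instance, a pipeline whose first window is a 30-second fixed window gives
+the runner a 30-second batch interval (a sketch; the duration and input
+collection are arbitrary examples):
+
+    PCollection<String> windowed = input.apply(
+        Window.<String>into(FixedWindows.of(Duration.standardSeconds(30))));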
+
+### Issue Tracking
+
+See [Beam JIRA](https://issues.apache.org/jira/browse/BEAM) (runner-spark).
 
-2. *Experiment with new data pipeline design patterns.* The Dataflow API has a number of
-interesting ideas, especially with respect to the unification of batch and stream data
-processing into a single API that maps into two separate engines. The Dataflow streaming
-engine, based on Google's [Millwheel](http://research.google.com/pubs/pub41378.html), does
-not have a direct open source analogue, and we wanted to understand how to replicate its
-functionality using frameworks like Spark Streaming.
 
 ## Getting Started
 
-The Maven coordinates of the current version of this project are:
+To get the latest version of the Spark Runner, first clone the Beam repository:
+
+    git clone https://github.com/apache/incubator-beam
 
-    <groupId>com.cloudera.dataflow.spark</groupId>
-    <artifactId>spark-dataflow</artifactId>
-    <version>0.4.2</version>
     
-and are hosted in Cloudera's repository at:
+Then switch to the newly created directory and run Maven to build Apache Beam:
+
+    cd incubator-beam
+    mvn clean install -DskipTests
 
-    <repository>
-      <id>cloudera.repo</id>
-      <url>https://repository.cloudera.com/artifactory/cloudera-repos</url>
-    </repository>
+Now Apache Beam and the Spark Runner are installed in your local Maven repository.
 
-If we wanted to run a dataflow pipeline with the default options of a single threaded spark
+If we wanted to run a Beam pipeline with the default options of a single-threaded Spark
 instance in local mode, we would do the following:
 
     Pipeline p = <logic for pipeline creation >
     EvaluationResult result = SparkPipelineRunner.create().run(p);
 
-To create a pipeline runner to run against a different spark cluster, with a custom master url we
+To create a pipeline runner to run against a different Spark cluster, with a custom master URL, we
 would do the following:
 
     Pipeline p = <logic for pipeline creation >
@@ -62,7 +85,11 @@ would do the following:
 First download a text document to use as input:
 
     curl http://www.gutenberg.org/cache/epub/1128/pg1128.txt > /tmp/kinglear.txt
+
+Switch to the Spark runner directory:
 
+    cd runners/spark
+
 Then run the [word count example][wc] from the SDK using a single threaded Spark instance
 in local mode:
 
@@ -77,11 +104,10 @@ Check the output by running:
 __Note: running examples using `mvn exec:exec` only works for Spark local mode at the
 moment. See the next section for how to run on a cluster.__
 
-[wc]: https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/examples/src/main/java/com/google/cloud/dataflow/examples/WordCount.java
-
+[wc]: https://github.com/apache/incubator-beam/blob/master/examples/src/main/java/com/google/cloud/dataflow/examples/WordCount.java
+
 ## Running on a Cluster
 
-Spark Dataflow pipelines can be run on a cluster using the `spark-submit` command.
+Beam pipelines can be run on a Spark cluster using the `spark-submit` command.
 
 First copy a text document to HDFS:
 
@@ -93,21 +119,9 @@ Then run the word count example using Spark submit with the `yarn-client` master
     spark-submit \
       --class com.google.cloud.dataflow.examples.WordCount \
       --master yarn-client \
-      target/spark-dataflow-*-spark-app.jar \
+      target/spark-runner-*-spark-app.jar \
         --inputFile=kinglear.txt --output=out --runner=SparkPipelineRunner --sparkMaster=yarn-client
 
 Check the output by running:
 
     hadoop fs -tail out-00000-of-00002
-
-## How to Release
-
-Committers can release the project using the standard [Maven Release Plugin](http://maven.apache.org/maven-release/maven-release-plugin/) commands:
-
-    mvn release:prepare
-    mvn release:perform -Darguments="-Dgpg.passphrase=XXX"
-
-Note that you will need a [public GPG key](http://www.apache.org/dev/openpgp.html).
-
-[![Build Status](https://travis-ci.org/cloudera/spark-dataflow.png?branch=master)](https://travis-ci.org/cloudera/spark-dataflow)
-[![codecov.io](https://codecov.io/github/cloudera/spark-dataflow/coverage.svg?branch=master)](https://codecov.io/github/cloudera/spark-dataflow?branch=master)


[2/2] incubator-beam git commit: This closes #55

Posted by am...@apache.org.
This closes #55


Project: http://git-wip-us.apache.org/repos/asf/incubator-beam/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-beam/commit/659f0b87
Tree: http://git-wip-us.apache.org/repos/asf/incubator-beam/tree/659f0b87
Diff: http://git-wip-us.apache.org/repos/asf/incubator-beam/diff/659f0b87

Branch: refs/heads/master
Commit: 659f0b8779c05e160d7e2c031b18b7bceb00de79
Parents: ef1e32d b0db313
Author: Sela <an...@paypal.com>
Authored: Thu Mar 17 18:58:54 2016 +0200
Committer: Sela <an...@paypal.com>
Committed: Thu Mar 17 18:58:54 2016 +0200

----------------------------------------------------------------------
 runners/spark/README.md | 112 ++++++++++++++++++++++++-------------------
 1 file changed, 63 insertions(+), 49 deletions(-)
----------------------------------------------------------------------