Posted to commits@tinkerpop.apache.org by ok...@apache.org on 2017/11/01 18:07:41 UTC

[01/14] tinkerpop git commit: Changes relative to tp32 to get spark-2.2 on yarn working

Repository: tinkerpop
Updated Branches:
  refs/heads/master f2436691a -> c29e0f6ae


Changes relative to tp32 to get spark-2.2 on yarn working


Project: http://git-wip-us.apache.org/repos/asf/tinkerpop/repo
Commit: http://git-wip-us.apache.org/repos/asf/tinkerpop/commit/a60ac454
Tree: http://git-wip-us.apache.org/repos/asf/tinkerpop/tree/a60ac454
Diff: http://git-wip-us.apache.org/repos/asf/tinkerpop/diff/a60ac454

Branch: refs/heads/master
Commit: a60ac4544f62db5a721d4526780d6ca1f45d993b
Parents: 9127a4f
Author: HadoopMarc <vt...@xs4all.nl>
Authored: Wed Sep 20 08:12:48 2017 +0200
Committer: HadoopMarc <vt...@xs4all.nl>
Committed: Thu Oct 12 21:54:06 2017 +0200

----------------------------------------------------------------------
 docs/src/recipes/olap-spark-yarn.asciidoc | 26 +++++++++++++-------------
 spark-gremlin/pom.xml                     |  2 +-
 2 files changed, 14 insertions(+), 14 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/tinkerpop/blob/a60ac454/docs/src/recipes/olap-spark-yarn.asciidoc
----------------------------------------------------------------------
diff --git a/docs/src/recipes/olap-spark-yarn.asciidoc b/docs/src/recipes/olap-spark-yarn.asciidoc
index fbe9c8f..f909ce0 100644
--- a/docs/src/recipes/olap-spark-yarn.asciidoc
+++ b/docs/src/recipes/olap-spark-yarn.asciidoc
@@ -33,8 +33,8 @@ Most configuration problems of Tinkerpop with Spark on Yarn stem from three reas
 
 1. `SparkGraphComputer` creates its own `SparkContext` so it does not get any configs from the usual `spark-submit` command.
 2. The Tinkerpop Spark-plugin did not include Spark Yarn runtime dependencies until version 3.2.7/3.3.1.
-3. Resolving reason 2 by adding the cluster's `spark-assembly` jar to the classpath creates a host of version
-conflicts, because Spark 1.x dependency versions have remained frozen since 2014.
+3. Resolving reason 2 by adding the cluster's Spark jars to the classpath may create all kinds of version
+conflicts with the Tinkerpop dependencies.
 
 The current recipe follows a minimalist approach in which no dependencies are added to the dependencies
 included in the Tinkerpop binary distribution. The Hadoop cluster's Spark installation is completely ignored. This
@@ -94,14 +94,14 @@ $ . bin/spark-yarn.sh
 ----
 hadoop = System.getenv('HADOOP_HOME')
 hadoopConfDir = System.getenv('HADOOP_CONF_DIR')
-archive = 'spark-gremlin.zip'
-archivePath = "/tmp/$archive"
+archivePath = "/tmp/spark-gremlin.zip"
 ['bash', '-c', "rm $archivePath 2>/dev/null; cd ext/spark-gremlin/lib && zip $archivePath *.jar"].execute()
 conf = new PropertiesConfiguration('conf/hadoop/hadoop-gryo.properties')
-conf.setProperty('spark.master', 'yarn-client')
-conf.setProperty('spark.yarn.dist.archives', "$archivePath")
-conf.setProperty('spark.yarn.appMasterEnv.CLASSPATH', "./$archive/*:$hadoopConfDir")
-conf.setProperty('spark.executor.extraClassPath', "./$archive/*:$hadoopConfDir")
+conf.setProperty('spark.master', 'yarn')
+conf.setProperty('spark.submit.deployMode', 'client')
+conf.setProperty('spark.yarn.archive', "$archivePath")
+conf.setProperty('spark.yarn.appMasterEnv.CLASSPATH', "./__spark_libs__/*:$hadoopConfDir")
+conf.setProperty('spark.executor.extraClassPath', "./__spark_libs__/*:$hadoopConfDir")
 conf.setProperty('spark.driver.extraLibraryPath', "$hadoop/lib/native:$hadoop/lib/native/Linux-amd64-64")
 conf.setProperty('spark.executor.extraLibraryPath', "$hadoop/lib/native:$hadoop/lib/native/Linux-amd64-64")
 graph = GraphFactory.open(conf)
@@ -119,11 +119,11 @@ Explanation
 This recipe does not require running the `bin/hadoop/init-tp-spark.sh` script described in the
 http://tinkerpop.apache.org/docs/current/reference/#sparkgraphcomputer[reference documentation] and thus is also
 valid for cluster users without access permissions to do so.
-Rather, it exploits the `spark.yarn.dist.archives` property, which points to an archive with jars on the local file
+Rather, it exploits the `spark.yarn.archive` property, which points to an archive with jars on the local file
 system and is loaded into the various Yarn containers. As a result the `spark-gremlin.zip` archive becomes available
-as the directory named `spark-gremlin.zip` in the Yarn containers. The `spark.executor.extraClassPath` and
-`spark.yarn.appMasterEnv.CLASSPATH` properties point to the files inside this archive.
-This is why they contain the `./spark-gremlin.zip/*` item. Just because a Spark executor got the archive with
+as the directory named `+__spark_libs__+` in the Yarn containers. The `spark.executor.extraClassPath` and
+`spark.yarn.appMasterEnv.CLASSPATH` properties point to the jars inside this directory.
+This is why they contain the `+./__spark_libs__/*+` item. Just because a Spark executor got the archive with
 jars loaded into its container, does not mean it knows how to access them.
 Also the `HADOOP_GREMLIN_LIBS` mechanism is not used because it can not work for Spark on Yarn as implemented (jars
 added to the `SparkContext` are not available to the Yarn application master).
@@ -141,5 +141,5 @@ as long as you do not use the `spark-submit` or `spark-shell` commands.
 
 You may not like the idea that the Hadoop and Spark jars from the Tinkerpop distribution differ from the versions in
 your cluster. If so, just build Tinkerpop from source with the corresponding dependencies changed in the various `pom.xml`
-files (e.g. `spark-core_2.10-1.6.1-some-vendor.jar` instead of `spark-core_2.10-1.6.1.jar`). Of course, Tinkerpop will
+files (e.g. `spark-core_2.11-2.2.0-some-vendor.jar` instead of `spark-core_2.11-2.2.0.jar`). Of course, Tinkerpop will
 only build for exactly matching or slightly differing artifact versions.
\ No newline at end of file
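
Pulling the scattered hunks above together, a minimal sketch of the YARN-related property set this commit converges on (the archive path and Spark 2.2 property names are assumptions carried over from the recipe text, not part of the commit):

[source]
----
// Editor's sketch, assuming Spark 2.2 and the /tmp/spark-gremlin.zip archive built by the recipe
conf = new PropertiesConfiguration('conf/hadoop/hadoop-gryo.properties')
conf.setProperty('spark.master', 'yarn')                          // 'yarn-client' is the deprecated Spark 1.x spelling
conf.setProperty('spark.submit.deployMode', 'client')             // deploy mode is a separate property since Spark 2.x
conf.setProperty('spark.yarn.archive', '/tmp/spark-gremlin.zip')  // unpacked as __spark_libs__ in every YARN container
hadoopConfDir = System.getenv('HADOOP_CONF_DIR')
// both the application master and the executors need the unpacked jars on their classpath
conf.setProperty('spark.yarn.appMasterEnv.CLASSPATH', "./__spark_libs__/*:$hadoopConfDir")
conf.setProperty('spark.executor.extraClassPath', "./__spark_libs__/*:$hadoopConfDir")
----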

http://git-wip-us.apache.org/repos/asf/tinkerpop/blob/a60ac454/spark-gremlin/pom.xml
----------------------------------------------------------------------
diff --git a/spark-gremlin/pom.xml b/spark-gremlin/pom.xml
index 4dc00d4..82dbe21 100644
--- a/spark-gremlin/pom.xml
+++ b/spark-gremlin/pom.xml
@@ -442,7 +442,7 @@
                     <archive>
                         <manifestEntries>
                             <Gremlin-Plugin-Dependencies>
-                                org.apache.hadoop:hadoop-client:${hadoop.version};org.apache.hadoop:hadoop-yarn-server-web-proxy:${hadoop.version};org.apache.spark:spark-yarn_2.10:${spark.version}
+                                org.apache.hadoop:hadoop-client:${hadoop.version};org.apache.hadoop:hadoop-yarn-server-web-proxy:${hadoop.version};org.apache.spark:spark-yarn_2.11:${spark.version}
                             </Gremlin-Plugin-Dependencies>
                             <!-- deletes the servlet-api jar from the path after install - causes conflicts -->
                             <Gremlin-Plugin-Paths>servlet-api-2.5.jar=</Gremlin-Plugin-Paths>
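
The `spark-yarn_2.10` to `spark-yarn_2.11` rename above has to agree with the Scala binary version of the Spark jars on the classpath. A hedged way to verify this from a running Gremlin Console (assuming the Spark plugin's Scala library is loaded):

[source]
----
// prints something like 'version 2.11.8'; the 2.11 part must match the artifact suffix
println scala.util.Properties.versionString()
----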


[03/14] tinkerpop git commit: Moved gremlin.spark.persistContext back to recipe

Posted by ok...@apache.org.
Moved gremlin.spark.persistContext back to recipe


Project: http://git-wip-us.apache.org/repos/asf/tinkerpop/repo
Commit: http://git-wip-us.apache.org/repos/asf/tinkerpop/commit/df73338f
Tree: http://git-wip-us.apache.org/repos/asf/tinkerpop/tree/df73338f
Diff: http://git-wip-us.apache.org/repos/asf/tinkerpop/diff/df73338f

Branch: refs/heads/master
Commit: df73338fe29a8ba1faa1c873a6ed8ff597607b60
Parents: 16f3ee7
Author: HadoopMarc <vt...@xs4all.nl>
Authored: Tue Oct 3 21:20:16 2017 +0200
Committer: HadoopMarc <vt...@xs4all.nl>
Committed: Thu Oct 12 21:55:28 2017 +0200

----------------------------------------------------------------------
 docs/src/recipes/olap-spark-yarn.asciidoc  | 6 ++++++
 hadoop-gremlin/conf/hadoop-gryo.properties | 2 +-
 2 files changed, 7 insertions(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/tinkerpop/blob/df73338f/docs/src/recipes/olap-spark-yarn.asciidoc
----------------------------------------------------------------------
diff --git a/docs/src/recipes/olap-spark-yarn.asciidoc b/docs/src/recipes/olap-spark-yarn.asciidoc
index 01aedcb..1bcc443 100644
--- a/docs/src/recipes/olap-spark-yarn.asciidoc
+++ b/docs/src/recipes/olap-spark-yarn.asciidoc
@@ -104,6 +104,7 @@ conf.setProperty('spark.yarn.appMasterEnv.CLASSPATH', "./__spark_libs__/*:$hadoo
 conf.setProperty('spark.executor.extraClassPath', "./__spark_libs__/*:$hadoopConfDir")
 conf.setProperty('spark.driver.extraLibraryPath', "$hadoop/lib/native:$hadoop/lib/native/Linux-amd64-64")
 conf.setProperty('spark.executor.extraLibraryPath', "$hadoop/lib/native:$hadoop/lib/native/Linux-amd64-64")
+conf.setProperty('gremlin.spark.persistContext', 'true')
 graph = GraphFactory.open(conf)
 g = graph.traversal().withComputer(SparkGraphComputer)
 g.V().group().by(values('name')).by(both().count())
@@ -125,9 +126,14 @@ as the directory named `+__spark_libs__+` in the Yarn containers. The `spark.exe
 `spark.yarn.appMasterEnv.CLASSPATH` properties point to the jars inside this directory.
 This is why they contain the `+./__spark_libs__/*+` item. Just because a Spark executor got the archive with
 jars loaded into its container, does not mean it knows how to access them.
+
 Also the `HADOOP_GREMLIN_LIBS` mechanism is not used because it can not work for Spark on Yarn as implemented (jars
 added to the `SparkContext` are not available to the Yarn application master).
 
+The `gremlin.spark.persistContext` property is explained in the reference documentation of
+http://tinkerpop.apache.org/docs/current/reference/#sparkgraphcomputer[SparkGraphComputer]: it helps in getting
+follow-up OLAP queries answered faster, because you skip the overhead for getting resources from Yarn.
+
 Additional configuration options
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 This recipe does most of the graph configuration in the gremlin console so that environment variables can be used and

http://git-wip-us.apache.org/repos/asf/tinkerpop/blob/df73338f/hadoop-gremlin/conf/hadoop-gryo.properties
----------------------------------------------------------------------
diff --git a/hadoop-gremlin/conf/hadoop-gryo.properties b/hadoop-gremlin/conf/hadoop-gryo.properties
index ec56abc..c156a98 100644
--- a/hadoop-gremlin/conf/hadoop-gryo.properties
+++ b/hadoop-gremlin/conf/hadoop-gryo.properties
@@ -29,11 +29,11 @@ gremlin.hadoop.outputLocation=output
 ####################################
 spark.master=local[4]
 spark.executor.memory=1g
-gremlin.spark.persistContext=true
 spark.serializer=org.apache.spark.serializer.KryoSerializer
 spark.kryo.registrator=org.apache.tinkerpop.gremlin.spark.structure.io.gryo.GryoRegistrator
 # spark.serializer=org.apache.tinkerpop.gremlin.spark.structure.io.gryo.GryoSerializer (3.2.x model)
 # gremlin.spark.graphStorageLevel=MEMORY_AND_DISK
+# gremlin.spark.persistContext=true
 # gremlin.spark.graphWriter=org.apache.tinkerpop.gremlin.spark.structure.io.PersistedOutputRDD
 # gremlin.spark.persistStorageLevel=DISK_ONLY
 # spark.kryo.registrationRequired=true
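
To illustrate what moving `gremlin.spark.persistContext` back into the recipe buys (a sketch against the `g` defined in the recipe, not part of the commit): with the property set to `true` the `SparkContext` survives the first traversal, so follow-up queries skip the YARN resource negotiation:

[source]
----
// the first traversal pays the full cost of acquiring YARN containers
g.V().group().by(values('name')).by(both().count())
// with gremlin.spark.persistContext=true the SparkContext is kept alive,
// so this second traversal reuses the existing executors and starts immediately
g.V().count()
----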


[11/14] tinkerpop git commit: More consistent capitalization and other text improvements

Posted by ok...@apache.org.
More consistent capitalization and other text improvements


Project: http://git-wip-us.apache.org/repos/asf/tinkerpop/repo
Commit: http://git-wip-us.apache.org/repos/asf/tinkerpop/commit/cd653783
Tree: http://git-wip-us.apache.org/repos/asf/tinkerpop/tree/cd653783
Diff: http://git-wip-us.apache.org/repos/asf/tinkerpop/diff/cd653783

Branch: refs/heads/master
Commit: cd653783df7ce450de033da3caf2d396e7b05a4d
Parents: db859fb
Author: HadoopMarc <vt...@xs4all.nl>
Authored: Mon Oct 16 23:16:22 2017 +0200
Committer: HadoopMarc <vt...@xs4all.nl>
Committed: Thu Oct 19 16:11:57 2017 +0200

----------------------------------------------------------------------
 CHANGELOG.asciidoc                        |  2 +-
 docs/src/recipes/olap-spark-yarn.asciidoc | 58 ++++++++++++++------------
 2 files changed, 32 insertions(+), 28 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/tinkerpop/blob/cd653783/CHANGELOG.asciidoc
----------------------------------------------------------------------
diff --git a/CHANGELOG.asciidoc b/CHANGELOG.asciidoc
index 7bb05ea..572db9e 100644
--- a/CHANGELOG.asciidoc
+++ b/CHANGELOG.asciidoc
@@ -45,7 +45,7 @@ image::https://raw.githubusercontent.com/apache/tinkerpop/master/docs/static/ima
 * Fixed a bug that prevented Gremlin from ordering lists and streams made of mixed number types.
 * Fixed a bug where `keepLabels` were being corrupted because a defensive copy was not being made when they were being set by `PathRetractionStrategy`.
 * Cancel script evaluation timeout in `GremlinExecutor` when script evaluation finished.
-* Added a recipe for OLAP traversals with Spark on Yarn.
+* Added a recipe for OLAP traversals with Spark on YARN.
 * Added `spark-yarn` dependencies to the manifest of `spark-gremlin`.
 
 [[release-3-2-6]]

http://git-wip-us.apache.org/repos/asf/tinkerpop/blob/cd653783/docs/src/recipes/olap-spark-yarn.asciidoc
----------------------------------------------------------------------
diff --git a/docs/src/recipes/olap-spark-yarn.asciidoc b/docs/src/recipes/olap-spark-yarn.asciidoc
index f55edaa..6755e5f 100644
--- a/docs/src/recipes/olap-spark-yarn.asciidoc
+++ b/docs/src/recipes/olap-spark-yarn.asciidoc
@@ -15,24 +15,24 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ////
 [[olap-spark-yarn]]
-OLAP traversals with Spark on Yarn
+OLAP traversals with Spark on YARN
 ----------------------------------
 
-TinkerPop's combination of http://tinkerpop.apache.org/docs/current/reference/#sparkgraphcomputer[SparkGraphComputer]
-and http://tinkerpop.apache.org/docs/current/reference/#_properties_files[HadoopGraph] allows for running
+TinkerPop's combination of http://tinkerpop.apache.org/docs/x.y.z/reference/#sparkgraphcomputer[SparkGraphComputer]
+and http://tinkerpop.apache.org/docs/x.y.z/reference/#_properties_files[HadoopGraph] allows for running
 distributed, analytical graph queries (OLAP) on a computer cluster. The
-http://tinkerpop.apache.org/docs/current/reference/#sparkgraphcomputer[reference documentation] covers the cases
+http://tinkerpop.apache.org/docs/x.y.z/reference/#sparkgraphcomputer[reference documentation] covers the cases
 where Spark runs locally or where the cluster is managed by a Spark server. However, many users can only run OLAP jobs
-via the http://hadoop.apache.org/[Hadoop 2.x] Resource Manager (Yarn), which requires `SparkGraphComputer` to be
+via the http://hadoop.apache.org/[Hadoop 2.x] Resource Manager (YARN), which requires `SparkGraphComputer` to be
 configured differently. This recipe describes this configuration.
 
 Approach
 ~~~~~~~~
 
-Most configuration problems of TinkerPop with Spark on Yarn stem from three reasons:
+Most configuration problems of TinkerPop with Spark on YARN stem from three reasons:
 
 1. `SparkGraphComputer` creates its own `SparkContext` so it does not get any configs from the usual `spark-submit` command.
-2. The TinkerPop Spark plugin did not include Spark on Yarn runtime dependencies until version 3.2.7/3.3.1.
+2. The TinkerPop Spark plugin did not include Spark on YARN runtime dependencies until version 3.2.7/3.3.1.
 3. Resolving reason 2 by adding the cluster's `spark-assembly` jar to the classpath creates a host of version
 conflicts, because Spark 1.x dependency versions have remained frozen since 2014.
 
@@ -50,13 +50,13 @@ If you want to try the recipe on a local Hadoop pseudo-cluster, the easiest way
 it is to look at the install script at https://github.com/apache/tinkerpop/blob/x.y.z/docker/hadoop/install.sh
 and the `start hadoop` section of https://github.com/apache/tinkerpop/blob/x.y.z/docker/scripts/build.sh.
 
-This recipe assumes that you installed the gremlin console with the
-http://tinkerpop.apache.org/docs/x.y.z/reference/#spark-plugin[spark plugin] (the
-http://tinkerpop.apache.org/docs/x.y.z/reference/#hadoop-plugin[hadoop plugin] is optional). Your Hadoop cluster
-may have been configured to use file compression, e.g. lzo compression. If so, you need to copy the relevant
-jar (e.g. `hadoop-lzo-*.jar`) to gremlin console's `ext/spark-gremlin/lib` folder.
+This recipe assumes that you installed the Gremlin Console with the
+http://tinkerpop.apache.org/docs/x.y.z/reference/#spark-plugin[Spark plugin] (the
+http://tinkerpop.apache.org/docs/x.y.z/reference/#hadoop-plugin[Hadoop plugin] is optional). Your Hadoop cluster
+may have been configured to use file compression, e.g. LZO compression. If so, you need to copy the relevant
+jar (e.g. `hadoop-lzo-*.jar`) to Gremlin Console's `ext/spark-gremlin/lib` folder.
 
-For starting the gremlin console in the right environment, create a shell script (e.g. `bin/spark-yarn.sh`) with the
+For starting the Gremlin Console in the right environment, create a shell script (e.g. `bin/spark-yarn.sh`) with the
 contents below. Of course, actual values for `GREMLIN_HOME`, `HADOOP_HOME` and `HADOOP_CONF_DIR` need to be adapted to
 your particular environment.
 
@@ -82,7 +82,7 @@ bin/gremlin.sh
 Running the job
 ~~~~~~~~~~~~~~~
 
-You can now run a gremlin OLAP query with Spark on Yarn:
+You can now run a gremlin OLAP query with Spark on YARN:
 
 [source]
 ----
@@ -110,39 +110,43 @@ g = graph.traversal().withComputer(SparkGraphComputer)
 g.V().group().by(values('name')).by(both().count())
 ----
 
-If you run into exceptions, the best way to see what is going on is to look into the Yarn Resource Manager UI
-(e.g. http://rm.your.domain:8088/cluster) to find the `applicationId` and get the logs using
-`yarn logs -applicationId application_1498627870374_0008` from the command shell.
+If you run into exceptions, you will have to dig into the logs. You can do this from the command line with
+`yarn application -list -appStates ALL` to find the `applicationId`, while the logs are available with
+`yarn logs -applicationId application_1498627870374_0008`. Alternatively, you can inspect the logs via
+the YARN Resource Manager UI (e.g. http://rm.your.domain:8088/cluster), provided that YARN was configured with the
+`yarn.log-aggregation-enable` property set to `true`. See the Spark documentation for
+https://spark.apache.org/docs/latest/running-on-yarn.html#debugging-your-application[additional hints].
 
 Explanation
 ~~~~~~~~~~~
 
 This recipe does not require running the `bin/hadoop/init-tp-spark.sh` script described in the
-http://tinkerpop.apache.org/docs/current/reference/#sparkgraphcomputer[reference documentation] and thus is also
+http://tinkerpop.apache.org/docs/x.y.z/reference/#sparkgraphcomputer[reference documentation] and thus is also
 valid for cluster users without access permissions to do so.
 Rather, it exploits the `spark.yarn.dist.archives` property, which points to an archive with jars on the local file
-system and is loaded into the various Yarn containers. As a result the `spark-gremlin.zip` archive becomes available
-as the directory named `spark-gremlin.zip` in the Yarn containers. The `spark.executor.extraClassPath` and
+system and is loaded into the various YARN containers. As a result the `spark-gremlin.zip` archive becomes available
+as the directory named `spark-gremlin.zip` in the YARN containers. The `spark.executor.extraClassPath` and
 `spark.yarn.appMasterEnv.CLASSPATH` properties point to the files inside this archive.
 This is why they contain the `./spark-gremlin.zip/*` item. Just because a Spark executor got the archive with
 jars loaded into its container, does not mean it knows how to access them.
 
-Also the `HADOOP_GREMLIN_LIBS` mechanism is not used because it can not work for Spark on Yarn as implemented (jars
-added to the `SparkContext` are not available to the Yarn application master).
+Also the `HADOOP_GREMLIN_LIBS` mechanism is not used because it can not work for Spark on YARN as implemented (jars
+added to the `SparkContext` are not available to the YARN application master).
 
 The `gremlin.spark.persistContext` property is explained in the reference documentation of
-http://tinkerpop.apache.org/docs/current/reference/#sparkgraphcomputer[SparkGraphComputer]: it helps in getting
-follow-up OLAP queries answered faster, because you skip the overhead for getting resources from Yarn.
+http://tinkerpop.apache.org/docs/x.y.z/reference/#sparkgraphcomputer[SparkGraphComputer]: it helps in getting
+follow-up OLAP queries answered faster, because you skip the overhead for getting resources from YARN.
 
 Additional configuration options
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-This recipe does most of the graph configuration in the gremlin console so that environment variables can be used and
+This recipe does most of the graph configuration in the Gremlin Console so that environment variables can be used and
 the chance of configuration mistakes is minimal. Once you have your setup working, it is probably easier to make a copy
 of the `conf/hadoop/hadoop-gryo.properties` file and put the property values specific to your environment there. This is
 also the right moment to take a look at the `spark-defaults.xml` file of your cluster, in particular the settings for
-the Spark History Service, which allows you to access logs of finished jobs via the Yarn resource manager UI.
+the https://spark.apache.org/docs/latest/monitoring.html[Spark History Service], which allows you to access logs of
+finished applications via the YARN resource manager UI.
 
-This recipe uses the gremlin console, but things should not be very different for your own JVM-based application,
+This recipe uses the Gremlin Console, but things should not be very different for your own JVM-based application,
 as long as you do not use the `spark-submit` or `spark-shell` commands. You will also want to check the additional
 runtime dependencies listed in the `Gremlin-Plugin-Dependencies` section of the manifest file in the `spark-gremlin`
 jar.
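
The debugging paragraph above can also be followed without leaving the Gremlin Console; a sketch using Groovy's process support (the `applicationId` is the hypothetical one from the text):

[source]
----
// list all applications, including finished ones, to find the applicationId
println(['bash', '-c', 'yarn application -list -appStates ALL'].execute().text)
// fetch the aggregated logs (requires yarn.log-aggregation-enable=true on the cluster)
println(['bash', '-c', 'yarn logs -applicationId application_1498627870374_0008'].execute().text)
----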


[09/14] tinkerpop git commit: Added spark-yarn recipe and missing manifest items in spark-gremlin

Posted by ok...@apache.org.
Added spark-yarn recipe and missing manifest items in spark-gremlin


Project: http://git-wip-us.apache.org/repos/asf/tinkerpop/repo
Commit: http://git-wip-us.apache.org/repos/asf/tinkerpop/commit/3396e924
Tree: http://git-wip-us.apache.org/repos/asf/tinkerpop/tree/3396e924
Diff: http://git-wip-us.apache.org/repos/asf/tinkerpop/diff/3396e924

Branch: refs/heads/master
Commit: 3396e924243845204de0f47962b58a3ffef87459
Parents: 19e261c
Author: HadoopMarc <vt...@xs4all.nl>
Authored: Sun Sep 10 14:45:45 2017 +0200
Committer: HadoopMarc <vt...@xs4all.nl>
Committed: Thu Oct 19 16:11:57 2017 +0200

----------------------------------------------------------------------
 docs/preprocessor/preprocess-file.sh       |   2 +-
 docs/src/recipes/index.asciidoc            |   2 +
 docs/src/recipes/olap-spark-yarn.asciidoc  | 145 ++++++++++++++++++++++++
 hadoop-gremlin/conf/hadoop-gryo.properties |   2 +-
 pom.xml                                    |   1 +
 spark-gremlin/pom.xml                      |   5 +-
 6 files changed, 153 insertions(+), 4 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/tinkerpop/blob/3396e924/docs/preprocessor/preprocess-file.sh
----------------------------------------------------------------------
diff --git a/docs/preprocessor/preprocess-file.sh b/docs/preprocessor/preprocess-file.sh
index 16612fe..0ca534a 100755
--- a/docs/preprocessor/preprocess-file.sh
+++ b/docs/preprocessor/preprocess-file.sh
@@ -107,7 +107,7 @@ if [ ! ${SKIP} ] && [ $(grep -c '^\[gremlin' ${input}) -gt 0 ]; then
       mv ext/spark-gremlin .ext/
       cat ext/plugins.txt | tee .ext/plugins.all | grep -Fv 'SparkGremlinPlugin' > .ext/plugins.txt
       ;;
-    "implementations-hadoop-start" | "implementations-hadoop-end" | "implementations-spark" | "implementations-giraph")
+    "implementations-hadoop-start" | "implementations-hadoop-end" | "implementations-spark" | "implementations-giraph" | "olap-spark-yarn")
       # deactivate Neo4j plugin to prevent version conflicts between TinkerPop's Spark jars and Neo4j's Spark jars
       mkdir .ext
       mv ext/neo4j-gremlin .ext/

http://git-wip-us.apache.org/repos/asf/tinkerpop/blob/3396e924/docs/src/recipes/index.asciidoc
----------------------------------------------------------------------
diff --git a/docs/src/recipes/index.asciidoc b/docs/src/recipes/index.asciidoc
index f549b1f..bb88301 100644
--- a/docs/src/recipes/index.asciidoc
+++ b/docs/src/recipes/index.asciidoc
@@ -58,6 +58,8 @@ include::traversal-induced-values.asciidoc[]
 
 include::tree.asciidoc[]
 
+include::olap-spark-yarn.asciidoc[]
+
 = Implementation Recipes
 
 include::style-guide.asciidoc[]

http://git-wip-us.apache.org/repos/asf/tinkerpop/blob/3396e924/docs/src/recipes/olap-spark-yarn.asciidoc
----------------------------------------------------------------------
diff --git a/docs/src/recipes/olap-spark-yarn.asciidoc b/docs/src/recipes/olap-spark-yarn.asciidoc
new file mode 100644
index 0000000..fbe9c8f
--- /dev/null
+++ b/docs/src/recipes/olap-spark-yarn.asciidoc
@@ -0,0 +1,145 @@
+////
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+////
+[[olap-spark-yarn]]
+OLAP traversals with Spark on Yarn
+----------------------------------
+
+Tinkerpop's combination of http://tinkerpop.apache.org/docs/current/reference/#sparkgraphcomputer[SparkGraphComputer]
+and http://tinkerpop.apache.org/docs/current/reference/#_properties_files[HadoopGraph] allows for running
+distributed, analytical graph queries (OLAP) on a computer cluster. The
+http://tinkerpop.apache.org/docs/current/reference/#sparkgraphcomputer[reference documentation] covers the cases
+where Spark runs locally or where the cluster is managed by a Spark server. However, many users can only run OLAP jobs
+via the http://hadoop.apache.org/[Hadoop 2.x] Resource Manager (Yarn), which requires `SparkGraphComputer` to be
+configured differently. This recipe describes this configuration.
+
+Approach
+~~~~~~~~
+
+Most configuration problems of Tinkerpop with Spark on Yarn stem from three reasons:
+
+1. `SparkGraphComputer` creates its own `SparkContext` so it does not get any configs from the usual `spark-submit` command.
+2. The Tinkerpop Spark-plugin did not include Spark Yarn runtime dependencies until version 3.2.7/3.3.1.
+3. Resolving reason 2 by adding the cluster's `spark-assembly` jar to the classpath creates a host of version
+conflicts, because Spark 1.x dependency versions have remained frozen since 2014.
+
+The current recipe follows a minimalist approach in which no dependencies are added to the dependencies
+included in the Tinkerpop binary distribution. The Hadoop cluster's Spark installation is completely ignored. This
+approach minimizes the chance of dependency version conflicts.
+
+Prerequisites
+~~~~~~~~~~~~~
+This recipe is suitable for both a real external and a local pseudo Hadoop cluster. While the recipe is maintained
+for the vanilla Hadoop pseudo-cluster, it has been reported to work on real clusters with Hadoop distributions
+from various vendors.
+
+If you want to try the recipe on a local Hadoop pseudo-cluster, the easiest way to install
+it is to look at the install script at https://github.com/apache/tinkerpop/blob/x.y.z/docker/hadoop/install.sh
+and the `start hadoop` section of https://github.com/apache/tinkerpop/blob/x.y.z/docker/scripts/build.sh.
+
+This recipe assumes that you installed the gremlin console with the
+http://tinkerpop.apache.org/docs/x.y.z/reference/#spark-plugin[spark plugin] (the
+http://tinkerpop.apache.org/docs/x.y.z/reference/#hadoop-plugin[hadoop plugin] is optional). Your Hadoop cluster
+may have been configured to use file compression, e.g. lzo compression. If so, you need to copy the relevant
+jar (e.g. `hadoop-lzo-*.jar`) to gremlin console's `ext/spark-gremlin/lib` folder.
+
+For starting the gremlin console in the right environment, create a shell script (e.g. `bin/spark-yarn.sh`) with the
+contents below. Of course, actual values for `GREMLIN_HOME`, `HADOOP_HOME` and `HADOOP_CONF_DIR` need to be adapted to
+your particular environment.
+
+[source]
+----
+#!/bin/bash
+# Variables to be adapted to the actual environment
+GREMLIN_HOME=/home/yourdir/lib/apache-tinkerpop-gremlin-console-x.y.z-standalone
+export HADOOP_HOME=/usr/local/lib/hadoop-2.7.2
+export HADOOP_CONF_DIR=/usr/local/lib/hadoop-2.7.2/etc/hadoop
+
+# Have Tinkerpop find the hadoop cluster configs and hadoop native libraries
+export CLASSPATH=$HADOOP_CONF_DIR
+export JAVA_OPTIONS="-Djava.library.path=$HADOOP_HOME/lib/native:$HADOOP_HOME/lib/native/Linux-amd64-64"
+
+# Start gremlin-console without getting the HADOOP_GREMLIN_LIBS warning
+cd $GREMLIN_HOME
+[ ! -e empty ] && mkdir empty
+export HADOOP_GREMLIN_LIBS=$GREMLIN_HOME/empty
+bin/gremlin.sh
+----
+
+Running the job
+~~~~~~~~~~~~~~~
+
+You can now run a gremlin OLAP query with Spark on Yarn:
+
+[source]
+----
+$ hdfs dfs -put data/tinkerpop-modern.kryo .
+$ . bin/spark-yarn.sh
+----
+
+[gremlin-groovy]
+----
+hadoop = System.getenv('HADOOP_HOME')
+hadoopConfDir = System.getenv('HADOOP_CONF_DIR')
+archive = 'spark-gremlin.zip'
+archivePath = "/tmp/$archive"
+['bash', '-c', "rm $archivePath 2>/dev/null; cd ext/spark-gremlin/lib && zip $archivePath *.jar"].execute()
+conf = new PropertiesConfiguration('conf/hadoop/hadoop-gryo.properties')
+conf.setProperty('spark.master', 'yarn-client')
+conf.setProperty('spark.yarn.dist.archives', "$archivePath")
+conf.setProperty('spark.yarn.appMasterEnv.CLASSPATH', "./$archive/*:$hadoopConfDir")
+conf.setProperty('spark.executor.extraClassPath', "./$archive/*:$hadoopConfDir")
+conf.setProperty('spark.driver.extraLibraryPath', "$hadoop/lib/native:$hadoop/lib/native/Linux-amd64-64")
+conf.setProperty('spark.executor.extraLibraryPath', "$hadoop/lib/native:$hadoop/lib/native/Linux-amd64-64")
+graph = GraphFactory.open(conf)
+g = graph.traversal().withComputer(SparkGraphComputer)
+g.V().group().by(values('name')).by(both().count())
+----
+
+If you run into exceptions, the best way to see what is going on is to look into the Yarn Resource Manager UI
+(e.g. http://rm.your.domain:8088/cluster) to find the `applicationId` and get the logs using
+`yarn logs -applicationId application_1498627870374_0008` from the command shell.
+
+Explanation
+~~~~~~~~~~~
+
+This recipe does not require running the `bin/hadoop/init-tp-spark.sh` script described in the
+http://tinkerpop.apache.org/docs/current/reference/#sparkgraphcomputer[reference documentation] and thus is also
+valid for cluster users without access permissions to do so.
+Rather, it exploits the `spark.yarn.dist.archives` property, which points to an archive with jars on the local file
+system and is loaded into the various Yarn containers. As a result the `spark-gremlin.zip` archive becomes available
+as the directory named `spark-gremlin.zip` in the Yarn containers. The `spark.executor.extraClassPath` and
+`spark.yarn.appMasterEnv.CLASSPATH` properties point to the files inside this archive.
+This is why they contain the `./spark-gremlin.zip/*` item. Just because a Spark executor got the archive with
+jars loaded into its container, does not mean it knows how to access them.
+Also the `HADOOP_GREMLIN_LIBS` mechanism is not used because it can not work for Spark on Yarn as implemented (jars
+added to the `SparkContext` are not available to the Yarn application master).
+
+Additional configuration options
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+This recipe does most of the graph configuration in the gremlin console so that environment variables can be used and
+the chance of configuration mistakes is minimal. Once you have your setup working, it is probably easier to make a copy
+of the `conf/hadoop/hadoop-gryo.properties` file and put the property values specific to your environment there. This is
+also the right moment to take a look at the `spark-defaults.xml` file of your cluster, in particular the settings for
+the Spark History Service, which allows you to access logs of finished jobs via the Yarn resource manager UI.
+
+This recipe uses the gremlin console, but things should not be very different for your own JVM-based application,
+as long as you do not use the `spark-submit` or `spark-shell` commands.
+
+You may not like the idea that the Hadoop and Spark jars from the Tinkerpop distribution differ from the versions in
+your cluster. If so, just build Tinkerpop from source with the corresponding dependencies changed in the various `pom.xml`
+files (e.g. `spark-core_2.10-1.6.1-some-vendor.jar` instead of `spark-core_2.10-1.6.1.jar`). Of course, Tinkerpop will
+only build for exactly matching or slightly differing artifact versions.
\ No newline at end of file
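
A quick, hedged sanity check (not part of the commit) that the `zip` step in the recipe actually produced an archive containing the plugin jars before YARN distributes it:

[source]
----
// list the first entries of the archive built by the recipe; expect spark-*, hadoop-* and gremlin-* jars
println(['bash', '-c', 'unzip -l /tmp/spark-gremlin.zip | head -n 20'].execute().text)
----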

http://git-wip-us.apache.org/repos/asf/tinkerpop/blob/3396e924/hadoop-gremlin/conf/hadoop-gryo.properties
----------------------------------------------------------------------
diff --git a/hadoop-gremlin/conf/hadoop-gryo.properties b/hadoop-gremlin/conf/hadoop-gryo.properties
index aaab24d..7990431 100644
--- a/hadoop-gremlin/conf/hadoop-gryo.properties
+++ b/hadoop-gremlin/conf/hadoop-gryo.properties
@@ -29,8 +29,8 @@ gremlin.hadoop.outputLocation=output
 spark.master=local[4]
 spark.executor.memory=1g
 spark.serializer=org.apache.tinkerpop.gremlin.spark.structure.io.gryo.GryoSerializer
+gremlin.spark.persistContext=true
 # gremlin.spark.graphStorageLevel=MEMORY_AND_DISK
-# gremlin.spark.persistContext=true
 # gremlin.spark.graphWriter=org.apache.tinkerpop.gremlin.spark.structure.io.PersistedOutputRDD
 # gremlin.spark.persistStorageLevel=DISK_ONLY
 # spark.kryo.registrationRequired=true

http://git-wip-us.apache.org/repos/asf/tinkerpop/blob/3396e924/pom.xml
----------------------------------------------------------------------
diff --git a/pom.xml b/pom.xml
index 867aaf4..5a93109 100644
--- a/pom.xml
+++ b/pom.xml
@@ -149,6 +149,7 @@ limitations under the License.
         <netty.version>4.0.50.Final</netty.version>
         <slf4j.version>1.7.21</slf4j.version>
         <snakeyaml.version>1.15</snakeyaml.version>
+        <spark.version>1.6.1</spark.version>
 
         <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
         <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>

http://git-wip-us.apache.org/repos/asf/tinkerpop/blob/3396e924/spark-gremlin/pom.xml
----------------------------------------------------------------------
diff --git a/spark-gremlin/pom.xml b/spark-gremlin/pom.xml
index 560e236..77a455b 100644
--- a/spark-gremlin/pom.xml
+++ b/spark-gremlin/pom.xml
@@ -104,7 +104,7 @@
         <dependency>
             <groupId>org.apache.spark</groupId>
             <artifactId>spark-core_2.10</artifactId>
-            <version>1.6.1</version>
+            <version>${spark.version}</version>
             <exclusions>
                 <!-- self conflicts -->
                 <exclusion>
@@ -382,7 +382,8 @@
                 <configuration>
                     <archive>
                         <manifestEntries>
-                            <Gremlin-Plugin-Dependencies>org.apache.hadoop:hadoop-client:2.7.2
+                            <Gremlin-Plugin-Dependencies>
+                                org.apache.hadoop:hadoop-client:${hadoop.version};org.apache.hadoop:hadoop-yarn-server-web-proxy:${hadoop.version};org.apache.spark:spark-yarn_2.10:${spark.version}
                             </Gremlin-Plugin-Dependencies>
                             <!-- deletes the servlet-api jar from the path after install - causes conflicts -->
                             <Gremlin-Plugin-Paths>servlet-api-2.5.jar=</Gremlin-Plugin-Paths>


[04/14] tinkerpop git commit: Corrected TinkerPop naming, added changelog and pointer to manifest

Posted by ok...@apache.org.
Corrected TinkerPop naming, added changelog and pointer to manifest


Project: http://git-wip-us.apache.org/repos/asf/tinkerpop/repo
Commit: http://git-wip-us.apache.org/repos/asf/tinkerpop/commit/16f3ee7e
Tree: http://git-wip-us.apache.org/repos/asf/tinkerpop/tree/16f3ee7e
Diff: http://git-wip-us.apache.org/repos/asf/tinkerpop/diff/16f3ee7e

Branch: refs/heads/master
Commit: 16f3ee7e765e05a344be62774c14617b90f17585
Parents: a60ac45
Author: HadoopMarc <vt...@xs4all.nl>
Authored: Sun Oct 1 16:38:42 2017 +0200
Committer: HadoopMarc <vt...@xs4all.nl>
Committed: Thu Oct 12 21:55:28 2017 +0200

----------------------------------------------------------------------
 docs/src/recipes/olap-spark-yarn.asciidoc | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/tinkerpop/blob/16f3ee7e/docs/src/recipes/olap-spark-yarn.asciidoc
----------------------------------------------------------------------
diff --git a/docs/src/recipes/olap-spark-yarn.asciidoc b/docs/src/recipes/olap-spark-yarn.asciidoc
index f909ce0..01aedcb 100644
--- a/docs/src/recipes/olap-spark-yarn.asciidoc
+++ b/docs/src/recipes/olap-spark-yarn.asciidoc
@@ -18,7 +18,7 @@ limitations under the License.
 OLAP traversals with Spark on Yarn
 ----------------------------------
 
-Tinkerpop's combination of http://tinkerpop.apache.org/docs/current/reference/#sparkgraphcomputer[SparkGraphComputer]
+TinkerPop's combination of http://tinkerpop.apache.org/docs/current/reference/#sparkgraphcomputer[SparkGraphComputer]
 and http://tinkerpop.apache.org/docs/current/reference/#_properties_files[HadoopGraph] allows for running
 distributed, analytical graph queries (OLAP) on a computer cluster. The
 http://tinkerpop.apache.org/docs/current/reference/#sparkgraphcomputer[reference documentation] covers the cases
@@ -29,15 +29,15 @@ configured differently. This recipe describes this configuration.
 Approach
 ~~~~~~~~
 
-Most configuration problems of Tinkerpop with Spark on Yarn stem from three reasons:
+Most configuration problems of TinkerPop with Spark on Yarn stem from three reasons:
 
 1. `SparkGraphComputer` creates its own `SparkContext` so it does not get any configs from the usual `spark-submit` command.
-2. The Tinkerpop Spark-plugin did not include Spark Yarn runtime dependencies until version 3.2.7/3.3.1.
+2. The TinkerPop Spark plugin did not include Spark on Yarn runtime dependencies until version 3.2.7/3.3.1.
 3. Resolving reason 2 by adding the cluster's Spark jars to the classpath may create all kinds of version
 conflicts with the Tinkerpop dependencies.
 
 The current recipe follows a minimalist approach in which no dependencies are added to the dependencies
-included in the Tinkerpop binary distribution. The Hadoop cluster's Spark installation is completely ignored. This
+included in the TinkerPop binary distribution. The Hadoop cluster's Spark installation is completely ignored. This
 approach minimizes the chance of dependency version conflicts.
 
 Prerequisites
@@ -68,7 +68,7 @@ GREMLIN_HOME=/home/yourdir/lib/apache-tinkerpop-gremlin-console-x.y.z-standalone
 export HADOOP_HOME=/usr/local/lib/hadoop-2.7.2
 export HADOOP_CONF_DIR=/usr/local/lib/hadoop-2.7.2/etc/hadoop
 
-# Have Tinkerpop find the hadoop cluster configs and hadoop native libraries
+# Have TinkerPop find the hadoop cluster configs and hadoop native libraries
 export CLASSPATH=$HADOOP_CONF_DIR
 export JAVA_OPTIONS="-Djava.library.path=$HADOOP_HOME/lib/native:$HADOOP_HOME/lib/native/Linux-amd64-64"
 
@@ -137,9 +137,11 @@ also the right moment to take a look at the `spark-defaults.xml` file of your cl
 the Spark History Service, which allows you to access logs of finished jobs via the Yarn resource manager UI.
 
 This recipe uses the gremlin console, but things should not be very different for your own JVM-based application,
-as long as you do not use the `spark-submit` or `spark-shell` commands.
+as long as you do not use the `spark-submit` or `spark-shell` commands. You will also want to check the additional
+runtime dependencies listed in the `Gremlin-Plugin-Dependencies` section of the manifest file in the `spark-gremlin`
+jar.
 
 You may not like the idea that the Hadoop and Spark jars from the Tinkerpop distribution differ from the versions in
-your cluster. If so, just build Tinkerpop from source with the corresponding dependencies changed in the various `pom.xml`
-files (e.g. `spark-core_2.11-2.2.0-some-vendor.jar` instead of `spark-core_2.11-2.2.0.jar`). Of course, Tinkerpop will
+your cluster. If so, just build TinkerPop from source with the corresponding dependencies changed in the various `pom.xml`
+files (e.g. `spark-core_2.11-2.2.0-some-vendor.jar` instead of `spark-core_2.11-2.2.0.jar`). Of course, TinkerPop will
 only build for exactly matching or slightly differing artifact versions.
\ No newline at end of file
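
To inspect the `Gremlin-Plugin-Dependencies` manifest entry mentioned above without unpacking the jar, something along these lines works from the Gremlin Console (the jar path is a placeholder; substitute your actual version):

[source]
----
import java.util.jar.JarFile
// read the manifest of the installed spark-gremlin jar (path is illustrative)
jar = new JarFile('ext/spark-gremlin/lib/spark-gremlin-x.y.z.jar')
println jar.manifest.mainAttributes.getValue('Gremlin-Plugin-Dependencies')
----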


[12/14] tinkerpop git commit: Merge branch 'spark-yarn-recipe-tp32' of https://github.com/vtslab/incubator-tinkerpop into tp32

Posted by ok...@apache.org.
Merge branch 'spark-yarn-recipe-tp32' of https://github.com/vtslab/incubator-tinkerpop into tp32


Project: http://git-wip-us.apache.org/repos/asf/tinkerpop/repo
Commit: http://git-wip-us.apache.org/repos/asf/tinkerpop/commit/a451ca56
Tree: http://git-wip-us.apache.org/repos/asf/tinkerpop/tree/a451ca56
Diff: http://git-wip-us.apache.org/repos/asf/tinkerpop/diff/a451ca56

Branch: refs/heads/master
Commit: a451ca56a9c75ab9c1d4d2b9c2523a4004193aa0
Parents: bef43d6 6110354
Author: Marko A. Rodriguez <ok...@gmail.com>
Authored: Wed Nov 1 12:02:41 2017 -0600
Committer: Marko A. Rodriguez <ok...@gmail.com>
Committed: Wed Nov 1 12:02:41 2017 -0600

----------------------------------------------------------------------
 CHANGELOG.asciidoc                        |   2 +
 docs/preprocessor/preprocess-file.sh      |   2 +-
 docs/src/recipes/index.asciidoc           |   2 +
 docs/src/recipes/olap-spark-yarn.asciidoc | 157 +++++++++++++++++++++++++
 pom.xml                                   |   1 +
 spark-gremlin/pom.xml                     |   5 +-
 6 files changed, 166 insertions(+), 3 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/tinkerpop/blob/a451ca56/CHANGELOG.asciidoc
----------------------------------------------------------------------


[05/14] tinkerpop git commit: More consistent capitalization and other text improvements

Posted by ok...@apache.org.
More consistent capitalization and other text improvements


Project: http://git-wip-us.apache.org/repos/asf/tinkerpop/repo
Commit: http://git-wip-us.apache.org/repos/asf/tinkerpop/commit/e30cb7d8
Tree: http://git-wip-us.apache.org/repos/asf/tinkerpop/tree/e30cb7d8
Diff: http://git-wip-us.apache.org/repos/asf/tinkerpop/diff/e30cb7d8

Branch: refs/heads/master
Commit: e30cb7d84c6b66e3cee0fecda70bc826edb59667
Parents: df73338
Author: HadoopMarc <vt...@xs4all.nl>
Authored: Mon Oct 16 23:16:22 2017 +0200
Committer: HadoopMarc <vt...@xs4all.nl>
Committed: Tue Oct 17 09:45:26 2017 +0200

----------------------------------------------------------------------
 docs/src/recipes/olap-spark-yarn.asciidoc | 59 ++++++++++++++------------
 1 file changed, 32 insertions(+), 27 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/tinkerpop/blob/e30cb7d8/docs/src/recipes/olap-spark-yarn.asciidoc
----------------------------------------------------------------------
diff --git a/docs/src/recipes/olap-spark-yarn.asciidoc b/docs/src/recipes/olap-spark-yarn.asciidoc
index 1bcc443..88f3885 100644
--- a/docs/src/recipes/olap-spark-yarn.asciidoc
+++ b/docs/src/recipes/olap-spark-yarn.asciidoc
@@ -15,24 +15,24 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ////
 [[olap-spark-yarn]]
-OLAP traversals with Spark on Yarn
+OLAP traversals with Spark on YARN
 ----------------------------------
 
-TinkerPop's combination of http://tinkerpop.apache.org/docs/current/reference/#sparkgraphcomputer[SparkGraphComputer]
-and http://tinkerpop.apache.org/docs/current/reference/#_properties_files[HadoopGraph] allows for running
+TinkerPop's combination of http://tinkerpop.apache.org/docs/x.y.z/reference/#sparkgraphcomputer[SparkGraphComputer]
+and http://tinkerpop.apache.org/docs/x.y.z/reference/#_properties_files[HadoopGraph] allows for running
 distributed, analytical graph queries (OLAP) on a computer cluster. The
-http://tinkerpop.apache.org/docs/current/reference/#sparkgraphcomputer[reference documentation] covers the cases
+http://tinkerpop.apache.org/docs/x.y.z/reference/#sparkgraphcomputer[reference documentation] covers the cases
 where Spark runs locally or where the cluster is managed by a Spark server. However, many users can only run OLAP jobs
-via the http://hadoop.apache.org/[Hadoop 2.x] Resource Manager (Yarn), which requires `SparkGraphComputer` to be
+via the http://hadoop.apache.org/[Hadoop 2.x] Resource Manager (YARN), which requires `SparkGraphComputer` to be
 configured differently. This recipe describes this configuration.
 
 Approach
 ~~~~~~~~
 
-Most configuration problems of TinkerPop with Spark on Yarn stem from three reasons:
+Most configuration problems of TinkerPop with Spark on YARN stem from three reasons:
 
 1. `SparkGraphComputer` creates its own `SparkContext` so it does not get any configs from the usual `spark-submit` command.
-2. The TinkerPop Spark plugin did not include Spark on Yarn runtime dependencies until version 3.2.7/3.3.1.
+2. The TinkerPop Spark plugin did not include Spark on YARN runtime dependencies until version 3.2.7/3.3.1.
 3. Resolving reason 2 by adding the cluster's Spark jars to the classpath may create all kinds of version
 conflicts with the Tinkerpop dependencies.
 
@@ -50,13 +50,13 @@ If you want to try the recipe on a local Hadoop pseudo-cluster, the easiest way
 it is to look at the install script at https://github.com/apache/tinkerpop/blob/x.y.z/docker/hadoop/install.sh
 and the `start hadoop` section of https://github.com/apache/tinkerpop/blob/x.y.z/docker/scripts/build.sh.
 
-This recipe assumes that you installed the gremlin console with the
-http://tinkerpop.apache.org/docs/x.y.z/reference/#spark-plugin[spark plugin] (the
-http://tinkerpop.apache.org/docs/x.y.z/reference/#hadoop-plugin[hadoop plugin] is optional). Your Hadoop cluster
-may have been configured to use file compression, e.g. lzo compression. If so, you need to copy the relevant
-jar (e.g. `hadoop-lzo-*.jar`) to gremlin console's `ext/spark-gremlin/lib` folder.
+This recipe assumes that you installed the Gremlin Console with the
+http://tinkerpop.apache.org/docs/x.y.z/reference/#spark-plugin[Spark plugin] (the
+http://tinkerpop.apache.org/docs/x.y.z/reference/#hadoop-plugin[Hadoop plugin] is optional). Your Hadoop cluster
+may have been configured to use file compression, e.g. LZO compression. If so, you need to copy the relevant
+jar (e.g. `hadoop-lzo-*.jar`) to Gremlin Console's `ext/spark-gremlin/lib` folder.
 
-For starting the gremlin console in the right environment, create a shell script (e.g. `bin/spark-yarn.sh`) with the
+For starting the Gremlin Console in the right environment, create a shell script (e.g. `bin/spark-yarn.sh`) with the
 contents below. Of course, actual values for `GREMLIN_HOME`, `HADOOP_HOME` and `HADOOP_CONF_DIR` need to be adapted to
 your particular environment.
 
@@ -82,7 +82,7 @@ bin/gremlin.sh
 Running the job
 ~~~~~~~~~~~~~~~
 
-You can now run a gremlin OLAP query with Spark on Yarn:
+You can now run a gremlin OLAP query with Spark on YARN:
 
 [source]
 ----
@@ -110,39 +110,44 @@ g = graph.traversal().withComputer(SparkGraphComputer)
 g.V().group().by(values('name')).by(both().count())
 ----
 
-If you run into exceptions, the best way to see what is going on is to look into the Yarn Resource Manager UI
-(e.g. http://rm.your.domain:8088/cluster) to find the `applicationId` and get the logs using
-`yarn logs -applicationId application_1498627870374_0008` from the command shell.
+If you run into exceptions, you will have to dig into the logs. You can do this from the command line with
+`yarn application -list -appStates ALL` to find the `applicationId`, while the logs are available with
+`yarn logs -applicationId application_1498627870374_0008`. Alternatively, you can inspect the logs via
+the YARN Resource Manager UI (e.g. http://rm.your.domain:8088/cluster), provided that YARN was configured with the
+`yarn.log-aggregation-enable` property set to `true`. See the Spark documentation for
+https://spark.apache.org/docs/latest/running-on-yarn.html#debugging-your-application[additional hints].
 
 Explanation
 ~~~~~~~~~~~
 
 This recipe does not require running the `bin/hadoop/init-tp-spark.sh` script described in the
-http://tinkerpop.apache.org/docs/current/reference/#sparkgraphcomputer[reference documentation] and thus is also
+http://tinkerpop.apache.org/docs/x.y.z/reference/#sparkgraphcomputer[reference documentation] and thus is also
 valid for cluster users without access permissions to do so.
+
 Rather, it exploits the `spark.yarn.archive` property, which points to an archive with jars on the local file
-system and is loaded into the various Yarn containers. As a result the `spark-gremlin.zip` archive becomes available
-as the directory named `+__spark_libs__+` in the Yarn containers. The `spark.executor.extraClassPath` and
+system and is loaded into the various YARN containers. As a result the `spark-gremlin.zip` archive becomes available
+as the directory named `+__spark_libs__+` in the YARN containers. The `spark.executor.extraClassPath` and
 `spark.yarn.appMasterEnv.CLASSPATH` properties point to the jars inside this directory.
 This is why they contain the `+./__spark_libs__/*+` item. Just because a Spark executor got the archive with
 jars loaded into its container, does not mean it knows how to access them.
 
-Also the `HADOOP_GREMLIN_LIBS` mechanism is not used because it can not work for Spark on Yarn as implemented (jars
-added to the `SparkContext` are not available to the Yarn application master).
+Also the `HADOOP_GREMLIN_LIBS` mechanism is not used because it can not work for Spark on YARN as implemented (jars
+added to the `SparkContext` are not available to the YARN application master).
 
 The `gremlin.spark.persistContext` property is explained in the reference documentation of
-http://tinkerpop.apache.org/docs/current/reference/#sparkgraphcomputer[SparkGraphComputer]: it helps in getting
-follow-up OLAP queries answered faster, because you skip the overhead for getting resources from Yarn.
+http://tinkerpop.apache.org/docs/x.y.z/reference/#sparkgraphcomputer[SparkGraphComputer]: it helps in getting
+follow-up OLAP queries answered faster, because you skip the overhead for getting resources from YARN.
 
 Additional configuration options
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-This recipe does most of the graph configuration in the gremlin console so that environment variables can be used and
+This recipe does most of the graph configuration in the Gremlin Console so that environment variables can be used and
 the chance of configuration mistakes is minimal. Once you have your setup working, it is probably easier to make a copy
 of the `conf/hadoop/hadoop-gryo.properties` file and put the property values specific to your environment there. This is
 also the right moment to take a look at the `spark-defaults.xml` file of your cluster, in particular the settings for
-the Spark History Service, which allows you to access logs of finished jobs via the Yarn resource manager UI.
+the https://spark.apache.org/docs/latest/monitoring.html[Spark History Service], which allows you to access logs of
+finished applications via the YARN resource manager UI.
 
-This recipe uses the gremlin console, but things should not be very different for your own JVM-based application,
+This recipe uses the Gremlin Console, but things should not be very different for your own JVM-based application,
 as long as you do not use the `spark-submit` or `spark-shell` commands. You will also want to check the additional
 runtime dependencies listed in the `Gremlin-Plugin-Dependencies` section of the manifest file in the `spark-gremlin`
 jar.


[08/14] tinkerpop git commit: Moved gremlin.spark.persistContext back to recipe

Posted by ok...@apache.org.
Moved gremlin.spark.persistContext back to recipe


Project: http://git-wip-us.apache.org/repos/asf/tinkerpop/repo
Commit: http://git-wip-us.apache.org/repos/asf/tinkerpop/commit/db859fb5
Tree: http://git-wip-us.apache.org/repos/asf/tinkerpop/tree/db859fb5
Diff: http://git-wip-us.apache.org/repos/asf/tinkerpop/diff/db859fb5

Branch: refs/heads/master
Commit: db859fb51cc0c37e28747f68b50a11dfb3799413
Parents: b0b087e
Author: HadoopMarc <vt...@xs4all.nl>
Authored: Tue Oct 3 21:20:16 2017 +0200
Committer: HadoopMarc <vt...@xs4all.nl>
Committed: Thu Oct 19 16:11:57 2017 +0200

----------------------------------------------------------------------
 docs/src/recipes/olap-spark-yarn.asciidoc  | 6 ++++++
 hadoop-gremlin/conf/hadoop-gryo.properties | 2 +-
 2 files changed, 7 insertions(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/tinkerpop/blob/db859fb5/docs/src/recipes/olap-spark-yarn.asciidoc
----------------------------------------------------------------------
diff --git a/docs/src/recipes/olap-spark-yarn.asciidoc b/docs/src/recipes/olap-spark-yarn.asciidoc
index 464470f..f55edaa 100644
--- a/docs/src/recipes/olap-spark-yarn.asciidoc
+++ b/docs/src/recipes/olap-spark-yarn.asciidoc
@@ -104,6 +104,7 @@ conf.setProperty('spark.yarn.appMasterEnv.CLASSPATH', "./$archive/*:$hadoopConfD
 conf.setProperty('spark.executor.extraClassPath', "./$archive/*:$hadoopConfDir")
 conf.setProperty('spark.driver.extraLibraryPath', "$hadoop/lib/native:$hadoop/lib/native/Linux-amd64-64")
 conf.setProperty('spark.executor.extraLibraryPath', "$hadoop/lib/native:$hadoop/lib/native/Linux-amd64-64")
+conf.setProperty('gremlin.spark.persistContext', 'true')
 graph = GraphFactory.open(conf)
 g = graph.traversal().withComputer(SparkGraphComputer)
 g.V().group().by(values('name')).by(both().count())
@@ -125,9 +126,14 @@ as the directory named `spark-gremlin.zip` in the Yarn containers. The `spark.ex
 `spark.yarn.appMasterEnv.CLASSPATH` properties point to the files inside this archive.
 This is why they contain the `./spark-gremlin.zip/*` item. Just because a Spark executor got the archive with
 jars loaded into its container, does not mean it knows how to access them.
+
 Also the `HADOOP_GREMLIN_LIBS` mechanism is not used because it can not work for Spark on Yarn as implemented (jars
 added to the `SparkContext` are not available to the Yarn application master).
 
+The `gremlin.spark.persistContext` property is explained in the reference documentation of
+http://tinkerpop.apache.org/docs/current/reference/#sparkgraphcomputer[SparkGraphComputer]: it helps in getting
+follow-up OLAP queries answered faster, because you skip the overhead for getting resources from Yarn.
+
 Additional configuration options
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 This recipe does most of the graph configuration in the gremlin console so that environment variables can be used and

http://git-wip-us.apache.org/repos/asf/tinkerpop/blob/db859fb5/hadoop-gremlin/conf/hadoop-gryo.properties
----------------------------------------------------------------------
diff --git a/hadoop-gremlin/conf/hadoop-gryo.properties b/hadoop-gremlin/conf/hadoop-gryo.properties
index 7990431..aaab24d 100644
--- a/hadoop-gremlin/conf/hadoop-gryo.properties
+++ b/hadoop-gremlin/conf/hadoop-gryo.properties
@@ -29,8 +29,8 @@ gremlin.hadoop.outputLocation=output
 spark.master=local[4]
 spark.executor.memory=1g
 spark.serializer=org.apache.tinkerpop.gremlin.spark.structure.io.gryo.GryoSerializer
-gremlin.spark.persistContext=true
 # gremlin.spark.graphStorageLevel=MEMORY_AND_DISK
+# gremlin.spark.persistContext=true
 # gremlin.spark.graphWriter=org.apache.tinkerpop.gremlin.spark.structure.io.PersistedOutputRDD
 # gremlin.spark.persistStorageLevel=DISK_ONLY
 # spark.kryo.registrationRequired=true


[02/14] tinkerpop git commit: Added spark-yarn recipe and missing manifest items in spark-gremlin

Posted by ok...@apache.org.
Added spark-yarn recipe and missing manifest items in spark-gremlin


Project: http://git-wip-us.apache.org/repos/asf/tinkerpop/repo
Commit: http://git-wip-us.apache.org/repos/asf/tinkerpop/commit/9127a4fd
Tree: http://git-wip-us.apache.org/repos/asf/tinkerpop/tree/9127a4fd
Diff: http://git-wip-us.apache.org/repos/asf/tinkerpop/diff/9127a4fd

Branch: refs/heads/master
Commit: 9127a4fd4f1a954a888eb946157c1409c93e2cef
Parents: 99d0814
Author: HadoopMarc <vt...@xs4all.nl>
Authored: Sun Sep 10 14:45:45 2017 +0200
Committer: HadoopMarc <vt...@xs4all.nl>
Committed: Thu Oct 12 21:54:06 2017 +0200

----------------------------------------------------------------------
 docs/preprocessor/preprocess-file.sh       |   2 +-
 docs/src/recipes/index.asciidoc            |   2 +
 docs/src/recipes/olap-spark-yarn.asciidoc  | 145 ++++++++++++++++++++++++
 hadoop-gremlin/conf/hadoop-gryo.properties |   2 +-
 pom.xml                                    |   1 +
 spark-gremlin/pom.xml                      |   5 +-
 6 files changed, 153 insertions(+), 4 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/tinkerpop/blob/9127a4fd/docs/preprocessor/preprocess-file.sh
----------------------------------------------------------------------
diff --git a/docs/preprocessor/preprocess-file.sh b/docs/preprocessor/preprocess-file.sh
index 16612fe..0ca534a 100755
--- a/docs/preprocessor/preprocess-file.sh
+++ b/docs/preprocessor/preprocess-file.sh
@@ -107,7 +107,7 @@ if [ ! ${SKIP} ] && [ $(grep -c '^\[gremlin' ${input}) -gt 0 ]; then
       mv ext/spark-gremlin .ext/
       cat ext/plugins.txt | tee .ext/plugins.all | grep -Fv 'SparkGremlinPlugin' > .ext/plugins.txt
       ;;
-    "implementations-hadoop-start" | "implementations-hadoop-end" | "implementations-spark" | "implementations-giraph")
+    "implementations-hadoop-start" | "implementations-hadoop-end" | "implementations-spark" | "implementations-giraph" | "olap-spark-yarn")
       # deactivate Neo4j plugin to prevent version conflicts between TinkerPop's Spark jars and Neo4j's Spark jars
       mkdir .ext
       mv ext/neo4j-gremlin .ext/

http://git-wip-us.apache.org/repos/asf/tinkerpop/blob/9127a4fd/docs/src/recipes/index.asciidoc
----------------------------------------------------------------------
diff --git a/docs/src/recipes/index.asciidoc b/docs/src/recipes/index.asciidoc
index f549b1f..bb88301 100644
--- a/docs/src/recipes/index.asciidoc
+++ b/docs/src/recipes/index.asciidoc
@@ -58,6 +58,8 @@ include::traversal-induced-values.asciidoc[]
 
 include::tree.asciidoc[]
 
+include::olap-spark-yarn.asciidoc[]
+
 = Implementation Recipes
 
 include::style-guide.asciidoc[]

http://git-wip-us.apache.org/repos/asf/tinkerpop/blob/9127a4fd/docs/src/recipes/olap-spark-yarn.asciidoc
----------------------------------------------------------------------
diff --git a/docs/src/recipes/olap-spark-yarn.asciidoc b/docs/src/recipes/olap-spark-yarn.asciidoc
new file mode 100644
index 0000000..fbe9c8f
--- /dev/null
+++ b/docs/src/recipes/olap-spark-yarn.asciidoc
@@ -0,0 +1,145 @@
+////
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+////
+[[olap-spark-yarn]]
+OLAP traversals with Spark on Yarn
+----------------------------------
+
+Tinkerpop's combination of http://tinkerpop.apache.org/docs/current/reference/#sparkgraphcomputer[SparkGraphComputer]
+and http://tinkerpop.apache.org/docs/current/reference/#_properties_files[HadoopGraph] allows for running
+distributed, analytical graph queries (OLAP) on a computer cluster. The
+http://tinkerpop.apache.org/docs/current/reference/#sparkgraphcomputer[reference documentation] covers the cases
+where Spark runs locally or where the cluster is managed by a Spark server. However, many users can only run OLAP jobs
+via the http://hadoop.apache.org/[Hadoop 2.x] Resource Manager (Yarn), which requires `SparkGraphComputer` to be
+configured differently. This recipe describes this configuration.
+
+Approach
+~~~~~~~~
+
+Most configuration problems of Tinkerpop with Spark on Yarn stem from three reasons:
+
+1. `SparkGraphComputer` creates its own `SparkContext` so it does not get any configs from the usual `spark-submit` command.
+2. The Tinkerpop Spark-plugin did not include Spark Yarn runtime dependencies until version 3.2.7/3.3.1.
+3. Resolving reason 2 by adding the cluster's `spark-assembly` jar to the classpath creates a host of version
+conflicts, because Spark 1.x dependency versions have remained frozen since 2014.
+
+The current recipe follows a minimalist approach in which no dependencies are added to the dependencies
+included in the Tinkerpop binary distribution. The Hadoop cluster's Spark installation is completely ignored. This
+approach minimizes the chance of dependency version conflicts.
+
+Prerequisites
+~~~~~~~~~~~~~
+This recipe is suitable for both a real external Hadoop cluster and a local pseudo-cluster. While the recipe is
+maintained for the vanilla Hadoop pseudo-cluster, it has been reported to work on real clusters with Hadoop
+distributions from various vendors.
+
+If you want to try the recipe on a local Hadoop pseudo-cluster, the easiest way to set one up is to follow the
+install script at https://github.com/apache/tinkerpop/blob/x.y.z/docker/hadoop/install.sh
+and the `start hadoop` section of https://github.com/apache/tinkerpop/blob/x.y.z/docker/scripts/build.sh.
+
+This recipe assumes that you installed the gremlin console with the
+http://tinkerpop.apache.org/docs/x.y.z/reference/#spark-plugin[spark plugin] (the
+http://tinkerpop.apache.org/docs/x.y.z/reference/#hadoop-plugin[hadoop plugin] is optional). Your Hadoop cluster
+may have been configured to use file compression, e.g. LZO compression. If so, you need to copy the relevant
+jar (e.g. `hadoop-lzo-*.jar`) to the gremlin console's `ext/spark-gremlin/lib` folder.
+
+For starting the gremlin console in the right environment, create a shell script (e.g. `bin/spark-yarn.sh`) with the
+contents below. Of course, actual values for `GREMLIN_HOME`, `HADOOP_HOME` and `HADOOP_CONF_DIR` need to be adapted to
+your particular environment.
+
+[source]
+----
+#!/bin/bash
+# Variables to be adapted to the actual environment
+GREMLIN_HOME=/home/yourdir/lib/apache-tinkerpop-gremlin-console-x.y.z-standalone
+export HADOOP_HOME=/usr/local/lib/hadoop-2.7.2
+export HADOOP_CONF_DIR=/usr/local/lib/hadoop-2.7.2/etc/hadoop
+
+# Have Tinkerpop find the hadoop cluster configs and hadoop native libraries
+export CLASSPATH=$HADOOP_CONF_DIR
+export JAVA_OPTIONS="-Djava.library.path=$HADOOP_HOME/lib/native:$HADOOP_HOME/lib/native/Linux-amd64-64"
+
+# Start gremlin-console without getting the HADOOP_GREMLIN_LIBS warning
+cd $GREMLIN_HOME
+[ ! -e empty ] && mkdir empty
+export HADOOP_GREMLIN_LIBS=$GREMLIN_HOME/empty
+bin/gremlin.sh
+----
+
+Running the job
+~~~~~~~~~~~~~~~
+
+You can now run a gremlin OLAP query with Spark on Yarn:
+
+[source]
+----
+$ hdfs dfs -put data/tinkerpop-modern.kryo .
+$ . bin/spark-yarn.sh
+----
+
+[gremlin-groovy]
+----
+hadoop = System.getenv('HADOOP_HOME')
+hadoopConfDir = System.getenv('HADOOP_CONF_DIR')
+archive = 'spark-gremlin.zip'
+archivePath = "/tmp/$archive"
+['bash', '-c', "rm $archivePath 2>/dev/null; cd ext/spark-gremlin/lib && zip $archivePath *.jar"].execute()
+conf = new PropertiesConfiguration('conf/hadoop/hadoop-gryo.properties')
+conf.setProperty('spark.master', 'yarn-client')
+conf.setProperty('spark.yarn.dist.archives', "$archivePath")
+conf.setProperty('spark.yarn.appMasterEnv.CLASSPATH', "./$archive/*:$hadoopConfDir")
+conf.setProperty('spark.executor.extraClassPath', "./$archive/*:$hadoopConfDir")
+conf.setProperty('spark.driver.extraLibraryPath', "$hadoop/lib/native:$hadoop/lib/native/Linux-amd64-64")
+conf.setProperty('spark.executor.extraLibraryPath', "$hadoop/lib/native:$hadoop/lib/native/Linux-amd64-64")
+graph = GraphFactory.open(conf)
+g = graph.traversal().withComputer(SparkGraphComputer)
+g.V().group().by(values('name')).by(both().count())
+----
+
+If you run into exceptions, the best way to see what is going on is to look into the Yarn Resource Manager UI
+(e.g. http://rm.your.domain:8088/cluster) to find the `applicationId` and get the logs using
+`yarn logs -applicationId application_1498627870374_0008` from the command shell.
+
+Explanation
+~~~~~~~~~~~
+
+This recipe does not require running the `bin/hadoop/init-tp-spark.sh` script described in the
+http://tinkerpop.apache.org/docs/current/reference/#sparkgraphcomputer[reference documentation] and thus is also
+valid for cluster users without access permissions to do so.
+Rather, it exploits the `spark.yarn.dist.archives` property, which points to an archive of jars on the local file
+system that is loaded into the various Yarn containers. As a result, the `spark-gremlin.zip` archive becomes available
+as the directory named `spark-gremlin.zip` in the Yarn containers. The `spark.executor.extraClassPath` and
+`spark.yarn.appMasterEnv.CLASSPATH` properties point to the files inside this archive.
+This is why they contain the `./spark-gremlin.zip/*` item: just because a Spark executor has the archive with
+jars loaded into its container does not mean it knows how to access them (a verification sketch follows this diff).
+Also, the `HADOOP_GREMLIN_LIBS` mechanism is not used because it cannot work for Spark on Yarn as implemented (jars
+added to the `SparkContext` are not available to the Yarn application master).
+
+Additional configuration options
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+This recipe does most of the graph configuration in the gremlin console so that environment variables can be used and
+the chance of configuration mistakes is minimal. Once you have your setup working, it is probably easier to make a copy
+of the `conf/hadoop/hadoop-gryo.properties` file and put the property values specific to your environment there
+(a sketch follows below). This is also the right moment to take a look at your cluster's `spark-defaults.conf` file,
+in particular the settings for the Spark History Server, which lets you access logs of finished jobs via the Yarn resource manager UI.
+
+This recipe uses the gremlin console, but things should not be very different for your own JVM-based application,
+as long as you do not use the `spark-submit` or `spark-shell` commands.
+
+You may not like the idea that the Hadoop and Spark jars from the Tinkerpop distribution differ from the versions in
+your cluster. If so, just build Tinkerpop from source with the corresponding dependencies changed in the various `pom.xml`
+files (e.g. `spark-core_2.10-1.6.1-some-vendor.jar` instead of `spark-core_2.10-1.6.1.jar`). Of course, Tinkerpop will
+only build for exactly matching or slightly differing artifact versions.
\ No newline at end of file
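
As a sanity check of the mechanism explained above, the archive handed to `spark.yarn.dist.archives` can be listed
before submitting a job. A minimal sketch, assuming `/tmp/spark-gremlin.zip` has been built as in the recipe:

[source]
----
import java.util.zip.ZipFile

// List the jars that will be unpacked into the Yarn containers
new ZipFile('/tmp/spark-gremlin.zip').entries().each { entry ->
    println entry.name   // expect flat jar names, one per plugin dependency
}
----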

http://git-wip-us.apache.org/repos/asf/tinkerpop/blob/9127a4fd/hadoop-gremlin/conf/hadoop-gryo.properties
----------------------------------------------------------------------
diff --git a/hadoop-gremlin/conf/hadoop-gryo.properties b/hadoop-gremlin/conf/hadoop-gryo.properties
index c156a98..ec56abc 100644
--- a/hadoop-gremlin/conf/hadoop-gryo.properties
+++ b/hadoop-gremlin/conf/hadoop-gryo.properties
@@ -29,11 +29,11 @@ gremlin.hadoop.outputLocation=output
 ####################################
 spark.master=local[4]
 spark.executor.memory=1g
+gremlin.spark.persistContext=true
 spark.serializer=org.apache.spark.serializer.KryoSerializer
 spark.kryo.registrator=org.apache.tinkerpop.gremlin.spark.structure.io.gryo.GryoRegistrator
 # spark.serializer=org.apache.tinkerpop.gremlin.spark.structure.io.gryo.GryoSerializer (3.2.x model)
 # gremlin.spark.graphStorageLevel=MEMORY_AND_DISK
-# gremlin.spark.persistContext=true
 # gremlin.spark.graphWriter=org.apache.tinkerpop.gremlin.spark.structure.io.PersistedOutputRDD
 # gremlin.spark.persistStorageLevel=DISK_ONLY
 # spark.kryo.registrationRequired=true
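
Once an environment-specific copy of this properties file exists, opening the graph no longer needs the console-side
`conf.setProperty` calls. A minimal sketch; the properties filename is hypothetical:

[source]
----
// hadoop-gryo-yarn.properties is a hypothetical copy of hadoop-gryo.properties
// holding the spark.* values from the recipe; GraphFactory accepts a path directly
graph = GraphFactory.open('conf/hadoop/hadoop-gryo-yarn.properties')
g = graph.traversal().withComputer(SparkGraphComputer)
----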

http://git-wip-us.apache.org/repos/asf/tinkerpop/blob/9127a4fd/pom.xml
----------------------------------------------------------------------
diff --git a/pom.xml b/pom.xml
index a31951e..7fd4168 100644
--- a/pom.xml
+++ b/pom.xml
@@ -148,6 +148,7 @@ limitations under the License.
         <netty.version>4.0.50.Final</netty.version>
         <slf4j.version>1.7.21</slf4j.version>
         <snakeyaml.version>1.15</snakeyaml.version>
+        <spark.version>2.2.0</spark.version>
 
         <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
         <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>

http://git-wip-us.apache.org/repos/asf/tinkerpop/blob/9127a4fd/spark-gremlin/pom.xml
----------------------------------------------------------------------
diff --git a/spark-gremlin/pom.xml b/spark-gremlin/pom.xml
index ee81096..4dc00d4 100644
--- a/spark-gremlin/pom.xml
+++ b/spark-gremlin/pom.xml
@@ -108,7 +108,7 @@
         <dependency>
             <groupId>org.apache.spark</groupId>
             <artifactId>spark-core_2.11</artifactId>
-            <version>2.2.0</version>
+            <version>${spark.version}</version>
             <exclusions>
                 <!-- self conflicts -->
                 <exclusion>
@@ -441,7 +441,8 @@
                 <configuration>
                     <archive>
                         <manifestEntries>
-                            <Gremlin-Plugin-Dependencies>org.apache.hadoop:hadoop-client:${hadoop.version}
+                            <Gremlin-Plugin-Dependencies>
+                                org.apache.hadoop:hadoop-client:${hadoop.version};org.apache.hadoop:hadoop-yarn-server-web-proxy:${hadoop.version};org.apache.spark:spark-yarn_2.10:${spark.version}
                             </Gremlin-Plugin-Dependencies>
                             <!-- deletes the servlet-api jar from the path after install - causes conflicts -->
                             <Gremlin-Plugin-Paths>servlet-api-2.5.jar=</Gremlin-Plugin-Paths>


[13/14] tinkerpop git commit: Merge branch 'tp32'

Posted by ok...@apache.org.
Merge branch 'tp32'


Project: http://git-wip-us.apache.org/repos/asf/tinkerpop/repo
Commit: http://git-wip-us.apache.org/repos/asf/tinkerpop/commit/d603d11d
Tree: http://git-wip-us.apache.org/repos/asf/tinkerpop/tree/d603d11d
Diff: http://git-wip-us.apache.org/repos/asf/tinkerpop/diff/d603d11d

Branch: refs/heads/master
Commit: d603d11d57d6b71cea3bc45b1bdab17a0f52adc3
Parents: f243669 a451ca5
Author: Marko A. Rodriguez <ok...@gmail.com>
Authored: Wed Nov 1 12:06:19 2017 -0600
Committer: Marko A. Rodriguez <ok...@gmail.com>
Committed: Wed Nov 1 12:06:19 2017 -0600

----------------------------------------------------------------------
 CHANGELOG.asciidoc                        |   2 +
 docs/preprocessor/preprocess-file.sh      |   2 +-
 docs/src/recipes/index.asciidoc           |   2 +
 docs/src/recipes/olap-spark-yarn.asciidoc | 157 +++++++++++++++++++++++++
 pom.xml                                   |   1 +
 spark-gremlin/pom.xml                     |   7 +-
 6 files changed, 167 insertions(+), 4 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/tinkerpop/blob/d603d11d/CHANGELOG.asciidoc
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/tinkerpop/blob/d603d11d/pom.xml
----------------------------------------------------------------------
diff --cc pom.xml
index a31951e,5a93109..7fd4168
--- a/pom.xml
+++ b/pom.xml
@@@ -148,6 -149,7 +148,7 @@@ limitations under the License
          <netty.version>4.0.50.Final</netty.version>
          <slf4j.version>1.7.21</slf4j.version>
          <snakeyaml.version>1.15</snakeyaml.version>
 -        <spark.version>1.6.1</spark.version>
++        <spark.version>2.2.0</spark.version>
  
          <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
          <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>

http://git-wip-us.apache.org/repos/asf/tinkerpop/blob/d603d11d/spark-gremlin/pom.xml
----------------------------------------------------------------------
diff --cc spark-gremlin/pom.xml
index ee81096,77a455b..e70d27d
--- a/spark-gremlin/pom.xml
+++ b/spark-gremlin/pom.xml
@@@ -107,8 -103,8 +107,8 @@@
          <!-- SPARK -->
          <dependency>
              <groupId>org.apache.spark</groupId>
 -            <artifactId>spark-core_2.10</artifactId>
 +            <artifactId>spark-core_2.11</artifactId>
-             <version>2.2.0</version>
+             <version>${spark.version}</version>
              <exclusions>
                  <!-- self conflicts -->
                  <exclusion>
@@@ -441,7 -382,8 +441,8 @@@
                  <configuration>
                      <archive>
                          <manifestEntries>
-                             <Gremlin-Plugin-Dependencies>org.apache.hadoop:hadoop-client:${hadoop.version}
+                             <Gremlin-Plugin-Dependencies>
 -                                org.apache.hadoop:hadoop-client:${hadoop.version};org.apache.hadoop:hadoop-yarn-server-web-proxy:${hadoop.version};org.apache.spark:spark-yarn_2.10:${spark.version}
++                                org.apache.hadoop:hadoop-client:${hadoop.version};org.apache.hadoop:hadoop-yarn-server-web-proxy:${hadoop.version};org.apache.spark:spark-yarn_2.11:${spark.version}
                              </Gremlin-Plugin-Dependencies>
                              <!-- deletes the servlet-api jar from the path after install - causes conflicts -->
                              <Gremlin-Plugin-Paths>servlet-api-2.5.jar=</Gremlin-Plugin-Paths>
@@@ -452,4 -394,4 +453,4 @@@
              </plugin>
          </plugins>
      </build>
--</project>
++</project>


[14/14] tinkerpop git commit: Merge branch 'spark-yarn-recipe-tp33' of https://github.com/vtslab/incubator-tinkerpop

Posted by ok...@apache.org.
Merge branch 'spark-yarn-recipe-tp33' of https://github.com/vtslab/incubator-tinkerpop


Project: http://git-wip-us.apache.org/repos/asf/tinkerpop/repo
Commit: http://git-wip-us.apache.org/repos/asf/tinkerpop/commit/c29e0f6a
Tree: http://git-wip-us.apache.org/repos/asf/tinkerpop/tree/c29e0f6a
Diff: http://git-wip-us.apache.org/repos/asf/tinkerpop/diff/c29e0f6a

Branch: refs/heads/master
Commit: c29e0f6ae77ea9e38a70c7b022c50223140612f0
Parents: d603d11 9ca94f5
Author: Marko A. Rodriguez <ok...@gmail.com>
Authored: Wed Nov 1 12:07:31 2017 -0600
Committer: Marko A. Rodriguez <ok...@gmail.com>
Committed: Wed Nov 1 12:07:31 2017 -0600

----------------------------------------------------------------------
 docs/src/recipes/olap-spark-yarn.asciidoc | 29 +++++++++++++-------------
 1 file changed, 15 insertions(+), 14 deletions(-)
----------------------------------------------------------------------



[07/14] tinkerpop git commit: Corrected TinkerPop naming, added changelog and pointer to manifest

Posted by ok...@apache.org.
Corrected TinkerPop naming, added changelog and pointer to manifest


Project: http://git-wip-us.apache.org/repos/asf/tinkerpop/repo
Commit: http://git-wip-us.apache.org/repos/asf/tinkerpop/commit/b0b087e4
Tree: http://git-wip-us.apache.org/repos/asf/tinkerpop/tree/b0b087e4
Diff: http://git-wip-us.apache.org/repos/asf/tinkerpop/diff/b0b087e4

Branch: refs/heads/master
Commit: b0b087e45ff11f0eaa84e256bdd2cf527b746320
Parents: 3396e92
Author: HadoopMarc <vt...@xs4all.nl>
Authored: Sun Oct 1 16:38:42 2017 +0200
Committer: HadoopMarc <vt...@xs4all.nl>
Committed: Thu Oct 19 16:11:57 2017 +0200

----------------------------------------------------------------------
 CHANGELOG.asciidoc                        |  2 ++
 docs/src/recipes/olap-spark-yarn.asciidoc | 20 +++++++++++---------
 2 files changed, 13 insertions(+), 9 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/tinkerpop/blob/b0b087e4/CHANGELOG.asciidoc
----------------------------------------------------------------------
diff --git a/CHANGELOG.asciidoc b/CHANGELOG.asciidoc
index 8ecff99..7bb05ea 100644
--- a/CHANGELOG.asciidoc
+++ b/CHANGELOG.asciidoc
@@ -45,6 +45,8 @@ image::https://raw.githubusercontent.com/apache/tinkerpop/master/docs/static/ima
 * Fixed a bug that prevented Gremlin from ordering lists and streams made of mixed number types.
 * Fixed a bug where `keepLabels` were being corrupted because a defensive copy was not being made when they were being set by `PathRetractionStrategy`.
 * Cancel script evaluation timeout in `GremlinExecutor` when script evaluation finished.
+* Added a recipe for OLAP traversals with Spark on Yarn.
+* Added `spark-yarn` dependencies to the manifest of `spark-gremlin`.
 
 [[release-3-2-6]]
 === TinkerPop 3.2.6 (Release Date: August 21, 2017)

http://git-wip-us.apache.org/repos/asf/tinkerpop/blob/b0b087e4/docs/src/recipes/olap-spark-yarn.asciidoc
----------------------------------------------------------------------
diff --git a/docs/src/recipes/olap-spark-yarn.asciidoc b/docs/src/recipes/olap-spark-yarn.asciidoc
index fbe9c8f..464470f 100644
--- a/docs/src/recipes/olap-spark-yarn.asciidoc
+++ b/docs/src/recipes/olap-spark-yarn.asciidoc
@@ -18,7 +18,7 @@ limitations under the License.
 OLAP traversals with Spark on Yarn
 ----------------------------------
 
-Tinkerpop's combination of http://tinkerpop.apache.org/docs/current/reference/#sparkgraphcomputer[SparkGraphComputer]
+TinkerPop's combination of http://tinkerpop.apache.org/docs/current/reference/#sparkgraphcomputer[SparkGraphComputer]
 and http://tinkerpop.apache.org/docs/current/reference/#_properties_files[HadoopGraph] allows for running
 distributed, analytical graph queries (OLAP) on a computer cluster. The
 http://tinkerpop.apache.org/docs/current/reference/#sparkgraphcomputer[reference documentation] covers the cases
@@ -29,15 +29,15 @@ configured differently. This recipe describes this configuration.
 Approach
 ~~~~~~~~
 
-Most configuration problems of Tinkerpop with Spark on Yarn stem from three reasons:
+Most configuration problems of TinkerPop with Spark on Yarn stem from three reasons:
 
 1. `SparkGraphComputer` creates its own `SparkContext` so it does not get any configs from the usual `spark-submit` command.
-2. The Tinkerpop Spark-plugin did not include Spark Yarn runtime dependencies until version 3.2.7/3.3.1.
+2. The TinkerPop Spark plugin did not include Spark on Yarn runtime dependencies until version 3.2.7/3.3.1.
 3. Resolving reason 2 by adding the cluster's `spark-assembly` jar to the classpath creates a host of version
 conflicts, because Spark 1.x dependency versions have remained frozen since 2014.
 
 The current recipe follows a minimalist approach in which no dependencies are added to the dependencies
-included in the Tinkerpop binary distribution. The Hadoop cluster's Spark installation is completely ignored. This
+included in the TinkerPop binary distribution. The Hadoop cluster's Spark installation is completely ignored. This
 approach minimizes the chance of dependency version conflicts.
 
 Prerequisites
@@ -68,7 +68,7 @@ GREMLIN_HOME=/home/yourdir/lib/apache-tinkerpop-gremlin-console-x.y.z-standalone
 export HADOOP_HOME=/usr/local/lib/hadoop-2.7.2
 export HADOOP_CONF_DIR=/usr/local/lib/hadoop-2.7.2/etc/hadoop
 
-# Have Tinkerpop find the hadoop cluster configs and hadoop native libraries
+# Have TinkerPop find the hadoop cluster configs and hadoop native libraries
 export CLASSPATH=$HADOOP_CONF_DIR
 export JAVA_OPTIONS="-Djava.library.path=$HADOOP_HOME/lib/native:$HADOOP_HOME/lib/native/Linux-amd64-64"
 
@@ -137,9 +137,11 @@ also the right moment to take a look at the `spark-defaults.xml` file of your cl
 the Spark History Service, which allows you to access logs of finished jobs via the Yarn resource manager UI.
 
 This recipe uses the gremlin console, but things should not be very different for your own JVM-based application,
-as long as you do not use the `spark-submit` or `spark-shell` commands.
+as long as you do not use the `spark-submit` or `spark-shell` commands. You will also want to check the additional
+runtime dependencies listed in the `Gremlin-Plugin-Dependencies` section of the manifest file in the `spark-gremlin`
+jar.
 
-You may not like the idea that the Hadoop and Spark jars from the Tinkerpop distribution differ from the versions in
-your cluster. If so, just build Tinkerpop from source with the corresponding dependencies changed in the various `pom.xml`
-files (e.g. `spark-core_2.10-1.6.1-some-vendor.jar` instead of `spark-core_2.10-1.6.1.jar`). Of course, Tinkerpop will
+You may not like the idea that the Hadoop and Spark jars from the TinkerPop distribution differ from the versions in
+your cluster. If so, just build TinkerPop from source with the corresponding dependencies changed in the various `pom.xml`
+files (e.g. `spark-core_2.10-1.6.1-some-vendor.jar` instead of `spark-core_2.10-1.6.1.jar`). Of course, TinkerPop will
 only build for exactly matching or slightly differing artifact versions.
\ No newline at end of file


[10/14] tinkerpop git commit: Escaped example url

Posted by ok...@apache.org.
Escaped example url


Project: http://git-wip-us.apache.org/repos/asf/tinkerpop/repo
Commit: http://git-wip-us.apache.org/repos/asf/tinkerpop/commit/6110354f
Tree: http://git-wip-us.apache.org/repos/asf/tinkerpop/tree/6110354f
Diff: http://git-wip-us.apache.org/repos/asf/tinkerpop/diff/6110354f

Branch: refs/heads/master
Commit: 6110354febb38d7565673e5c98bab92736527c41
Parents: cd65378
Author: HadoopMarc <vt...@xs4all.nl>
Authored: Wed Oct 18 15:54:37 2017 +0200
Committer: HadoopMarc <vt...@xs4all.nl>
Committed: Thu Oct 19 16:11:57 2017 +0200

----------------------------------------------------------------------
 docs/src/recipes/olap-spark-yarn.asciidoc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/tinkerpop/blob/6110354f/docs/src/recipes/olap-spark-yarn.asciidoc
----------------------------------------------------------------------
diff --git a/docs/src/recipes/olap-spark-yarn.asciidoc b/docs/src/recipes/olap-spark-yarn.asciidoc
index 6755e5f..e2be4f6 100644
--- a/docs/src/recipes/olap-spark-yarn.asciidoc
+++ b/docs/src/recipes/olap-spark-yarn.asciidoc
@@ -113,7 +113,7 @@ g.V().group().by(values('name')).by(both().count())
 If you run into exceptions, you will have to dig into the logs. You can do this from the command line with
 `yarn application -list -appStates ALL` to find the `applicationId`, while the logs are available with
 `yarn logs -applicationId application_1498627870374_0008`. Alternatively, you can inspect the logs via
-the YARN Resource Manager UI (e.g. http://rm.your.domain:8088/cluster), provided that YARN was configured with the
+the YARN Resource Manager UI (e.g. \http://rm.your.domain:8088/cluster), provided that YARN was configured with the
 `yarn.log-aggregation-enable` property set to `true`. See the Spark documentation for
 https://spark.apache.org/docs/latest/running-on-yarn.html#debugging-your-application[additional hints].
 


[06/14] tinkerpop git commit: Escaped example url

Posted by ok...@apache.org.
Escaped example url


Project: http://git-wip-us.apache.org/repos/asf/tinkerpop/repo
Commit: http://git-wip-us.apache.org/repos/asf/tinkerpop/commit/9ca94f52
Tree: http://git-wip-us.apache.org/repos/asf/tinkerpop/tree/9ca94f52
Diff: http://git-wip-us.apache.org/repos/asf/tinkerpop/diff/9ca94f52

Branch: refs/heads/master
Commit: 9ca94f522ae6dba803fb4756e63f9143a81aedd9
Parents: e30cb7d
Author: HadoopMarc <vt...@xs4all.nl>
Authored: Wed Oct 18 14:53:22 2017 +0200
Committer: HadoopMarc <vt...@xs4all.nl>
Committed: Wed Oct 18 14:53:22 2017 +0200

----------------------------------------------------------------------
 docs/src/recipes/olap-spark-yarn.asciidoc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/tinkerpop/blob/9ca94f52/docs/src/recipes/olap-spark-yarn.asciidoc
----------------------------------------------------------------------
diff --git a/docs/src/recipes/olap-spark-yarn.asciidoc b/docs/src/recipes/olap-spark-yarn.asciidoc
index 88f3885..85bfe18 100644
--- a/docs/src/recipes/olap-spark-yarn.asciidoc
+++ b/docs/src/recipes/olap-spark-yarn.asciidoc
@@ -113,7 +113,7 @@ g.V().group().by(values('name')).by(both().count())
 If you run into exceptions, you will have to dig into the logs. You can do this from the command line with
 `yarn application -list -appStates ALL` to find the `applicationId`, while the logs are available with
 `yarn logs -applicationId application_1498627870374_0008`. Alternatively, you can inspect the logs via
-the YARN Resource Manager UI (e.g. http://rm.your.domain:8088/cluster), provided that YARN was configured with the
+the YARN Resource Manager UI (e.g. \http://rm.your.domain:8088/cluster), provided that YARN was configured with the
 `yarn.log-aggregation-enable` property set to `true`. See the Spark documentation for
 https://spark.apache.org/docs/latest/running-on-yarn.html#debugging-your-application[additional hints].
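
Whether log aggregation is enabled can also be checked from the gremlin console itself, since the recipe already puts
`HADOOP_CONF_DIR` on the classpath. A minimal sketch, assuming the Hadoop Yarn classes shipped with the spark plugin
are available as well:

[source]
----
import org.apache.hadoop.yarn.conf.YarnConfiguration

// YarnConfiguration picks up yarn-site.xml from the classpath (HADOOP_CONF_DIR)
yarnConf = new YarnConfiguration()
println yarnConf.getBoolean('yarn.log-aggregation-enable', false)
----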