Posted to dev@tinkerpop.apache.org by Stephen Mallette <sp...@gmail.com> on 2015/10/01 19:55:46 UTC

Gremlin Server + Spark

I spent some time this morning seeing if I could execute a Spark-based
traversal remotely in Gremlin Server - and it worked!  I didn't really have
to make any code changes for it to happen, though I did make some minor
adjustments to streamline a few things.  All in all, there wasn't much to it.

Here's a step-by-step of the process.  First of all, these instructions
are based on the latest 3.1.0-SNAPSHOT (master branch).  You need to be
sure you have Hadoop 2.x running in pseudo-distributed mode - in other
words, be sure that you can execute a Spark traversal locally from the
Gremlin Console (if that works, it should work for Gremlin Server).

To get started, you'll need to open two terminals - one for Gremlin Server
and the other for Gremlin Console.  I had to set the CLASSPATH in both:

export CLASSPATH=/hadoop-2.7.1/etc/hadoop

and set HADOOP_GREMLIN_LIBS in the appropriate terminal (console and server,
respectively):

export HADOOP_GREMLIN_LIBS=/apache-gremlin-console-3.1.0-SNAPSHOT/ext/spark-gremlin/lib
export HADOOP_GREMLIN_LIBS=/apache-gremlin-server-3.1.0-SNAPSHOT/ext/spark-gremlin/lib

I then started up bin/gremlin.sh and installed the spark plugin:

gremlin> :install org.apache.tinkerpop spark-gremlin 3.1.0-SNAPSHOT

I restarted the console as instructed and activated the plugins:

gremlin> :plugin use tinkerpop.hadoop
==>tinkerpop.hadoop activated
gremlin> :plugin use tinkerpop.spark
==>tinkerpop.spark activated

I then copied my graph data to HDFS:

gremlin> hdfs.copyFromLocal('data/tinkerpop-modern.kryo','tinkerpop-modern.kryo')
==>null
gremlin> hdfs.ls()
==>rw-r--r-- smallette supergroup 781 tinkerpop-modern.kryo
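
As an aside, the file name matters because the server-side
conf/hadoop-gryo.properties points gremlin.hadoop.inputLocation at it.
Trimmed down, that properties file looks roughly like this (check the
packaged file for the real thing):

gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphInputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoInputFormat
gremlin.hadoop.graphOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
gremlin.hadoop.inputLocation=tinkerpop-modern.kryo
gremlin.hadoop.outputLocation=output
spark.master=local[4]
spark.serializer=org.apache.spark.serializer.KryoSerializer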

Then we switch gears to the terminal that will run Gremlin Server and
"install" spark:

bin/gremlin-server.sh -i org.apache.tinkerpop spark-gremlin 3.1.0-SNAPSHOT

which will copy down appropriate dependencies the same way the Gremlin
Console :install command does.  Then we start Gremlin Server with:

bin/gremlin-server.sh conf/gremlin-server-spark.yaml

This new config file is now packaged with the Gremlin Server distribution
when you build it.  It's pretty well documented and should show you how
things work.  You can see it here:

https://github.com/apache/incubator-tinkerpop/blob/a4c70eb24c0934e70bf2cde2ca169ea52cb989b7/gremlin-server/conf/gremlin-server-spark.yaml
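
The short version of what's in it: the graph is built from
conf/hadoop-gryo.properties, the hadoop and spark plugins are enabled, and
an init script sets up the traversal source.  Heavily trimmed, it's roughly:

graphs: {
  graph: conf/hadoop-gryo.properties}
plugins:
  - tinkerpop.hadoop
  - tinkerpop.spark
scriptEngines: {
  gremlin-groovy: {
    scripts: [scripts/spark.groovy]}}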

Now in the Gremlin Server terminal you should see the standard startup
logging which should include some lines like this:

[INFO] GraphManager - Graph [graph] was successfully configured via
[conf/hadoop-gryo.properties].
...
[INFO] GremlinExecutor - Initialized gremlin-groovy ScriptEngine with
scripts/spark.groovy
...
[INFO] ServerGremlinExecutor - A GraphTraversalSource is now bound to [g]
with graphtraversalsource[hadoopgraph[gryoinputformat->gryooutputformat],
sparkgraphcomputer]
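
That last binding comes from the init script.  Conceptually,
scripts/spark.groovy just returns a map of global bindings so that "g" ends
up as a traversal source backed by SparkGraphComputer - something along
these lines (see the packaged script for the exact contents):

// rough sketch of the init script's job: expose "g" as a global binding
// whose traversals execute over SparkGraphComputer
def globals = [:]
globals << [g : graph.traversal(computer(SparkGraphComputer))]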

If you see that much, you should be good to go.  Head back to the Gremlin
Console terminal and do:

gremlin> :remote connect tinkerpop.server conf/remote.yaml
==>Connected - localhost/127.0.0.1:8182
gremlin> :> g.V().count()
==>6
gremlin> :> g.V().out().out().values('name')
==>lop
==>ripple
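
The :remote/:> commands are just the Gremlin Driver under the covers, so
the same submission from code looks roughly like this (Groovy, assuming the
server is on the default localhost:8182):

import org.apache.tinkerpop.gremlin.driver.Cluster

cluster = Cluster.open()    // defaults to localhost:8182
client = cluster.connect()
// the script string is evaluated server-side against the bound "g"
println client.submit("g.V().count()").all().get()[0].getLong()
cluster.close()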

It was good to confirm that this works as expected.  This information
should be especially useful to those not on the JVM who need a way to
execute OLAP-based traversals.
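
For example, if you swap the channelizer in that yaml over to the
HttpChannelizer (the sample config uses the default WebSocket channelizer),
a non-JVM client can just POST a script over HTTP - something like:

curl -X POST -H "Content-Type: application/json" \
  -d '{"gremlin":"g.V().count()"}' "http://localhost:8182"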

Stephen