Posted to commits@tinkerpop.apache.org by ok...@apache.org on 2015/05/01 23:44:30 UTC

incubator-tinkerpop git commit: clean up on AbstractHadoopGraphComputer and added section to docs on how vendors can leverage Hadoop-Gremlin for their graph system.

Repository: incubator-tinkerpop
Updated Branches:
  refs/heads/master f4449db4a -> 6b93fab6a


clean up on AbstractHadoopGraphComputer and added section to docs on how vendors can leverage Hadoop-Gremlin for their graph system.


Project: http://git-wip-us.apache.org/repos/asf/incubator-tinkerpop/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-tinkerpop/commit/6b93fab6
Tree: http://git-wip-us.apache.org/repos/asf/incubator-tinkerpop/tree/6b93fab6
Diff: http://git-wip-us.apache.org/repos/asf/incubator-tinkerpop/diff/6b93fab6

Branch: refs/heads/master
Commit: 6b93fab6a793cc8bdd97db27519ba54fc5587125
Parents: f4449db
Author: Marko A. Rodriguez <ok...@gmail.com>
Authored: Fri May 1 15:44:17 2015 -0600
Committer: Marko A. Rodriguez <ok...@gmail.com>
Committed: Fri May 1 15:44:26 2015 -0600

----------------------------------------------------------------------
 docs/src/implementations.asciidoc               | 46 +++++++++++++++-----
 .../computer/AbstractHadoopGraphComputer.java   |  2 +-
 .../computer/giraph/GiraphGraphComputer.java    |  9 ++--
 .../computer/spark/SparkGraphComputer.java      | 13 ++----
 4 files changed, 44 insertions(+), 26 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-tinkerpop/blob/6b93fab6/docs/src/implementations.asciidoc
----------------------------------------------------------------------
diff --git a/docs/src/implementations.asciidoc b/docs/src/implementations.asciidoc
index c524d15..836b808 100644
--- a/docs/src/implementations.asciidoc
+++ b/docs/src/implementations.asciidoc
@@ -733,6 +733,18 @@ image:adjacency-list.png[width=300,float=right] Hadoop-Gremlin provides various
 
 {empty} +
 
+[[gryo-io-format]]
+Gryo I/O Format
+^^^^^^^^^^^^^^^
+
+* **InputFormat**: `org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoInputFormat`
+* **OutputFormat**: `org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat`
+
+<<gryo-reader-writer,Gryo>> is a binary graph format that leverages link:https://github.com/EsotericSoftware/kryo[Kryo] to make a compact, binary representation of a vertex. It is recommended that users leverage Gryo given its space/time savings over text-based representations.
+
+NOTE: The `GryoInputFormat` is splittable.
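
For orientation, a minimal sketch of a `HadoopGraph` properties file that selects the Gryo formats might look as follows; the `gremlin.hadoop.inputLocation`/`gremlin.hadoop.outputLocation` keys and the file paths are illustrative assumptions, not part of this commit:

[source,properties]
----
# sketch of a HadoopGraph configuration using the Gryo I/O format
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphInputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoInputFormat
gremlin.hadoop.graphOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
# placeholder paths for the Gryo-encoded input and the job output directory
gremlin.hadoop.inputLocation=tinkerpop-modern.kryo
gremlin.hadoop.outputLocation=output
----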
+
+[[graphson-io-format]]
 GraphSON I/O Format
 ^^^^^^^^^^^^^^^^^^^
 
@@ -751,16 +763,7 @@ The data below represents an adjacency list representation of the classic Tinker
 {"inE":[{"inV":5,"inVLabel":"vertex","outVLabel":"vertex","id":10,"label":"created","type":"edge","outV":4,"properties":{"weight":1.0}}],"outE":[],"id":5,"label":"vertex","type":"vertex","properties":{"name":[{"id":8,"label":"name","value":"ripple","properties":{}}],"lang":[{"id":9,"label":"lang","value":"java","properties":{}}]}}
 {"inE":[],"outE":[{"inV":3,"inVLabel":"vertex","outVLabel":"vertex","id":12,"label":"created","type":"edge","outV":6,"properties":{"weight":0.2}}],"id":6,"label":"vertex","type":"vertex","properties":{"name":[{"id":10,"label":"name","value":"peter","properties":{}}],"age":[{"id":11,"label":"age","value":35,"properties":{}}]}}
 
-Gryo I/O Format
-^^^^^^^^^^^^^^^
-
-* **InputFormat**: `org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoInputFormat`
-* **OutputFormat**: `org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat`
-
-<<gryo-reader-writer,Gryo>> is a binary graph format that leverages link:https://github.com/EsotericSoftware/kryo[Kryo] to make a compact, binary representation of a vertex. It is recommended that users leverage Gryo given its space/time savings over text-based representations.
-
-NOTE: The `GryoInputFormat` is splittable.
-
+[[script-io-format]]
 Script I/O Format
 ^^^^^^^^^^^^^^^^^
 
@@ -1076,3 +1079,26 @@ Vertex 4 ("josh") is isolated below:
 }
 ----
 
+Hadoop-Gremlin for Vendors
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Hadoop-Gremlin is centered around `InputFormats` and `OutputFormats`. If a 3rd-party vendor wishes to leverage Hadoop-Gremlin (and its respective `GraphComputer` engines), then they simply need to provide, at minimum, a Hadoop 1.x `InputFormat<NullWritable,VertexWritable>` for their graph system. If the vendor wishes to persist computed results back to their graph system (and not just to HDFS via a `FileOutputFormat`), then a vendor-specific `OutputFormat<NullWritable,VertexWritable>` must be developed as well.
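
To make the contract concrete, here is a minimal sketch of such a provider-side class. `VendorInputFormat` is a hypothetical name, the method bodies are placeholders rather than a working provider, and the sketch assumes the `org.apache.hadoop.mapreduce` API that the Hadoop-Gremlin computers in this commit read from:

[source,java]
----
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.tinkerpop.gremlin.hadoop.structure.io.VertexWritable;

import java.util.Collections;
import java.util.List;

// hypothetical skeleton of a vendor-specific InputFormat (names and bodies are placeholders)
public final class VendorInputFormat extends InputFormat<NullWritable, VertexWritable> {

    @Override
    public List<InputSplit> getSplits(final JobContext context) {
        // a real provider would partition its graph into splits that can be read in parallel
        return Collections.emptyList();
    }

    @Override
    public RecordReader<NullWritable, VertexWritable> createRecordReader(final InputSplit split, final TaskAttemptContext context) {
        // a real provider would return a RecordReader that emits one VertexWritable
        // (a vertex together with its incident edges and properties) per record
        throw new UnsupportedOperationException("supply a vendor-specific RecordReader");
    }
}
----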
+
+Conceptually, `HadoopGraph` is a wrapper around a `Configuration` object. There is no "data" in the `HadoopGraph` as the `InputFormat` specifies where and how to get the graph data at OLAP (and OLTP) runtime. Thus, `HadoopGraph` is a small object with little overhead. Vendors should view `HadoopGraph` as the gateway to the OLAP features offered by Hadoop-Gremlin. An example vendor-specific `Graph.compute(Class<? extends GraphComputer> graphComputerClass)` method may look as follows:
+
+[source,java]
+public <C extends GraphComputer> C compute(final Class<C> graphComputerClass) throws IllegalArgumentException {
+  if(SparkGraphComputer.class.isAssignableFrom(graphComputerClass))
+    return new SparkGraphComputer(HadoopGraph.open(this.configuration()));
+  else if(GiraphGraphComputer.class.isAssignableFrom(graphComputerClass))
+    return new GiraphGraphComputer(HadoopGraph.open(this.configuration()));
+  else if(...) // vendor specific graph computer classes
+    // return vendor specific instance
+  else
+    throw Graph.Exceptions.graphDoesNotSupportProvidedGraphComputer(graphComputerClass);
+}
+
+Note that the configurations for Hadoop are assumed to be in the `Graph.configuration()` object. If this is not the case, then the `Configuration` provided to `HadoopGraph.open()` should be dynamically created within the `compute()` method. It is in the provided configuration that `HadoopGraph` gets the various properties which determine how to read and write data to and from Hadoop, for instance `gremlin.hadoop.graphInputFormat` and `gremlin.hadoop.graphOutputFormat`.
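
A minimal sketch of building such a `Configuration` on the fly is shown below; `VendorGraphComputerFactory` and the vendor format class names are hypothetical placeholders:

[source,java]
----
import org.apache.commons.configuration.BaseConfiguration;
import org.apache.commons.configuration.Configuration;
import org.apache.tinkerpop.gremlin.hadoop.process.computer.spark.SparkGraphComputer;
import org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph;

// hypothetical helper that assembles the Hadoop properties when the vendor's
// Graph.configuration() object does not already carry them
public final class VendorGraphComputerFactory {

    private VendorGraphComputerFactory() {
    }

    public static SparkGraphComputer createSparkGraphComputer() {
        final Configuration configuration = new BaseConfiguration();
        configuration.setProperty("gremlin.graph", HadoopGraph.class.getName());
        // placeholder class names for the vendor's own InputFormat/OutputFormat
        configuration.setProperty("gremlin.hadoop.graphInputFormat", "com.vendor.VendorInputFormat");
        configuration.setProperty("gremlin.hadoop.graphOutputFormat", "com.vendor.VendorOutputFormat");
        return new SparkGraphComputer(HadoopGraph.open(configuration));
    }
}
----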
+
+IMPORTANT: A vendor's `OutputFormat` should implement the `PersistResultGraphAware` interface, which determines the persistence options available to the user. For the standard file-based `OutputFormats` provided by Hadoop-Gremlin (e.g. <<gryo-io-format,`GryoOutputFormat`>>, <<graphson-io-format,`GraphSONOutputFormat`>>, and <<script-io-format,`ScriptOutputFormat`>>), `ResultGraph.ORIGINAL` is not supported as the original graph data files are not random access and are, in essence, immutable. Thus, these file-based `OutputFormats` only support `ResultGraph.NEW`, which creates a copy of the data specified by the `Persist` enum.
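
As a rough sketch, a vendor whose backend is mutable and random access might advertise support for both result graph options. The class below shows the interface in isolation (a real provider would implement it on its `OutputFormat`); the package and method signature of `PersistResultGraphAware` are assumed from the file-based formats and should be verified against the actual interface:

[source,java]
----
import org.apache.tinkerpop.gremlin.hadoop.structure.io.PersistResultGraphAware;
import org.apache.tinkerpop.gremlin.process.computer.GraphComputer;

// hypothetical sketch: a mutable, random-access backend can also support ResultGraph.ORIGINAL,
// unlike the immutable file-based OutputFormats described above
public final class VendorPersistSupport implements PersistResultGraphAware {

    @Override
    public boolean supportsResultGraphPersistCombination(final GraphComputer.ResultGraph resultGraph,
                                                         final GraphComputer.Persist persist) {
        // every ResultGraph/Persist combination is acceptable for this backend
        return true;
    }
}
----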
+

http://git-wip-us.apache.org/repos/asf/incubator-tinkerpop/blob/6b93fab6/hadoop-gremlin/src/main/java/org/apache/tinkerpop/gremlin/hadoop/process/computer/AbstractHadoopGraphComputer.java
----------------------------------------------------------------------
diff --git a/hadoop-gremlin/src/main/java/org/apache/tinkerpop/gremlin/hadoop/process/computer/AbstractHadoopGraphComputer.java b/hadoop-gremlin/src/main/java/org/apache/tinkerpop/gremlin/hadoop/process/computer/AbstractHadoopGraphComputer.java
index c730594..8bda57d 100644
--- a/hadoop-gremlin/src/main/java/org/apache/tinkerpop/gremlin/hadoop/process/computer/AbstractHadoopGraphComputer.java
+++ b/hadoop-gremlin/src/main/java/org/apache/tinkerpop/gremlin/hadoop/process/computer/AbstractHadoopGraphComputer.java
@@ -44,7 +44,7 @@ import java.util.Set;
  */
 public abstract class AbstractHadoopGraphComputer implements GraphComputer {
 
-    private final Logger logger;
+    protected final Logger logger;
     protected final HadoopGraph hadoopGraph;
     protected boolean executed = false;
     protected final Set<MapReduce> mapReducers = new HashSet<>();

http://git-wip-us.apache.org/repos/asf/incubator-tinkerpop/blob/6b93fab6/hadoop-gremlin/src/main/java/org/apache/tinkerpop/gremlin/hadoop/process/computer/giraph/GiraphGraphComputer.java
----------------------------------------------------------------------
diff --git a/hadoop-gremlin/src/main/java/org/apache/tinkerpop/gremlin/hadoop/process/computer/giraph/GiraphGraphComputer.java b/hadoop-gremlin/src/main/java/org/apache/tinkerpop/gremlin/hadoop/process/computer/giraph/GiraphGraphComputer.java
index d187166..d03fa37 100644
--- a/hadoop-gremlin/src/main/java/org/apache/tinkerpop/gremlin/hadoop/process/computer/giraph/GiraphGraphComputer.java
+++ b/hadoop-gremlin/src/main/java/org/apache/tinkerpop/gremlin/hadoop/process/computer/giraph/GiraphGraphComputer.java
@@ -51,8 +51,6 @@ import org.apache.tinkerpop.gremlin.process.computer.MapReduce;
 import org.apache.tinkerpop.gremlin.process.computer.VertexProgram;
 import org.apache.tinkerpop.gremlin.process.computer.util.DefaultComputerResult;
 import org.apache.tinkerpop.gremlin.process.computer.util.MapMemory;
-import org.slf4j.Logger;
-import org.slf4j.LoggerFactory;
 
 import java.io.File;
 import java.util.HashSet;
@@ -66,7 +64,6 @@ import java.util.stream.Stream;
  */
 public class GiraphGraphComputer extends AbstractHadoopGraphComputer implements GraphComputer, Tool {
 
-    public static final Logger LOGGER = LoggerFactory.getLogger(GiraphGraphComputer.class);
     protected GiraphConfiguration giraphConfiguration = new GiraphConfiguration();
     private MapMemory memory = new MapMemory();
 
@@ -145,7 +142,7 @@ public class GiraphGraphComputer extends AbstractHadoopGraphComputer implements
                 else
                     FileOutputFormat.setOutputPath(job.getInternalJob(), outputPath);
                 job.getInternalJob().setJarByClass(GiraphGraphComputer.class);
-                LOGGER.info(Constants.GREMLIN_HADOOP_GIRAPH_JOB_PREFIX + this.vertexProgram);
+                this.logger.info(Constants.GREMLIN_HADOOP_GIRAPH_JOB_PREFIX + this.vertexProgram);
                 // execute the job and wait until it completes (if it fails, throw an exception)
                 if (!job.run(true))
                     throw new IllegalStateException("The GiraphGraphComputer job failed -- aborting all subsequent MapReduce jobs");
@@ -185,7 +182,7 @@ public class GiraphGraphComputer extends AbstractHadoopGraphComputer implements
         if (this.giraphConfiguration.getBoolean(Constants.GREMLIN_HADOOP_JARS_IN_DISTRIBUTED_CACHE, true)) {
             final String hadoopGremlinLocalLibs = System.getenv(Constants.HADOOP_GREMLIN_LIBS);
             if (null == hadoopGremlinLocalLibs)
-                LOGGER.warn(Constants.HADOOP_GREMLIN_LIBS + " is not set -- proceeding regardless");
+                this.logger.warn(Constants.HADOOP_GREMLIN_LIBS + " is not set -- proceeding regardless");
             else {
                 final String[] paths = hadoopGremlinLocalLibs.split(":");
                 for (final String path : paths) {
@@ -205,7 +202,7 @@ public class GiraphGraphComputer extends AbstractHadoopGraphComputer implements
                             }
                         });
                     } else {
-                        LOGGER.warn(path + " does not reference a valid directory -- proceeding regardless");
+                        this.logger.warn(path + " does not reference a valid directory -- proceeding regardless");
                     }
                 }
             }

http://git-wip-us.apache.org/repos/asf/incubator-tinkerpop/blob/6b93fab6/hadoop-gremlin/src/main/java/org/apache/tinkerpop/gremlin/hadoop/process/computer/spark/SparkGraphComputer.java
----------------------------------------------------------------------
diff --git a/hadoop-gremlin/src/main/java/org/apache/tinkerpop/gremlin/hadoop/process/computer/spark/SparkGraphComputer.java b/hadoop-gremlin/src/main/java/org/apache/tinkerpop/gremlin/hadoop/process/computer/spark/SparkGraphComputer.java
index 325427e..fff1039 100644
--- a/hadoop-gremlin/src/main/java/org/apache/tinkerpop/gremlin/hadoop/process/computer/spark/SparkGraphComputer.java
+++ b/hadoop-gremlin/src/main/java/org/apache/tinkerpop/gremlin/hadoop/process/computer/spark/SparkGraphComputer.java
@@ -42,8 +42,6 @@ import org.apache.tinkerpop.gremlin.process.computer.Memory;
 import org.apache.tinkerpop.gremlin.process.computer.VertexProgram;
 import org.apache.tinkerpop.gremlin.process.computer.util.DefaultComputerResult;
 import org.apache.tinkerpop.gremlin.process.computer.util.MapMemory;
-import org.slf4j.Logger;
-import org.slf4j.LoggerFactory;
 import scala.Tuple2;
 
 import java.io.File;
@@ -56,9 +54,6 @@ import java.util.stream.Stream;
  */
 public final class SparkGraphComputer extends AbstractHadoopGraphComputer {
 
-    public static final Logger LOGGER = LoggerFactory.getLogger(SparkGraphComputer.class);
-    protected final SparkConf configuration = new SparkConf();
-
     public SparkGraphComputer(final HadoopGraph hadoopGraph) {
         super(hadoopGraph);
     }
@@ -94,7 +89,7 @@ public final class SparkGraphComputer extends AbstractHadoopGraphComputer {
                     // execute the vertex program and map reducers and if there is a failure, auto-close the spark context
                     try (final JavaSparkContext sparkContext = new JavaSparkContext(sparkConfiguration)) {
                         // add the project jars to the cluster
-                        SparkGraphComputer.loadJars(sparkContext, hadoopConfiguration);
+                        this.loadJars(sparkContext, hadoopConfiguration);
                         // create a message-passing friendly rdd from the hadoop input format
                         final JavaPairRDD<Object, VertexWritable> graphRDD = sparkContext.newAPIHadoopRDD(hadoopConfiguration,
                                 (Class<InputFormat<NullWritable, VertexWritable>>) hadoopConfiguration.getClass(Constants.GREMLIN_HADOOP_GRAPH_INPUT_FORMAT, InputFormat.class),
@@ -167,11 +162,11 @@ public final class SparkGraphComputer extends AbstractHadoopGraphComputer {
 
     /////////////////
 
-    private static void loadJars(final JavaSparkContext sparkContext, final Configuration hadoopConfiguration) {
+    private void loadJars(final JavaSparkContext sparkContext, final Configuration hadoopConfiguration) {
         if (hadoopConfiguration.getBoolean(Constants.GREMLIN_HADOOP_JARS_IN_DISTRIBUTED_CACHE, true)) {
             final String hadoopGremlinLocalLibs = System.getenv(Constants.HADOOP_GREMLIN_LIBS);
             if (null == hadoopGremlinLocalLibs)
-                LOGGER.warn(Constants.HADOOP_GREMLIN_LIBS + " is not set -- proceeding regardless");
+                this.logger.warn(Constants.HADOOP_GREMLIN_LIBS + " is not set -- proceeding regardless");
             else {
                 final String[] paths = hadoopGremlinLocalLibs.split(":");
                 for (final String path : paths) {
@@ -179,7 +174,7 @@ public final class SparkGraphComputer extends AbstractHadoopGraphComputer {
                     if (file.exists())
                         Stream.of(file.listFiles()).filter(f -> f.getName().endsWith(Constants.DOT_JAR)).forEach(f -> sparkContext.addJar(f.getAbsolutePath()));
                     else
-                        LOGGER.warn(path + " does not reference a valid directory -- proceeding regardless");
+                        this.logger.warn(path + " does not reference a valid directory -- proceeding regardless");
                 }
             }
         }