Posted to dev@tinkerpop.apache.org by "Marko A. Rodriguez (JIRA)" <ji...@apache.org> on 2015/12/02 21:23:11 UTC

[jira] [Commented] (TINKERPOP3-1015) InputRDD and InputFormat to load into HadoopGraph from any Graph System

    [ https://issues.apache.org/jira/browse/TINKERPOP3-1015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15036535#comment-15036535 ] 

Marko A. Rodriguez commented on TINKERPOP3-1015:
------------------------------------------------

Would people be interested in this feature? It would allow you to, for example, use Spark with Neo4j. Another thing we could do to make this efficient is:

	List<Iterator<Vertex>> Graph.vertexSplits(int numberOfSplits)

Then each graph provider can specify how to do parallel reads. The default implementation would be:
	
	List<Iterator<Vertex>> splits = new ArrayList<>(numberOfSplits);
	splits.add(this.vertices());
	return splits;
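The proposed {{vertexSplits}} method doesn't exist yet, but the heart of a provider-side override is just partitioning one vertex iterator into N independent iterators. A minimal sketch of that round-robin partitioning, written with a generic element type so it runs without TinkerPop on the classpath (in a real provider, {{T}} would be {{Vertex}} and the source would be {{this.vertices()}}):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public final class SplitDemo {

    // Round-robin partition of a single iterator into numberOfSplits iterators.
    // A real graph provider would likely split on id ranges or storage partitions
    // instead of materializing everything in memory; this only shows the contract.
    static <T> List<Iterator<T>> splits(final Iterator<T> source, final int numberOfSplits) {
        final List<List<T>> buckets = new ArrayList<>(numberOfSplits);
        for (int i = 0; i < numberOfSplits; i++)
            buckets.add(new ArrayList<>());
        int i = 0;
        while (source.hasNext())
            buckets.get(i++ % numberOfSplits).add(source.next());
        final List<Iterator<T>> result = new ArrayList<>(numberOfSplits);
        for (final List<T> bucket : buckets)
            result.add(bucket.iterator());
        return result;
    }

    public static void main(final String[] args) {
        final List<Iterator<Integer>> s = splits(List.of(1, 2, 3, 4, 5).iterator(), 2);
        int total = 0;
        for (final Iterator<Integer> it : s)
            while (it.hasNext())
                total += it.next();
        System.out.println(s.size() + " splits, sum " + total); // 2 splits, sum 15
    }
}
```

Each split could then back one Spark partition, so a provider that knows its physical layout gets parallel reads for free.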

> InputRDD and InputFormat to load into HadoopGraph from any Graph System
> -----------------------------------------------------------------------
>
>                 Key: TINKERPOP3-1015
>                 URL: https://issues.apache.org/jira/browse/TINKERPOP3-1015
>             Project: TinkerPop 3
>          Issue Type: Improvement
>          Components: hadoop
>    Affects Versions: 3.1.0-incubating
>            Reporter: Marko A. Rodriguez
>
> I just recently added {{ToyGraphInputRDD}} to test InputRDD stuff against the full test suite. Check out how it works:
> {code}
> public final class ToyGraphInputRDD implements InputRDD {
>     public static final String GREMLIN_SPARK_TOY_GRAPH = "gremlin.spark.toyGraph";
>     @Override
>     public JavaPairRDD<Object, VertexWritable> readGraphRDD(final Configuration configuration, final JavaSparkContext sparkContext) {
>         final List<Vertex> vertices;
>         if (configuration.getProperty(GREMLIN_SPARK_TOY_GRAPH).equals(LoadGraphWith.GraphData.MODERN.toString()))
>             vertices = IteratorUtils.list(TinkerFactory.createModern().vertices());
>         else if (configuration.getProperty(GREMLIN_SPARK_TOY_GRAPH).equals(LoadGraphWith.GraphData.CLASSIC.toString()))
>             vertices = IteratorUtils.list(TinkerFactory.createClassic().vertices());
>         else if (configuration.getProperty(GREMLIN_SPARK_TOY_GRAPH).equals(LoadGraphWith.GraphData.CREW.toString()))
>             vertices = IteratorUtils.list(TinkerFactory.createTheCrew().vertices());
>         else if (configuration.getProperty(GREMLIN_SPARK_TOY_GRAPH).equals(LoadGraphWith.GraphData.GRATEFUL.toString())) {
>             try {
>                 final Graph graph = TinkerGraph.open();
>                 graph.io(GryoIo.build()).readGraph(GryoResourceAccess.class.getResource("grateful-dead.kryo").getFile());
>                 vertices = IteratorUtils.list(graph.vertices());
>             } catch (final IOException e) {
>                 throw new IllegalStateException(e.getMessage(), e);
>             }
>         } else
>             throw new IllegalArgumentException("No legal toy graph was provided to load: " + configuration.getProperty(GREMLIN_SPARK_TOY_GRAPH));
>         return sparkContext.parallelize(vertices.stream().map(VertexWritable::new).collect(Collectors.toList())).mapToPair(vertex -> new Tuple2<>(vertex.get().id(), vertex));
>     }
> }
> {code}
> In principle, we could have a {{DefaultInputRDD}} and {{DefaultInputFormat}} that do this:
> {code}
> public final class DefaultInputRDD implements InputRDD {
>     @Override
>     public JavaPairRDD<Object, VertexWritable> readGraphRDD(final Configuration configuration, final JavaSparkContext sparkContext) {
>         final Graph graph = GraphFactory.open(configuration);
>         return sparkContext.parallelize(IteratorUtils.list(graph.vertices()).stream().map(VertexWritable::new).collect(Collectors.toList())).mapToPair(vertex -> new Tuple2<>(vertex.get().id(), vertex));
>     }
> }
> {code}
> It would be a serial/single-threaded load, but it would allow any OLTP graph system to use Spark/Giraph/etc. 
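> Wiring such a {{DefaultInputRDD}} in would presumably just reuse the existing Spark input property in the HadoopGraph configuration ({{gremlin.spark.graphInputRDD}} is the key used by spark-gremlin in 3.1.x; the class's package below is illustrative, since the class doesn't exist yet):
> {code}
> gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
> gremlin.spark.graphInputRDD=org.apache.tinkerpop.gremlin.spark.structure.io.DefaultInputRDD
> {code}
> The same configuration would also need to carry whatever properties {{GraphFactory.open(configuration)}} requires to open the source OLTP graph.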



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)