Posted to dev@tinkerpop.apache.org by Marko Rodriguez <ok...@gmail.com> on 2015/04/30 18:01:35 UTC

OLAP and Graph.io()

Hi,

Stephen is interested in making sure that Graph.io() works cleanly for both OLTP and OLAP. In particular, io().readGraph() and io().writeGraph() should work seamlessly in both OLTP and OLAP situations, much like Gremlin does for traversals.

------------

OLAP graph writing will occur via a (yet to be written) BulkLoaderVertexProgram. BulkLoaderVertexProgram takes a Graph (with vertices/edges) and writes it to another Graph. In essence, there are two graphs: the first has the data and the second is empty. I always expected this would typically happen via Hadoop (HadoopGraph) -> VendorDatabase (VendorGraph). However, while most distributed graph database vendors will leverage Hadoop/Giraph/Spark for their OLAP bulk loading operations because of HDFS, we can't always assume this -- especially in the context of OLAP Graph.io().

Thus, BulkLoaderVertexProgram shouldn't just operate Graph->Graph; it should optionally be able to stream in a file as well, File->Graph. This means we have to get into the concept of "InputSplits" at the gremlin-core level. A quick-and-dirty approach is to simply load the graph data from a file serially; this is not the optimal solution, but it can move us forward on the Graph.io() API.

On to the API of Graph.io(). This would mean that, as with Traversal, the user can specify a GraphComputer to use for readGraph():

	graph.io().readGraph(file, graph.compute(MyGraphComputer.class))

For writeGraph():

	graph.io().writeGraph(file, graph.compute(MyGraphComputer.class))


Here, "file" can be a directory in both situations, and each "worker" of the GraphComputer reads/writes a split.
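The split-per-worker idea can be sketched in plain Java. This is only an illustration of the assignment scheme, not the actual InputSplit machinery: the round-robin strategy, the "part-N.kryo" file names, and the worker index/count parameters are all assumptions for the sketch.

```java
import java.nio.file.*;
import java.util.*;
import java.util.stream.*;

public class SplitAssignment {
    // Hypothetical: deterministically assign the files under a directory to
    // workers round-robin, so each GraphComputer "worker" reads/writes its
    // own split without coordination.
    public static List<Path> splitFor(Path dir, int workerIndex, int workerCount) throws Exception {
        List<Path> files;
        try (Stream<Path> s = Files.list(dir)) {
            files = s.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }
        List<Path> mine = new ArrayList<>();
        for (int i = workerIndex; i < files.size(); i += workerCount) {
            mine.add(files.get(i));   // worker k takes files k, k+n, k+2n, ...
        }
        return mine;
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("splits");
        for (int i = 0; i < 5; i++) Files.createFile(dir.resolve("part-" + i + ".kryo"));
        // With 2 workers over 5 files, worker 0 gets 3 files and worker 1 gets 2.
        System.out.println(splitFor(dir, 0, 2).size() + " " + splitFor(dir, 1, 2).size());
    }
}
```

Because every worker sorts the same listing and takes a disjoint stride, no two workers touch the same file, which is the property the "each worker reads/writes a split" idea needs.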

Thoughts?,
Marko.

http://markorodriguez.com


Re: OLAP and Graph.io()

Posted by Marko Rodriguez <ok...@gmail.com>.
Hey Stephen,

Yes, it would need access to the custom serializers. Perhaps this is the same problem that we are having with GryoPool. GryoInput/OutputFormats use Gryo, but they don't have an easy way of getting the serializers.

?,
Marko.

http://markorodriguez.com

On Apr 30, 2015, at 12:46 PM, Stephen Mallette <sp...@gmail.com> wrote:

> It would be nice if this change could just be treated as an overload to
> read/writeGraph() so in that sense it sounds good to me.  I presume that
> the underlying work done by the BulkLoader/DumperVertexProgram would simply
> be using the existing read/writeVertex functions on the GraphReader/Writer
> implementations themselves.  In that way,
> the BulkLoader/DumperVertexProgram would have access to any custom
> serializers required by the Graph instance.

Re: OLAP and Graph.io()

Posted by Stephen Mallette <sp...@gmail.com>.
It would be nice if this change could just be treated as an overload to
read/writeGraph() so in that sense it sounds good to me.  I presume that
the underlying work done by the BulkLoader/DumperVertexProgram would simply
be using the existing read/writeVertex functions on the GraphReader/Writer
implementations themselves.  In that way,
the BulkLoader/DumperVertexProgram would have access to any custom
serializers required by the Graph instance.
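Stephen's per-vertex point can be sketched in plain Java. The Vertex record, the GraphWriter interface, and the line-based toy serializer below are simplified stand-ins for the real GraphReader/GraphWriter types, not their actual signatures:

```java
import java.io.*;
import java.util.*;

public class VertexDump {
    // Simplified stand-in for a graph Vertex.
    record Vertex(long id, String label) {}

    interface GraphWriter {
        // The per-vertex hook a BulkLoader/DumperVertexProgram worker would call;
        // a Graph-provided implementation carries any custom serializers.
        void writeVertex(OutputStream out, Vertex v) throws IOException;
    }

    // Each worker streams its vertices one at a time through the writer,
    // rather than serializing the whole graph in one shot.
    static void dump(Iterable<Vertex> vertices, GraphWriter writer, OutputStream out) throws IOException {
        for (Vertex v : vertices) writer.writeVertex(out, v);
    }

    public static void main(String[] args) throws IOException {
        List<Vertex> vertices = List.of(new Vertex(1, "person"), new Vertex(2, "software"));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // A toy line-based writer standing in for a Gryo/GraphSON vertex writer.
        GraphWriter toy = (o, v) -> o.write((v.id() + ":" + v.label() + "\n").getBytes());
        dump(vertices, toy, out);
        System.out.print(out);
    }
}
```

The design point is that the loop never knows how a vertex is encoded; whatever serializers the Graph registered in its writer are applied vertex by vertex.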


Re: OLAP and Graph.io()

Posted by Stephen Mallette <sp...@gmail.com>.
Matt, I hadn't thought of the streaming use case for read/write.  IO has
been more about serialization of graphs, their related elements, and in some
cases, serialization of arbitrary objects (as needed by Gremlin Server).
Not sure if the streaming use case is a TinkerPop responsibility or
not... more thinking required, I guess.

As it stands, readGraph() is really not meant for incremental loading (it
doesn't expect mutations to be occurring beyond what it is doing itself).
I suppose it could/should be adapted as such with some more code - perhaps
that is something for post-GA.  As for writeGraph() and simultaneous
operations, the write occurs over a simple iteration of all vertices, so I
would suspect that if there were other transactional contexts at play,
such changes would be visible, depending on how the Graph implementation
handles such things.

This read/writeGraph() feature is meant for small graphs right now.  That's
why Marko started this thread: we need better methods for dealing with more
complex loads.

On Thu, Apr 30, 2015 at 12:50 PM, Matt Frantz <ma...@gmail.com>
wrote:

> The questions that occur to me are somewhat broad, so I apologize if they
> distract from the intended topic.  However, I do feel they are related to a
> proper IO design.
>
> Would the readGraph API be suitable for a continuously streaming loader,
> e.g. to parse an activity stream, or is it only used for finite inputs?
>
> Would the writeGraph API be suitable for a continuously streaming
> extractor, e.g. to write an external transaction log, or to synchronize a
> replica, or is it only used for finite outputs?
>
> What is the expected behavior when there is simultaneous access, e.g.
> queries occurring during readGraph, or mutations occurring during
> writeGraph?

Re: OLAP and Graph.io()

Posted by Matt Frantz <ma...@gmail.com>.
The questions that occur to me are somewhat broad, so I apologize if they
distract from the intended topic.  However, I do feel they are related to a
proper IO design.

Would the readGraph API be suitable for a continuously streaming loader,
e.g. to parse an activity stream, or is it only used for finite inputs?

Would the writeGraph API be suitable for a continuously streaming
extractor, e.g. to write an external transaction log, or to synchronize a
replica, or is it only used for finite outputs?

What is the expected behavior when there is simultaneous access, e.g.
queries occurring during readGraph, or mutations occurring during
writeGraph?
