Posted to dev@tinkerpop.apache.org by Stephen Mallette <sp...@gmail.com> on 2018/06/07 15:53:41 UTC

[DISCUSS] Bulk Loading

TinkerPop tries to generalize various aspects of graph computing and does a
pretty good job of doing so, but every so often we try to generalize
something and it just doesn't work the way we'd like. Indexing was one such
casualty, if you need an example to consider, but I think that our attempt
at bulk loading is falling into that area as well, specifically with
BulkLoaderVertexProgram (BLVP):

http://tinkerpop.apache.org/docs/current/reference/#bulkloadervertexprogram

What I'm seeing is that graph providers are offering their own bulk loading
tools which are inevitably faster and/or easier to use than BLVP. Here are
some examples:

CosmosDB: https://github.com/Microsoft/Microsoft.Azure.Graphs.BulkImport
Neptune: https://docs.aws.amazon.com/neptune/latest/userguide/bulk-load.html
Neo4j: https://neo4j.com/blog/bulk-data-import-neo4j-3-0/
DSE Graph: https://docs.datastax.com/en/dse/5.1/dse-dev/datastax_enterprise/graph/dgl/dglOverview.html
JanusGraph: https://docs.janusgraph.org/0.2.0/bulk-loading.html

I suppose there are others, but hopefully those examples convey the point.
Of those I mentioned, perhaps the JanusGraph one is a bit of a stretch as
its documentation references hadoop-gremlin which I presume means BLVP.
Maybe someone on JanusGraph can comment a bit further.

In addition to graph providers having their own approaches to bulk loading,
I tend to find that BLVP is always a question mark for users. They tend to
have problems getting it working right and we really haven't done much to
improve its usage.

So, given all that, would it be a bad idea to get TinkerPop out of the
business of trying to generalize bulk loading? If we did, that would be one
less feature to support and we could arguably recommend to users a better
experience by instructing them to use the bulk loader of their graph of
choice. I suppose that the downside to taking this stance would be that
graph providers that don't provide bulk loaders couldn't rely on TinkerPop
anymore for this need (JanusGraph? others?). Finally, users would not have
a single general way to bulk load to any graph implementation. Perhaps
there is a way to do that without BLVP in place?

Re: [DISCUSS] Bulk Loading

Posted by Stephen Mallette <sp...@gmail.com>.
As there haven't been any concerns or other ideas raised, I've created this
issue to further track this work:

https://issues.apache.org/jira/browse/TINKERPOP-1985

In summary, we will:

1. Deprecate BulkLoaderVertexProgram as of 3.2.10 (it will not be removed for
3.4.0 at this point)
2. Remove user docs regarding TinkerPop bulk loading capabilities
3. Author new provider documentation around writing custom
Input/OutputFormat implementations for use with BulkDumperVertexProgram



On Tue, Jun 12, 2018 at 6:41 AM Stephen Mallette <sp...@gmail.com>
wrote:

> That's a nice idea. I think that gets us most of the way out of bulk
> loading while still providing a method for providers who want an "easy" way
> to offer a bulk loader. I sense that we wouldn't directly promote this
> feature to users and we would leave it to graph providers to present that
> information in their own documentation. I think we'd just write up
> something in Provider Documentation so that folks are aware of how it works
> and what they need to do to take advantage of it.

Re: [DISCUSS] Bulk Loading

Posted by Stephen Mallette <sp...@gmail.com>.
That's a nice idea. I think that gets us most of the way out of bulk
loading while still providing a method for providers who want an "easy" way
to offer a bulk loader. I sense that we wouldn't directly promote this
feature to users and we would leave it to graph providers to present that
information in their own documentation. I think we'd just write up
something in Provider Documentation so that folks are aware of how it works
and what they need to do to take advantage of it.



On Thu, Jun 7, 2018 at 12:32 PM Daniel Kuppitz <me...@gremlin.guru> wrote:

> IMO it would be best if graph providers would implement a GraphOutputFormat
> for their graph implementation. This way we could rely on
> BulkDumperVertexProgram [1,2], which only relies on an InputFormat and an
> OutputFormat and thus can be seen as kind of a [Copy|Clone]VertexProgram.
> If that's not an option, then graph providers could still create their own
> VP that is optimized to handle transactions, id assignments, etc. properly
> in the underlying graph DB implementation.
>
> [1]
> http://tinkerpop.apache.org/docs/current/reference/#bulkdumpervertexprogram
> [2]
> https://github.com/apache/tinkerpop/blob/master/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/computer/bulkdumping/BulkDumperVertexProgram.java
>
> Cheers,
> Daniel

Re: [DISCUSS] Bulk Loading

Posted by Daniel Kuppitz <me...@gremlin.guru>.
IMO it would be best if graph providers would implement a GraphOutputFormat
for their graph implementation. This way we could rely on
BulkDumperVertexProgram [1,2], which only relies on an InputFormat and an
OutputFormat and thus can be seen as kind of a [Copy|Clone]VertexProgram.
If that's not an option, then graph providers could still create their own
VP that is optimized to handle transactions, id assignments, etc. properly
in the underlying graph DB implementation.

[1]
http://tinkerpop.apache.org/docs/current/reference/#bulkdumpervertexprogram
[2]
https://github.com/apache/tinkerpop/blob/master/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/computer/bulkdumping/BulkDumperVertexProgram.java

Cheers,
Daniel
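
The reader/writer split described above can be sketched roughly as follows.
This is only a toy model, written in Python for brevity: it uses neither
TinkerPop's nor Hadoop's actual API (the real InputFormat and OutputFormat
are Java interfaces). Every name in it (VertexRecord, GraphReader,
GraphWriter, ListReader, BatchingWriter, copy_graph) is hypothetical. The
point it tries to show is that the copy step stays generic, while a provider
supplies only the writer and keeps its own concerns, such as batching or
transaction handling, inside it.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Iterator, List, Protocol, Tuple


@dataclass
class VertexRecord:
    """One vertex plus its outgoing edges (hypothetical record shape)."""
    id: int
    label: str
    properties: Dict[str, object] = field(default_factory=dict)
    out_edges: List[Tuple[str, int]] = field(default_factory=list)  # (edge label, target id)


class GraphReader(Protocol):
    """Plays the role of an input format: yields vertices from some source."""
    def read(self) -> Iterator[VertexRecord]: ...


class GraphWriter(Protocol):
    """Plays the role of a provider-specific output format."""
    def write(self, vertex: VertexRecord) -> None: ...
    def close(self) -> None: ...


class ListReader:
    """Stand-in for a file-based reader; serves vertices from memory."""
    def __init__(self, vertices: Iterable[VertexRecord]) -> None:
        self._vertices = list(vertices)

    def read(self) -> Iterator[VertexRecord]:
        return iter(self._vertices)


class BatchingWriter:
    """Stand-in for a provider's writer: buffers vertices and 'commits' them in batches."""
    def __init__(self, batch_size: int = 2) -> None:
        self.batch_size = batch_size
        self._buffer: List[VertexRecord] = []
        self.committed_batches: List[List[VertexRecord]] = []

    def write(self, vertex: VertexRecord) -> None:
        self._buffer.append(vertex)
        if len(self._buffer) >= self.batch_size:
            self._flush()

    def close(self) -> None:
        # Commit whatever is left in the buffer.
        if self._buffer:
            self._flush()

    def _flush(self) -> None:
        self.committed_batches.append(list(self._buffer))
        self._buffer.clear()


def copy_graph(reader: GraphReader, writer: GraphWriter) -> int:
    """Generic copy step: knows nothing about either side beyond the two interfaces."""
    count = 0
    try:
        for vertex in reader.read():
            writer.write(vertex)
            count += 1
    finally:
        writer.close()
    return count
```

In this sketch, swapping BatchingWriter for another class with the same two
methods changes the target without touching copy_graph, which is the
property that would make a dumper-style program behave like a copy/clone
program.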

