You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@tinkerpop.apache.org by Marko Rodriguez <ok...@gmail.com> on 2016/10/07 12:54:06 UTC

The Beauty of Bulking -- Redux'prise

Hello,

Many moons ago, I wrote an email to this list called “The Beauty of Bulking” which discussed the concept of traverser bulking.

This notion is perhaps the most important aspect of the Gremlin traversal machine. It is what makes OLAP scale, speedy complex path-calculations plausible, and runtimes on reverberant traversals extremely fast.

Bulking can be summed up easily:

1. If you have a stream of numbers, say [6, 5, 4, 4, 4, 6] you can map() those numbers by *10 and thus do 6 arithmetic operations to yield [60, 50, 40, 40, 40, 60].
2. However, if stream ordering does not matter (an important constraint to lift!), then you could get the “same” result with only 3 arithmetic operations and yield [60, 60, 50, 40, 40, 40].
3. In short, “bulk” your objects — 6(2), 5(1), 4(3). Then you only need to execute *10 three times.

Prior to this moment, the beauty of bulking has made itself apparent in every OLAP job executed as well as in repeat(), barrier(), and the epic work from Ted Wilmes on PathRetractionStrategy.

Kuppitz and I are adding to this subtle beauty with a rewrite of LazyBarrierStrategy and its inclusion into the default TraversalStrategies that are inherited (typically) by all graph system providers.

Lets look at what we have done to master/ and what you can expect in TinkerPop 3.2.3.

graph = TinkerGraph.open()
graph.io(gryo()).readGraph('data/grateful-dead.kryo')
g1 = graph.traversal().withoutStrategies(LazyBarrierStrategy.class)
g2 = graph.traversal()

From g1 (no beauty) and g2 (beautiful), lets see how various traversals perform.

gremlin> clock(10){g1.V().out().in().out().count().iterate()}
==>1073.5150449999999
gremlin> clock(10){g2.V().out().in().out().count().iterate()}
==>8.8109061

Deep traversals with lots of intersections do the best. A 1 second query is dropped to 8 milliseconds. Thats a 100x+ speed improvement.

gremlin> clock(1000){g1.V().out().values('name').iterate()}
==>1.1394026739999998
gremlin> clock(1000){g2.V().out().values('name').iterate()}
==>1.030407659

Not that much intersection above, but we still get a little bump in performance.

**DRUM ROLL — THE PHOENIX RISES FROM THE TEXTS OF EDFU**

Lets get practical. 

Graph analysis is all about “reverberation.” What does that mean? — look at any graph algorithm. Its not about just “getting data," its about seeing how paths intersect/overlay/loop amongst themselves and each other. Paths tell you a lot about the “structure” of your data — topological statistics. PageRank, Betweenness, Eccentricity, Eigenvectors, Recommendations, Spreading Activation, …. these algorithms are all about the paths, not “the data." Lets look at a classic Amazon.com <http://amazon.com/>-style recommendation algorithm:

	“Person ‘a' likes X. X are liked by people B. People B like Y. Recommend ‘a’ items from Y that are not in X.”

g.V(89).out('followedBy').aggregate('a').
  in('followedBy').out(‘followedBy').
    where(not(within('a'))).
  groupCount().by('name')

Okay. So lets do that walk on our toy graph, where Person ‘a’ is v[89].

gremlin> clock(100){g1.V(89).out('followedBy').aggregate('a').in('followedBy').out('followedBy').where(not(within('a'))).groupCount().by('name').iterate()}
==>23.73006225
gremlin> clock(100){g2.V(89).out('followedBy').aggregate('a').in('followedBy').out('followedBy').where(not(within('a'))).groupCount().by('name').iterate()}
==>1.4410725199999999

So there you have it. The classic, “everyone and their mother does it”-traversal went from 23ms to 1.4ms. A 20x performance improvement on a practical everyday example.

gremlin> g2.V(89).out('followedBy').aggregate('a').in('followedBy').out('followedBy').where(not(within('a'))).groupCount().by('name').explain()
==>Traversal Explanation
================================================================================================================================================================================================================================================================================================================
Original Traversal                 [GraphStep(vertex,[89]), VertexStep(OUT,[followedBy],vertex), AggregateStep(a), VertexStep(IN,[followedBy],vertex), VertexStep(OUT,[followedBy],vertex), WherePredicateStep(without([a])), GroupCountStep(value(name))]
...
LazyBarrierStrategy          [O]   [GraphStep(vertex,[89]), VertexStep(OUT,[followedBy],vertex), AggregateStep(a), VertexStep(IN,[followedBy],vertex), NoOpBarrierStep(10000), VertexStep(OUT,[followedBy],vertex), NoOpBarrierStep(10000), WherePredicateStep(without([a])), GroupCountStep(value(name))]
...
Final Traversal                    [TinkerGraphStep(vertex,[89]), VertexStep(OUT,[followedBy],vertex), AggregateStep(a), VertexStep(IN,[followedBy],vertex), NoOpBarrierStep(10000), VertexStep(OUT,[followedBy],vertex), NoOpBarrierStep(10000), WherePredicateStep(without([a])), GroupCountStep(value(name))]

Word to your mutha,
Marko.

http://markorodriguez.com