You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@tinkerpop.apache.org by Daniel Kuppitz <me...@gremlin.guru> on 2015/02/11 20:17:29 UTC

[TinkerPop3] Strategy to optimize count()-filters

Hello hackers,

now that we have the IsStep, filters by count() seem to become a
common pattern, e.g.


// find people with no friendsg.V().has(both("friend").count().is(0L))
// poor guys


But what's wrong with the query? The problems arise when you have
people with a lot of friends, 100's or 1.000's. The traverser will
fetch all of the friends, count them and then decide whether .is(0)
evaluates true or not. But in fact we only need to find 1 friend to
tell that .is(0) will evaluate false. That's why we introduced a new
strategy: RangeByIsCountStrategy
<https://github.com/apache/incubator-tinkerpop/blob/master/gremlin-core/src/main/java/com/tinkerpop/gremlin/process/graph/traversal/strategy/RangeByIsCountStrategy.java>

What does it do? The solution to the aforementioned problem is simple,
we just need to add a RangeStep on front of count() and fetch not more
elements than ultimately needed:


g.V().has(both("friend").limit(1).count().is(0L))


And that's exactly what the strategy does for you. Here's an example:


gremlin> g = TinkerGraph.open()==>tinkergraph[vertices:0
edges:0]gremlin>
g.io().readGraphML('data/grateful-dead.xml')==>nullgremlin> t =
g.V().count().is(0L); null==>nullgremlin> t.toString() // strategies
not yet applied==>[GraphStep(vertex), CountStep, IsStep(eq,0)]gremlin>
t.iterate()gremlin> t.toString() // strategies
applied==>[TinkerGraphStep(vertex), *RangeStep(0,1)*, CountStep,
IsStep(eq,0)]


We can use the profile()-step to show the significant effect. Let's
first check the metrics for the traversal if the strategy is not
applied:


gremlin> t = g.V().count().is(0L).profile().cap(TraversalMetrics.METRICS_KEY);
null==>nullgremlin>
t.strategies.removeStrategies(com.tinkerpop.gremlin.process.graph.traversal.strategy.RangeByIsCountStrategy.class)==>strategies[DedupOptimizerStrategy,
IdentityRemovalStrategy, SideEffectCapStrategy, MatchWhereStrategy,
ComparatorHolderRemovalStrategy, ReducingStrategy,
LabeledEndStepStrategy, EngineDependentStrategy,
SideEffectRegistrationStrategy, ProfileStrategy,
TraversalVerificationStrategy, ConjunctionStrategy,
TinkerGraphStepStrategy]gremlin> t==>Traversal Metrics
       Step                 Count  Traversers       Time (ms)    % Dur
          GraphStep(vertex)                   808         808
62.980    52.32                   CountStep                     1
     1          56.479    46.92                IsStep(eq,0)
         0           0           0.615
0.51SideEffectCapStep([~metri...                     1           1
      0.304     0.25                       TOTAL                     -
          -         120.380        -


And again, now with the new strategy applied (which is the default behavior):


gremlin> t = g.V().count().is(0L).profile().cap(TraversalMetrics.METRICS_KEY)==>Traversal
Metrics                        Step                 Count  Traversers
     Time (ms)    % Dur           GraphStep(vertex)
 2           2           0.386    36.16              RangeStep(0,1)
                 1           1           0.162    15.19
   CountStep                     1           1           0.147
13.83                IsStep(eq,0)                     0           0
       0.101     9.49SideEffectCapStep([~metri...
1           1           0.270    25.33                       TOTAL
                -           -           1.068        -


As you can see, instead of 808 traversers who's job is to waste CPU
cycles (and thus a lot of time), we end up with only 2 traversers to
get the result we're looking for.
Btw. the strategy is applied to all occurences of
.count().is(<predicate>, <value>). The only exception is if another
RangeStep already precedes the CountStep; in this case the strategy
won't do anything.

Cheers,
Daniel