You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@atlas.apache.org by "Graham Wallis (JIRA)" <ji...@apache.org> on 2017/06/14 13:26:00 UTC
[jira] [Commented] (ATLAS-1868) Highly inefficient DSL-queries

    [ https://issues.apache.org/jira/browse/ATLAS-1868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16049176#comment-16049176 ] 

Graham Wallis commented on ATLAS-1868:
--------------------------------------

Hi Christian,

This is a really interesting observation. It clearly makes sense to start the traversal at the indexed vertex type.

I decided to experiment with Atlas 0.9 and then did some experimentation with Titan 1.0 (directly, without Atlas). I then experimented with declarative-style gremlin queries and finally explored the Titan index APIs to see if I could think of a way to implement an automated optimization. The results were as follows:

h2. Experiment with Atlas 0.9
I tried the DSL query below with Atlas 0.9 and it produced the following gremlin - which I think is similar in structure to Christian's example (from Atlas 0.7). It uses a local variable to collate results but it still starts the traversal at the non-indexed end of the edge, instead of reversing it.

DSL query: 
Table where columns.name="customer_id"

Translated Gremlin Query: 
{
  def r=(([]) as Set);
  def f1={
    GremlinPipeline x->x.as('a0').out('__Table.columns').has('Column.name',T.'eq','customer_id').
      back('a0').as('__res') [0..<25].select(['a0', '__res']).fill(r)
    };
  f1(g.V().has('__typeName','Table'));
  f1(g.V().has('__superTypeNames','Table'));
  r._().as('__tmp').
    transform({((Row)it).getColumn('a0')}).as('a0').back('__tmp').
    transform({((Row)it).getColumn('__res')}).as('__res') [0..<25].toList()
}

This translation was on Atlas 0.9 using Titan 0.5.4. I did not try it with Titan 1.0 because there are parts of Atlas that either do not support Java 8 yet or have hard dependencies on Tinkerpop 2.

h2. Experiments directly on Titan 1.0
I decided to do some exploration on Titan 1.0 - because I wanted to see if Titan's gremlin optimization would handle this automatically. The answer is it doesn't seem to. I populated a Titan 1.0 graph with 100K 'Person' vertices that each have one 'owns' edge to one of 100K 'Asset' vertices. i.e. there are 100K triples of Person->owns->Asset. I created a composite index on Asset.assetid, with indexOnly() of vertices with label 'Asset' on the basis that different vertex types may have properties with similar keys, so we would probably need a label-specific index).

conf=new BaseConfiguration()
graph = TitanFactory.open('../conf/titan-berkeleyje.properties')

mgmt=graph.openManagement()
assetid = mgmt.makePropertyKey('assetid').dataType(String.class).make()
asset=mgmt.makeVertexLabel('Asset').make()
mgmt.buildIndex('assetById',Vertex.class).addKey(assetid).indexOnly(asset).buildCompositeIndex()
mgmt.commit()


for (i in 1..100000) {
  p=graph.addVertex(label,'Person','name',i)
  a=graph.addVertex(label,'Asset','assetid','asset-num-'+i)
  p.addEdge('owns',a)
}
graph.tx().commit()

I could then issue queries like the following:

// 'Forward' query
t=System.currentTimeMillis();g.V().hasLabel('Person').out('owns').hasLabel('Asset').has('assetid','asset-num-7').iterate();System.currentTimeMillis()-t

// 'Reversed' query
t=System.currentTimeMillis();g.V().hasLabel('Asset').has('assetid','asset-num-7').in('owns').hasLabel('Person').iterate();System.currentTimeMillis()-t


Results for DSL-style gremlin versus reversed traversal: 2773ms versus 2ms, i.e. ~ x1300 faster when reversed. This result compares closely with Christian's results in which he saw ~ x1100 faster performance.

h2. Experiments with declarative queries
I experimented to see if I could get Titan's gremlin optimizer to translate the query into the obviously more efficient traversal, following Christian's example. Hoewever, even a declarative query, with the sub-traversals in any order, is not optimized:

gremlin> g.V().match(
......1> __.as('b').hasLabel('Asset').has('assetid','asset-num-38'),
......2>  __.as('a').hasLabel('Person'),
......3> __.as('a').out('owns').as('b')).
......4>  select('a','b').by('name').by('assetid')
04:20:08,513  WARN StandardTitanTx:1262 - Query requires iterating over all vertices [(~label = Person)]. For better performance, use indexes
==>[a:38,b:asset-num-38]

You have to explicitly reverse the edge traversal to get good performance, as in the following:

gremlin> g.V().match(
......1> __.as('b').hasLabel('Asset').has('assetid','asset-num-38'),
......2>  __.as('a').hasLabel('Person'),
......3> __.as('b').in('owns').as('a')).
......4>  select('a','b').by('name').by('assetid')
==>[a:38,b:asset-num-38]


h2. Exploration of Titan's index APIs
If Titan cannot perform this optimization, it looks as if Atlas would have to implement it, so I looked at Titan's TitanGraphIndex APIs to see how much Atlas would be able to derive generically, but I'm currently of the opinion that the Titan APIs do not provide enough to support a fully programmatic query optimization in Atlas. The alternative appears to be Atlas capitalizing on its own knowledge of the graph schema, indexes and stats. I have not explored that area yet.



> Highly inefficient DSL-queries
> ------------------------------
>
>                 Key: ATLAS-1868
>                 URL: https://issues.apache.org/jira/browse/ATLAS-1868
>             Project: Atlas
>          Issue Type: Bug
>          Components:  atlas-core
>    Affects Versions: 0.7-incubating
>         Environment: linux, hbase + solr configuration.
>            Reporter: Christian R
>              Labels: dsl, gremlin
>
> The DSL query 'mytype where property.id = "id1"' appears to be rewritten as a gremlin query that resembles:
> g.V.has(typename, 'mytype'¨).as(x).out('property').has('id', 'id1').back('x')
> On our system this query takes 6-7 minutes. The query
> g.V.has('id', 'id1').in('property').has('typename', 'mytype')
> takes 350 milliseconds.
> Our graph:
> g.V.count() = 1359151
> We have atlas 0.7 installed. I've compiled the latest 0.9 code and looked at the generated gremlin query as reported in the logs for the same DSL-query, and I think 0.9 has the same performance issues. Unfortunately I don't have a big graph on a 0.9 installation to test performance. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)