Posted to dev@atlas.apache.org by Suma Shivaprasad <su...@gmail.com> on 2016/12/05 22:32:42 UTC

Re: Atlas DSL Performance Idea

+dev@

Thanks Jeffrey for the detailed explanation. Looking forward to your patch.

Thanks
Suma


On Mon, Dec 5, 2016 at 8:29 AM, Jeffrey N Hagelberg <jn...@us.ibm.com>
wrote:

> Hi Suma,
>
>
>
> I did experiment with the union step.  That did not make a difference.  If
> that step was implemented by literally splitting the traversals and then
> joining them back together, it might work.  However, that is not how it
> works today.  Basically, in TP3 there is a “HasContainer” mechanism that
> the query engine uses to gather all of the has predicates into an initial
> graph query that gets run to get the start vertices for the traversal.
> Once this initial graph query is done, the rest of the traversal is done in
> memory using the vertices that were found as the starting point.  In “and”,
> “or”, “where”, “union”, and other similar steps, the arguments of the steps
> are actually child traversals.  In the general case, they could be
> arbitrary subqueries that are not at all connected to the parent query.
> Because of this, the HasContainer mechanism cannot use the has steps within
> “union”, “and”, “or”, etc.  This is why rewriting the queries is
> necessary.  Ideally the gremlin optimizer would do that, but since it does
> not, we are doing it.  Figuring out when it is safe to inline these
> traversals is not an easy problem to solve.  There are some common cases
> that Atlas uses (which I outlined in my earlier email) where I’ve figured
> out that it is safe to do this inlining, but there are many others where it
> is not.  I’m sure that as we move forward we’ll identify other cases as
> well.  The optimizations I have put in are all centered around making the
> HasContainer mechanism be able to find as many of the has predicates as it
> can so that the initial graph query does as much filtering as possible.
> Like I said, I’ve seen pretty good results with this approach.  It does
> work with all types of has predicates, not just =.   I’m also not sure that
> we necessarily need to get rid of fill.  Even if gremlin took care of the
> optimizing, it would still need to do something similar and store
> the matching vertices in memory while they move through the graph
> traversal.  The combining of the vertices from the various “or” conditions
> would still need to take place.  We’re just making it more explicit.  There
> may be some cases where the OR could be pushed into the initial indexed
> query.  That would require making pretty significant changes to gremlin
> though.
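>
> To make the inlining concrete, here is a minimal sketch (illustrative
> only, using property keys that appear elsewhere in this thread; this is
> not code from the patch):
>
> // Before: has('name','x') is hidden inside the child traversal of
> // and(), so the HasContainer mechanism cannot see it and the initial
> // graph query filters on __typeName alone.
> g.V().has('__typeName','hive_table').and(has('name','x'))
>
> // After inlining: both predicates are top-level has steps, so the
> // initial indexed graph query can apply them together. Non-equality
> // predicates work too, e.g. has('createTime', gt(someValue)).
> g.V().has('__typeName','hive_table').has('name','x')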
>
>
>
> I am also adding a flag that will allow the optimization to be turned
> on/off.  I was planning to make this a global flag, but I can definitely
> have different flags for the or/and optimizers.
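>
> As a sketch of what that configuration could look like (these property
> names are hypothetical, not final):
>
> # hypothetical atlas-application.properties entries
> atlas.query.gremlin.optimizer.enabled=true
> atlas.query.gremlin.optimizer.expandOrs=true
> atlas.query.gremlin.optimizer.expandAnds=true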
>
>
>
> -Jeff
>
>
>
> *From:* Suma Shivaprasad [mailto:sumasai.shivaprasad@gmail.com]
> *Sent:* Friday, December 02, 2016 8:17 PM
> *To:* Jeffrey N Hagelberg <jn...@us.ibm.com>
> *Cc:* Shwetha Shivalingamurthy <ss...@hortonworks.com>; Dave
> Kantor <dk...@us.ibm.com>; Fnu Neerju <gu...@us.ibm.com>; Madhan
> Neethiraj <mn...@hortonworks.com>; Hemanth Yamijala <
> hyamijala@hortonworks.com>
>
> *Subject:* Re: Atlas DSL Performance Idea
>
>
>
> Hi Jeff,
>
>
>
> Appreciate the detailed explanation on the optimizations.  The approach
> looks quite flexible/extensible in that we can easily add other
> optimizations to it if needed.
>
> Will the optimization kick in for comparison predicates like < and > as
> well? I don't see an issue with the AND optimizations since the
> predicates can be applied inline.
>
>
>
> There could still be some issues with using fill in intermediate
> optimization steps like OR when a step produces a lot of results. This
> could occur when there are comparison operators like < or >, or when
> there are equality predicates on attributes that match a lot of values,
> e.g. country or Asset.owner (an example query could be Asset where
> (owner='x' or 'createTime' > 'y'), which results in a lot of values for
> both owner and createTime). From my tests, I can clearly see that it
> fills the whole set of matching vertices into a collection and then
> pushes the remaining predicates on top of this.
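>
> To illustrate the concern, here is a rough sketch of the fill-based form
> that example could take (illustrative only, not the exact generated
> gremlin):
>
> def r = [] as Set;
> // each branch may match a very large number of vertices, and all of
> // them are materialized into r before any further processing happens
> g.V().has('__typeName','Asset').has('Asset.owner','x').fill(r);
> g.V().has('__typeName','Asset').has('Asset.createTime', gt('y')).fill(r);
> r;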
>
>
>
> Do you think we could instead use a copySplit and merge in Gremlin, which
> could avoid the fill step? I'm not sure if it would break indexing though
> (it looks like it might, based on your 'and' optimization step of moving
> the simple ands out into direct predicates). There have also been some
> issues reported with copySplit
> (https://groups.google.com/forum/#!topic/gremlin-users/6_MRJxBnivo), and
> it has been replaced by union in TP3. Unfortunately, if we find issues
> with copySplit too, I'm not sure what else we can consider here.
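>
> For reference, a minimal sketch of the two constructs being discussed,
> assuming standard TP2 and TP3 syntax:
>
> // TP2: copySplit sends each traverser down every branch, then merges
> g.V().copySplit(_().has('owner','x'), _().has('createTime', T.gt, 'y')).fairMerge()
>
> // TP3: union is the replacement step; its arguments are child traversals
> g.V().union(__.has('owner','x'), __.has('createTime', gt('y')))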
>
>
>
> If we can't figure out a way to do it efficiently for ORs without
> affecting stability or causing memory issues, can we make the OR
> optimizations configurable? When we switch to TP3, I guess it will be a
> lot easier to enable this with indexing.
>
>
>
> Thanks
>
> Suma
>
>
>
> On Fri, Dec 2, 2016 at 3:26 PM, Jeffrey N Hagelberg <
> jnhagelberg@us.ibm.com> wrote:
>
> Hi Suma,
>
>
>
> In the current approach, we’re not exactly getting rid of the collection
> variable.  What we are doing is using it differently.  We’ve put in an
> optimizer that does post-processing of the GroovyExpression generated by
> the DSL translator.  It takes that expression and recursively expands any
> top-level “or” and “and” expressions.  I actually just wrote some long Javadoc
> comments describing the “or” and “and” optimization logic and what cases it
> handles.  I’ve included it below.  Basically, what we’ve done is changed
> the type check to use an “or”, and let the logic described below handle it
> along with all of the other stuff.
>
>
>
> The end result is queries that look kind of like this:
>
>
>
> import java.util.function.Function;
> import org.apache.tinkerpop.gremlin.process.traversal.strategy.decoration.PartitionStrategy;
> import static org.apache.tinkerpop.gremlin.process.traversal.P.*;
>
> def g=GraphTraversalSource.build().with(PartitionStrategy.build().partitionKey('__tenant_id').addReadPartition('omas-17').writePartition('omas-17').create()).create(graph);
> L:{def r=(([]) as Set);
> def f1={GraphTraversal x->x.as('inst').has('Referenceable.qualifiedName',eq('DataSet~ds-400002')).as('_src1').select('inst','_src1').fill(r)};
> f1(g.V().has('__typeName','OMAS_OMRSAsset'));
> f1(g.V().has('__superTypeNames','OMAS_OMRSAsset'));
> __.inject(((r) as Object[])).as('__tmp').map({((Map)it.get()).get('inst')}).as('inst').select('__tmp').map({((Map)it.get()).get('_src1')}).as('_src1').select('inst').by((({[((((it) as Vertex[])) as List<Vertex>).last(), ((String)((((it) as Vertex[])) as List<Vertex>).last().property('OMAS_OMRSCommonObject.name').orElse(null))]}) as Function)).toList()}
>
>
>
> (Note that the partition strategy is part of our multi-tenancy
> implementation)
>
>
>
> So we’re not removing the intermediate variable.  What we are doing is
> making it so that it will contain less stuff, and pushing as many of the
> conditions as possible into the initial queries so that they will use an
> index.  For our “Create Data Set” use case, this brought the time for
> creating 1 DataSet in an environment with 6000 existing data sets down from
> around 20 seconds to about 1-4 seconds.  Most of that time was spent
> executing DSL queries.
>
>
>
> *And Optimization Logic*
>
>
>
> The “and” optimization logic pulls expressions out of the “and” into the
> main expression wherever possible. For example:
>
> g.V().and(has('x'),has('y'))
>
> is optimized to:
>
> g.V().has('x').has('y')
>
> There are certain cases where it is not safe to move an expression out of
> the 'and'. For example, in the expression
>
> g.V().and(has('x').out('y'),has('z'))
>
> has('x').out('y') cannot be moved out of the 'and', since it changes the
> value of the traverser. At this time, the ExpandAndsOptimizer is not able
> to handle this scenario, so we don't extract that expression. In this
> case, the result is:
>
>
>
> g.V().has('z').and(has('x').out('y'))
>
>
>
>
>
> *Or Optimization Logic*
>
>
>
> We recursively remove 'or' expressions from a graph traversal when
> possible and replace them with separate calls that are combined using a
> logical union operation.
>
> Unfortunately, Gremlin does not use indices when executing the child graph
> traversals associated with an 'or' call. In order to make the index be
> used, we split queries with 'or' expressions into multiple queries. These
> queries are executed individually, using indices, and then the results are
> combined back together. Here is a simple example to illustrate this:
>
> *Original Query*
>
> g.V().or(has('name','Fred'),has('age','17'))
>
> *Optimized Query*
>
> def r = [] as Set;
>
> g.V().has('name','Fred').fill(r);
>
> g.V().has('age','17').fill(r);
>
> r;
>
> Here, we introduce an intermediate variable "r" which is declared as a
> Set. The Set is performing the union for us. If there are vertices that
> happen to both have "Fred" as the name and "17" as the age, the second
> query execution will prevent a duplicate vertex from being added to the
> result. Recall that in Groovy scripts, the last expression is the one that
> will be returned back to the caller. We refer to that expression as the
> "result expression". For this example, the result expression is simply
> "r", which contains the vertices that matched the query.
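>
> (A trivial sketch of the Groovy behavior being relied on here:)
>
> def r = [] as Set;
> r.add(1);
> r;  // a Groovy script evaluates to its last expression, so the caller gets r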
>
> If the query does any kind of transformation of the vertices to produce
> the query result, that needs to be done in the result expression. To
> understand why that is, let's take a look at another example:
>
> *Original Query*
>
> g.V().or(has('name','Fred'),has('age','17')).as('person').select('person').by('gender')
>
> *Incorrect Optimized Query*
>
> def r = [] as Set;
>
> g.V().has('name','Fred').as('person').select('person').by('gender').fill(r)
>
> g.V().has('age','17').as('person').select('person').by('gender').fill(r)
>
> r;
>
> The problem with this query is that now 'r' contains Strings (the gender
> of the person). Suppose that there is one person named Fred and there are
> 3 people whose age is 17 (let's say Fred's age is 16). The original query
> would have produced 4 rows, one corresponding to each of those people. The
> new query would produce at most 2 rows - one for 'male' and one for
> 'female'. This is happening because we are now performing the union on the
> Strings, not on the vertices. To fix this, we need to split the original
> query and put the end portion into the result expression:
>
> *Correct Optimized Query*
>
> def r = [] as Set;
>
> g.V().has('name','Fred').fill(r)
>
> g.V().has('age','17').fill(r)
>
> __.inject(r as Object[]).as('person').select('person').by('gender')
>
> There is one more problematic case that this optimizer is able to handle.
> Let's look at the following example:
>
> *Original Query*
>
> g.V().or(has('type','Person'),has('superType','Person')).as('x').has('qualifiedName','Fred').as('y').select('x','y').by('name').by('name')
>
> Queries of this form appear often when translating DSL queries.
>
> If we were to optimize this query using the logic described above, we
> would get something like this:
>
> *Incorrect Optimized Query*
>
> def r = [] as Set;
>
> g.V().has('type','Person').fill(r);
>
> g.V().has('superType','Person').fill(r);
>
> __.inject(r as Object[]).as('x').has('qualifiedName','Fred').as('y').select('x','y');
>
> While not strictly incorrect, this query will not perform well since the
> index on qualifiedName will not be used. In order for that index to be
> used, the 'has' expression needs to be part of the original query.
> However, if we do that alone, the query will be broken, since the select
> will now refer to an undefined label:
>
> *Incorrect Optimized Query*
>
> def r = [] as Set;
>
> g.V().has('type','Person').as('x').has('qualifiedName','Fred').fill(r);
>
> g.V().has('superType','Person').as('x').has('qualifiedName','Fred').fill(r);
>
> __.inject(r as Object[]).as('y').select('x','y')
>
> To fix this, we need to save the values of the aliased vertices in the
> original query, and create labels in the result expression that refer to
> them. We do this as follows:
>
> *Correct Optimized Query*
>
> def r = [] as Set;
>
> g.V().has('type','Person').as('x').has('qualifiedName','Fred').as('y').select('x','y').fill(r);
>
> g.V().has('superType','Person').as('x').has('qualifiedName','Fred').as('y').select('x','y').fill(r);
>
> __.inject(r as Object[]).as('__tmp').map({((Map)it.get()).get('x')}).as('x').select('__tmp').map({((Map)it.get()).get('y')}).as('y').select('x','y').by('name').by('name')
>
> This is not pretty, but it appears to work. What ends up happening is that
> r gets populated with alias->Vertex maps. In the result expression, we
> make 'x' point to a step where the value in the traverser is the vertex
> for 'x', and we do the same thing for 'y'. The select('__tmp') step in the
> middle restores the value of the traverser back to the map.
>
> The one known issue with the alias rearrangement is that it breaks loop
> expressions. As a result, expressions containing loops are currently
> excluded from this optimization.
>
>
> Note that at this point I haven’t really investigated making this work
> with Titan 0.5.4, although I don’t know of any reason why the logic would
> not work there.
>
>
>
> *From:* Suma Shivaprasad [mailto:sumasai.shivaprasad@gmail.com]
> *Sent:* Friday, December 02, 2016 3:12 AM
> *To:* Shwetha Shivalingamurthy <ss...@hortonworks.com>
> *Cc:* Jeffrey N Hagelberg <jn...@us.ibm.com>; Dave Kantor <
> dkantor@us.ibm.com>; Fnu Neerju <gu...@us.ibm.com>; Madhan Neethiraj
> <mn...@hortonworks.com>; Hemanth Yamijala <hy...@hortonworks.com>
> *Subject:* Re: Atlas DSL Performance Idea
>
>
>
> Hi Jeff,
>
>
>
> Was interested in knowing what approach you are planning to take to
> resolve this. I am concerned with the collection variable (_var_0) that we
> use along with the gremlin fill function (Pipe.fill) that fills that
> collection variable. Based on my understanding of
> http://gremlindocs.spmallette.documentup.com/#pipefill, I suspect this
> could lead to a lot of vertices being loaded into this collection and then
> used for subsequent processing, leading to high memory usage. This is more
> concerning than the execution time when there are a lot of vertices for a
> given type or supertype.
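>
> For context, the pattern I am referring to looks roughly like this in the
> generated TP2 gremlin (a paraphrase, not the literal translator output):
>
> def _var_0 = [] as Set;
> // every vertex with a matching type or supertype is loaded into _var_0
> g.V().has('__typeName', T.eq, 'hive_table').fill(_var_0);
> g.V().has('__superTypeNames', T.eq, 'hive_table').fill(_var_0);
> // the remaining predicates are then evaluated in memory over _var_0
> _var_0._().has('hive_table.name', T.eq, 'xx').toList();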
>
>
>
> There are composite indexes on the type+name and supertype+name
> combinations. So I think it should use the index if we rewrite the query
> without the collection variable. Please let us know what your findings on
> this are and what your current progress is, since we are facing a similar
> issue during our scale tests and would like to get this in as soon as
> possible.
>
>
>
> Thanks
>
> Suma
>
>
>
> On Wed, Nov 23, 2016 at 12:32 AM, Shwetha Shivalingamurthy <
> sshivalingamurthy@hortonworks.com> wrote:
>
> +Hemanth, Madhan
>
>
>
> Hi Jeff,
>
>
>
> As a general practice, we should always provide backward compatibility, as
> we don’t know which users are running Apache projects in production.
> Whether the project is incubator or TLP doesn’t matter. If breaking
> compatibility is really required, we can send an email to the user/dev
> mailing list and check if any users object to the change. If no one
> objects, it’s probably ok to make the change. But I think we should avoid
> these changes as much as possible.
>
>
>
> As for this particular change, it’s better to have a flag to disable this
> by default, to make it backward compatible. After the data migration tool
> is available, we can enable this feature by default and provide steps for
> data migration.
>
>
>
> On another note on the gremlin query optimisation, I think changing
> predicate order should help as well. For example, DSL query ‘hive_table
> where name = "xx"’ is translated to (type = ‘hive_table’ OR super type ~
> ‘hive_table’) AND (hive_table.name = ‘xx’). The number of vertices with
> hive_table in the super type list will still be high, and there is no
> combined index on (type + name). After the first predicate, it’s still a
> full scan for name = ‘xx’ on the results of supertype = ‘hive_table’.
> Instead, if we re-order the predicates to (hive_table.name = ‘xx’) AND
> (type = ‘hive_table’ OR super type ~ ‘hive_table’), we might get better
> performance as the number of vertices with hive_table.name = ‘xx’ will be
> less. I haven’t tried this, but you can explore this option as well.
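>
> In gremlin terms, the reordering would look something like this (a sketch
> only; as noted, this hasn't been tried):
>
> // current order: type/supertype check first, then the name predicate
> g.V().or(has('__typeName','hive_table'), has('__superTypeNames','hive_table')).has('hive_table.name','xx')
>
> // proposed order: the more selective name predicate goes first
> g.V().has('hive_table.name','xx').or(has('__typeName','hive_table'), has('__superTypeNames','hive_table'))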
>
>
>
> Regards,
>
> Shwetha
>
>
>
>
>
> *From: *Jeffrey N Hagelberg <jn...@us.ibm.com>
> *Date: *Monday, 21 November 2016 at 8:13 PM
> *To: *Shwetha Shivalingamurthy <ss...@hortonworks.com>
> *Cc: *'Suma Shivaprasad' <su...@gmail.com>, Dave Kantor <
> dkantor@us.ibm.com>, Fnu Neerju <gu...@us.ibm.com>
> *Subject: *Atlas DSL Performance Idea
>
>
>
> Hi Shwetha,
>
>
>
> We’ve been taking a deep dive into the performance of the gremlin that is
> generated for executing the DSL queries.  We noticed that with IBM Graph,
> and Titan in general, “and” and “or” gremlin predicates do not use
> indices.  We saw that the gremlin queries in environments with a very large
> number of instances of the type being queried for were very slow, largely
> due to the use of the _var_0 intermediate variable that gets initially
> populated with all of the instances with the correct type/supertype.  We
> are looking at an optimization that adds the type name to the list of super
> types.  This enables us to just check the supertype property in the
> generated gremlin.  With this change, we’re seeing a very significant
> performance improvement, more than an order of magnitude, in the query
> execution time when there are a large number of existing entities of the
> type being queried.  However, our concern with this approach is that it
> breaks compatibility with existing Atlas installations.  Existing Atlas
> users would need to migrate their existing data to add the type name to the
> supertypes property in all entities.
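>
> A rough sketch of the difference, using the property keys that appear in
> the generated queries elsewhere in this thread:
>
> // today: a type check needs both the type and supertype properties, and
> // the 'or' form does not use an index
> g.V().or(has('__typeName','DataSet'), has('__superTypeNames','DataSet'))
>
> // with each type name added to its own supertypes list, a single
> // indexed has() step suffices
> g.V().has('__superTypeNames','DataSet')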
>
>
>
> Our first question is whether or not this is something we need to be
> concerned about.  I’m not entirely clear on what it means to be an
> “incubator” product and if that has implications for whether migration
> needs to be supported.  We’ve also discussed having a feature toggle which
> turns the new behavior off by default as an alternative to supporting data
> migration.  That would allow existing Atlas instances to continue to work.
> Once we have some migration utility (again, assuming we need one), they
> could then migrate their instance at their leisure and then turn on the
> optimization.  New Atlas users would then need to explicitly turn it on to
> get the performance benefit.
>
>
>
> I’m also curious if this is something you have looked at before and
> have any thoughts about.  Ideally we would like to get this same kind of
> performance improvement by changing the generated gremlin in a way that
> does not break compatibility with existing installs.  I’m not sure how that
> would be possible without fixing titan to support indexed “or” queries though.
>
>
>
> Thanks,
>
>
>
> -Jeff
>
>
>
> Jeffrey Hagelberg
>
> Software Engineer, IBM Analytics
>
> (978) 889-2055
>