You are viewing a plain text version of this content. The canonical link for it is here.

Posted to users@jena.apache.org by Osma Suominen <os...@helsinki.fi> on 2016/11/01 09:03:05 UTC

Placement of VALUES block affects performance in 3.1.1

Hi,

I'm investigating a performance regression we're seeing with the current 
Jena 3.1.1-SNAPSHOT compared to 3.1.0.

The data in graph <http://www.yso.fi/onto/yso/> is the YSO ontology, 
available from http://api.finto.fi/download/yso/yso-skos.ttl

This query used to take about 0.2 seconds (with 3.1.0) and now takes 
about 10 seconds:

--cut--
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT *
FROM NAMED <http://www.yso.fi/onto/yso/>
WHERE {
   ?uri ?p ?o .
   OPTIONAL {
     ?x skos:member ?o .
     FILTER NOT EXISTS {
       ?x skos:member ?other .
       FILTER NOT EXISTS {
         ?other skos:broader ?uri
       }
     }
   }
   VALUES ?uri { <http://www.yso.fi/onto/yso/p864> }
}
--cut--

If I move the VALUES block to the top of the query, right after WHERE, 
then the query becomes fast again.

Is the placement of the VALUES block supposed to affect query evaluation 
order in this way? It appears to me that in the slow version, ?uri is 
not bound inside the inner FILTER NOT EXISTS, which causes an explosion 
of results internally.

-Osma

-- 
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suominen@helsinki.fi
http://www.nationallibrary.fi

Re: Placement of VALUES block affects performance in 3.1.1

Posted by Osma Suominen <os...@helsinki.fi>.

02.11.2016, 13:08, Osma Suominen wrote:

> It's possible that a MINUS expression could be used instead of FILTER
> NOT EXISTS and perform better. I will have to test this. But other than
> switching to MINUS, I can't think of any way to express this constraint
> on collections without using some kind of double negation.

I played with MINUS, but it was no better than FILTER NOT EXISTS.

>> Osma - could you please try putting the VALUES in each arm of the UNION
>> which gets you to something like the first example.
>
> I will try this as well.

This helped - a lot. The query time is practically the same now as it 
was with Jena 3.1.0. Thanks!

The original query, though, has four UNION branches. It seems a bit 
stupid to put the same VALUES statement (which may have 20 URIs or more) 
in each of those arms, but I will do that if it's the only way to get 
good performance with 3.1.1.

-Osma

-- 
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suominen@helsinki.fi
http://www.nationallibrary.fi

Re: Placement of VALUES block affects performance in 3.1.1

Posted by Osma Suominen <os...@helsinki.fi>.

Hi Andy!

Thanks a lot for looking into this and your very clear explanation.

01.11.2016, 21:12, Andy Seaborne wrote:
> It has always been inside-out then optimized to use stream based index
> joins.
>
> However in this case "inside out" is confusing because the query has a
> double negation of FILTER NOT EXIST.

Right. Let me explain why the query does this.

The query (or actually a longer variant thereof) is used in the Skosmos 
application to generate a list of search results. At the time this query 
is used the application already has a (possibly long) list of the SKOS 
concept URIs it wants to display. This query is used to look up 
additional information about those concepts.

Instead of performing a separate query for each concept URI, a single 
query with a VALUES block containing up to 20 concept URIs is used. So 
in this case, the VALUES is used thinking of it as a sort of for-each 
loop. A single query for 20 concepts is faster than performing 20 
queries for individual concepts - at least it used to be.

The reason for the double negation is this:

For each concept X, we want to display the SKOS Collections in which 
concept X is a member, but only if the collection consists only of 
siblings of X (i.e. sharing at least one broader concept with X).
In practice, this has to be turned around into a double negation: for 
each collection, check that it doesn't include a concept that doesn't 
share any parent with X.

It's possible that a MINUS expression could be used instead of FILTER 
NOT EXISTS and perform better. I will have to test this. But other than 
switching to MINUS, I can't think of any way to express this constraint 
on collections without using some kind of double negation.

> At 3.1.1 (JENA-1171), EXISTS are analysed whereas previous they were
> skipped which could lead to wrong answers.
>
> Osma - could you please try putting the VALUES in each arm of the UNION
> which gets you to something like the first example.

I will try this as well.

> It can be pushed in because:
>
> join(A, union(B,C)) == union(join(A,B), join(A,C))
>
> now if A is an complex expression, that is a bad idea (probably).
>
> If A is a small VALUES block then it makes sense.  It isn't done though.

Ok. So a potential future optimization perhaps.

-Osma

-- 
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suominen@helsinki.fi
http://www.nationallibrary.fi

Re: Placement of VALUES block affects performance in 3.1.1

Posted by Andy Seaborne <an...@apache.org>.


On 01/11/16 22:40, james anderson wrote:
> good evening;
>
>> On 2016-11-01, at 20:12, Andy Seaborne <an...@apache.org> wrote:
>>
>>
>>
>> On 01/11/16 12:38, Osma Suominen wrote:
>>> Hi,
>>> [about a query which exercises the interaction between a values clause and statement pattern at locations relatively remote in the query. ]
>>
>> [ the question leads to a point about the scope of the respective ?uri variables, ]
>>
>> The issue is [*], using the variable ?uri again inside an OPTIONAL.
>>
>> It is possible that ?uri will range over more than the VALUES setting and affect the OPTIONAL yet the inner EXISTS usage does not set ?uri and it is not propagated to be joined with the set value.
>>
>> [\u2026]
>>
>> [ \u2026 and to one about algebraic equivalence among expressions and the consequent latitude to relocate the values clause.]
>>
>> It can be pushed in because:
>>
>> join(A, union(B,C)) == union(join(A,B), join(A,C))
>>
>> now if A is an complex expression, that is a bad idea (probably).
>>
>> If A is a small VALUES block then it makes sense.  It isn't done though.
>>
>
> both of which are significant, but remain side matters to the principal issue, which concerns a reasonable expectation - despite that it may be subject to revision, that the values clause introduces bindings with indefinite scope.[1]

The performance issue is that

{ P1 OPTIONAL{P2} VALUES } != { VALUES P1 OPTIONAL{P2} }
but is instead:
{ P1 OPTIONAL{P2} VALUES } == { VALUES...{} { P1 OPTIONAL{P2} } }

but in this data it gets the same answers (presumably) and rather faster.

This is not really about EXISTS at all.  Other {P} will have the same 
effect.

The programming notion of scope is only an analogy for query languages. 
In a query language, same scope means same reference - ie. equivalence. 
"Same scope" in a query language means equality-inner-join at some later 
point to test they are the same value. Equating variables from 
sub-pattern evaluation is like SQL "natural join".

The optimizer is finding situations when equality can be executed as 
equivalence (c.f. index join).

In ARQ execution, variables are transformed by renaming apart early in 
optimization making static scoping a non-issue for the rule based 
optimizations.

	Andy

>
> as sparql plays in increasingly significant role in service implementations, library-level support for query parameters[2] needs to give way to protocol-level mechanisms,[3] whereby the language semantics needs more attention than it has yet received.
> both immediately federated request combinations and request chains which are mediated by a third service need to  propagate intermediate solution sets across requests.
> the use case behind this question is one example.
>
> the implementations for elementary variables[3] do not provide sufficient capacity - not to mention the deficiencies of their substitution models.
> the sparql federated query recommendation suggests a role which the values clause might play,[4] but does not provide an exact interpretation.
>
> has jena\u2019s community given this topic any thought?
>
> best regards, from berlin,
> - - -
> [1] : http://www.cs.cmu.edu/Groups/AI/html/cltl/clm/node43.html#SECTION00700000000000000000
> [2] : https://jena.apache.org/documentation/query/parameterized-sparql-strings.html
> [3] : http://docs.rdf4j.org/rest-api/#_repository_queries
> [4] : http://www.w3.org/TR/sparql11-federated-query/#values
>
>
> ---
> james anderson | james@dydra.com | http://dydra.com
>
>
>
>
>
>

Re: Placement of VALUES block affects performance in 3.1.1

Posted by james anderson <ja...@dydra.com>.

good evening;

> On 2016-11-01, at 20:12, Andy Seaborne <an...@apache.org> wrote:
> 
> 
> 
> On 01/11/16 12:38, Osma Suominen wrote:
>> Hi,
>> [about a query which exercises the interaction between a values clause and statement pattern at locations relatively remote in the query. ]
> 
> [ the question leads to a point about the scope of the respective ?uri variables, ]
> 
> The issue is [*], using the variable ?uri again inside an OPTIONAL.
> 
> It is possible that ?uri will range over more than the VALUES setting and affect the OPTIONAL yet the inner EXISTS usage does not set ?uri and it is not propagated to be joined with the set value.
> 
> […]
> 
> [ … and to one about algebraic equivalence among expressions and the consequent latitude to relocate the values clause.]
> 
> It can be pushed in because:
> 
> join(A, union(B,C)) == union(join(A,B), join(A,C))
> 
> now if A is an complex expression, that is a bad idea (probably).
> 
> If A is a small VALUES block then it makes sense.  It isn't done though.
> 

both of which are significant, but remain side matters to the principal issue, which concerns a reasonable expectation - despite that it may be subject to revision, that the values clause introduces bindings with indefinite scope.[1]

as sparql plays in increasingly significant role in service implementations, library-level support for query parameters[2] needs to give way to protocol-level mechanisms,[3] whereby the language semantics needs more attention than it has yet received.
both immediately federated request combinations and request chains which are mediated by a third service need to  propagate intermediate solution sets across requests. 
the use case behind this question is one example.

the implementations for elementary variables[3] do not provide sufficient capacity - not to mention the deficiencies of their substitution models.
the sparql federated query recommendation suggests a role which the values clause might play,[4] but does not provide an exact interpretation.

has jena’s community given this topic any thought?

best regards, from berlin,
- - -
[1] : http://www.cs.cmu.edu/Groups/AI/html/cltl/clm/node43.html#SECTION00700000000000000000
[2] : https://jena.apache.org/documentation/query/parameterized-sparql-strings.html
[3] : http://docs.rdf4j.org/rest-api/#_repository_queries
[4] : http://www.w3.org/TR/sparql11-federated-query/#values

---
james anderson | james@dydra.com | http://dydra.com

Re: Placement of VALUES block affects performance in 3.1.1

Posted by Andy Seaborne <an...@apache.org>.


On 01/11/16 12:38, Osma Suominen wrote:
> Hi,
>
> Some further observations. The query I sent earlier was a minimal
> example, and it was possible to fix it by just moving the VALUES block.
> But a slightly more realistic (closer to the original query I'm having
> problems with) example involves a UNION and cannot be fixed so easily -
> placing the VALUES block first doesn't help:
>
> --cut--
> PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
> SELECT *
> WHERE {
>   VALUES ?uri { <http://www.yso.fi/onto/yso/p864> }
>   { ?s ?p ?uri }
>   UNION
>   { ?uri ?p ?o
>   OPTIONAL {
>     ?x skos:member ?o .
>     FILTER NOT EXISTS {
>       ?x skos:member ?other .
>       FILTER NOT EXISTS {
>         ?other skos:broader ?uri
                              ^^^^^^ [*]
>       }
>     }
>   } }
> }
> --cut--
>
> Jena 3.1.0 tdbquery: 0.9 seconds
> Jena 3.1.1-SNAPSHOT tdbquery: 12.8 seconds
>
> I'm aware that in SPARQL, evaluation proceeds from the inside out and
> Jena ARQ has moved more and more in this direction with recent releases,
> which may also explain this change.

It has always been inside-out then optimized to use stream based index 
joins.

However in this case "inside out" is confusing because the query has a 
double negation of FILTER NOT EXIST.

At 3.1.1 (JENA-1171), EXISTS are analysed whereas previous they were 
skipped which could lead to wrong answers.

Osma - could you please try putting the VALUES in each arm of the UNION 
which gets you to something like the first example.


The issue is [*], using the variable ?uri again inside an OPTIONAL.

It is possible that ?uri will range over more than the VALUES setting 
and affect the OPTIONAL yet the inner EXISTS usage does not set ?uri and 
it is not propagated to be joined with the set value.

As wikipedia says for correlated subquery in SQL:
"Because the subquery is evaluated once for each row processed by the 
outer query, it can be inefficient."

> But how should VALUES blocks be
> placed for optimal query execution? It seems like a waste not to
> propagate those fixed bindings into inner parts of the query, even
> though that may violate the inside-out order.

It can be pushed in because:

join(A, union(B,C)) == union(join(A,B), join(A,C))

now if A is an complex expression, that is a bad idea (probably).

If A is a small VALUES block then it makes sense.  It isn't done though.

 > In the above query, I
> don't know where to place the VALUES so that the binding for ?uri (in
> effect, changing the variable to a constant) would be applied in all
> parts of the query.

See above.

>
> Placing the VALUES block at the bottom of the query (outside the WHERE
> block) doesn't help either. In fact execution time increases to 17
> seconds with 3.1.1-SNAPSHOT (but is unchanged with 3.1.0).
>
> I tried --engine=ref and it was extremely slow also with 3.1.0, so in
> that sense, nothing has changed, only an optimization has been dropped
> somewhere.
>
> Should I report this as an issue? Or am I just doing something wrong?
>
> -Osma
>
>
>
> On 01/11/16 11:03, Osma Suominen wrote:
>> Hi,
>>
>> I'm investigating a performance regression we're seeing with the current
>> Jena 3.1.1-SNAPSHOT compared to 3.1.0.
>>
>> The data in graph <http://www.yso.fi/onto/yso/> is the YSO ontology,
>> available from http://api.finto.fi/download/yso/yso-skos.ttl
>>
>> This query used to take about 0.2 seconds (with 3.1.0) and now takes
>> about 10 seconds:
>>
>> --cut--
>> PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
>> SELECT *
>> FROM NAMED <http://www.yso.fi/onto/yso/>
>> WHERE {
>>    ?uri ?p ?o .
>>    OPTIONAL {
>>      ?x skos:member ?o .
>>      FILTER NOT EXISTS {
>>        ?x skos:member ?other .
>>        FILTER NOT EXISTS {
>>          ?other skos:broader ?uri
>>        }
>>      }
>>    }
>>    VALUES ?uri { <http://www.yso.fi/onto/yso/p864> }
>> }
>> --cut--
>>
>> If I move the VALUES block to the top of the query, right after WHERE,
>> then the query becomes fast again.
>>
>> Is the placement of the VALUES block supposed to affect query evaluation
>> order in this way? It appears to me that in the slow version, ?uri is
>> not bound inside the inner FILTER NOT EXISTS, which causes an explosion
>> of results internally.
>>
>> -Osma
>>
>
>

Re: Placement of VALUES block affects performance in 3.1.1

Posted by Osma Suominen <os...@helsinki.fi>.

Hi,

Some further observations. The query I sent earlier was a minimal 
example, and it was possible to fix it by just moving the VALUES block. 
But a slightly more realistic (closer to the original query I'm having 
problems with) example involves a UNION and cannot be fixed so easily - 
placing the VALUES block first doesn't help:

--cut--
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT *
WHERE {
   VALUES ?uri { <http://www.yso.fi/onto/yso/p864> }
   { ?s ?p ?uri }
   UNION
   { ?uri ?p ?o
   OPTIONAL {
     ?x skos:member ?o .
     FILTER NOT EXISTS {
       ?x skos:member ?other .
       FILTER NOT EXISTS {
         ?other skos:broader ?uri
       }
     }
   } }
}
--cut--

Jena 3.1.0 tdbquery: 0.9 seconds
Jena 3.1.1-SNAPSHOT tdbquery: 12.8 seconds

I'm aware that in SPARQL, evaluation proceeds from the inside out and 
Jena ARQ has moved more and more in this direction with recent releases, 
which may also explain this change. But how should VALUES blocks be 
placed for optimal query execution? It seems like a waste not to 
propagate those fixed bindings into inner parts of the query, even 
though that may violate the inside-out order. In the above query, I 
don't know where to place the VALUES so that the binding for ?uri (in 
effect, changing the variable to a constant) would be applied in all 
parts of the query.

Placing the VALUES block at the bottom of the query (outside the WHERE 
block) doesn't help either. In fact execution time increases to 17 
seconds with 3.1.1-SNAPSHOT (but is unchanged with 3.1.0).

I tried --engine=ref and it was extremely slow also with 3.1.0, so in 
that sense, nothing has changed, only an optimization has been dropped 
somewhere.

Should I report this as an issue? Or am I just doing something wrong?

-Osma

On 01/11/16 11:03, Osma Suominen wrote:
> Hi,
>
> I'm investigating a performance regression we're seeing with the current
> Jena 3.1.1-SNAPSHOT compared to 3.1.0.
>
> The data in graph <http://www.yso.fi/onto/yso/> is the YSO ontology,
> available from http://api.finto.fi/download/yso/yso-skos.ttl
>
> This query used to take about 0.2 seconds (with 3.1.0) and now takes
> about 10 seconds:
>
> --cut--
> PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
> SELECT *
> FROM NAMED <http://www.yso.fi/onto/yso/>
> WHERE {
>    ?uri ?p ?o .
>    OPTIONAL {
>      ?x skos:member ?o .
>      FILTER NOT EXISTS {
>        ?x skos:member ?other .
>        FILTER NOT EXISTS {
>          ?other skos:broader ?uri
>        }
>      }
>    }
>    VALUES ?uri { <http://www.yso.fi/onto/yso/p864> }
> }
> --cut--
>
> If I move the VALUES block to the top of the query, right after WHERE,
> then the query becomes fast again.
>
> Is the placement of the VALUES block supposed to affect query evaluation
> order in this way? It appears to me that in the slow version, ?uri is
> not bound inside the inner FILTER NOT EXISTS, which causes an explosion
> of results internally.
>
> -Osma
>

-- 
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suominen@helsinki.fi
http://www.nationallibrary.fi