You are viewing a plain text version of this content. The canonical link for it is here.

Posted to users@jena.apache.org by Élie Roux <el...@telecom-bretagne.eu> on 2021/10/06 09:31:54 UTC

puzzling performance issue

Dear all,

I'm experiencing a performance issue that I can't understand... I'm using:
- Jena 3.14.0 , Fuseki (I'm testing in the web interface)
- TDB1
- none.opt
- this configuration:
https://github.com/buda-base/buda-base/blob/master/conf/fuseki/ttl.erb
(with some variable substitutions)
- the relevant data is all contained in this file:
http://purl.bdrc.io/graph/MW23703.ttl

a minimal version of the query is the following, with the relevant comments:

prefix : <http://purl.bdrc.io/ontology/core/>
prefix bdr: <http://purl.bdrc.io/resource/>

CONSTRUCT
{
   bdr:MW23703_1183 ?instp ?insto .
   ?t ?tp ?to .
   ?ancestor :hasPart ?ancestorPart .
   # this next line is weird: comment it and you gain 200ms in the
long query, but the variables are not bound anyways
   ?ancestor ?ancestorp ?ancestoro .
}
WHERE
{
   # in this form (no line commented), the query takes 1.5 to 2s
   # comment any of the lines and the performance improvement is very dramatic
   bdr:MW23703_1183 ?instp ?insto . # 200ms alone
   bdr:MW23703_1183 :hasTitle ?t . ?t ?tp ?to . #245ms alone
   bdr:MW23703_1183 :partOf+ ?ancestor . ?ancestor :hasPart
?ancestorPart . # 200ms alone
}


so it seems that the issue is not just one bgp that needs optimization
but some sort of interaction between the various components? Is there
something I'm doing wrong?

Thanks in advance,
-- 
Elie

Re: puzzling performance issue

Posted by Andy Seaborne <an...@apache.org>.

When there are different parts of pattern going to make up different 
parts of the CONSTRUCT template, splitting it up into UNION makes sense. 
It is using the fact that in a CONSTRUCT template, if variables are 
unbound, the triple pattern isn't substantiated but the rest of the 
triples from the template are still generated.

Following on from Richard, usin UNION turns a expontial growth into 
linear growth.

A better solution is to support the idiom better.

That is have: a single request that is

CONSTRUCT { template1 } WHERE { pattern1 } ;
CONSTRUCT { template2 } WHERE { pattern2 } ;
CONSTRUCT { template3 } WHERE { pattern3 }
...

(c.f SPARQL Update).

to form one single result graph.

     Andy

On 07/10/2021 11:57, Élie Roux wrote:
> Thanks a lot for your very informative answer Richard, it's really
> helpful to know when writing queries!
> 
> It seems this is a case where some optimizations might be implemented?
> (I'm afraid this isn't something I could contribute though, sorry)
> 
> Best,
>

Re: puzzling performance issue

Posted by Rob Vesse <rv...@dotnetrdf.org>.

Because it isn't a valid semantics preserving optimization. ARQ only applies optimizations that preserve the semantics of the query, the fact that this is ultimately a CONSTRUCT query doesn't change the semantics of the query evaluation itself, merely the final RDF produced. Making the rewrite you suggest changes the semantics of the query significantly so ARQ is never going to automatically apply that for you.

ARQ allows you to write and use custom optimizers/query rewriters/query engines if you so wish. If you know your data and the types of queries that should/shouldn't be being written then you can provide your own custom components to handle this "optimisation" for your use case.

Thanks,

Rob

On 19/10/2021, 10:09, "Élie Roux" <el...@telecom-bretagne.eu> wrote:

> As others pointed out semantically the evaluation of the two query forms yields very different intermediate results. It's only the presence of the post-processing CONSTRUCT stage that happens to strip out the duplicates/unusable results. Any optimizations MUST preserve the overall semantics of the query to ensure they yield the same results, so no you couldn't apply an optimization in this case. In this specific case CONSTRUCT happens as a post-processing step as part of producing the final query results, it's not actually within the portion of query evaluation that is subject to ARQs optimizer.

Thanks! I still don't understand though... is there a reason why the
ARQs optimizer can't have a boolean flag telling it it can optimize
things for a CONSTRUCT?

Best,
--
Elie

Re: puzzling performance issue

Posted by Élie Roux <el...@telecom-bretagne.eu>.

> As others pointed out semantically the evaluation of the two query forms yields very different intermediate results.  It's only the presence of the post-processing CONSTRUCT stage that happens to strip out the duplicates/unusable results.  Any optimizations MUST preserve the overall semantics of the query to ensure they yield the same results, so no you couldn't apply an optimization in this case.  In this specific case CONSTRUCT happens as a post-processing step as part of producing the final query results, it's not actually within the portion of query evaluation that is subject to ARQs optimizer.

Thanks! I still don't understand though... is there a reason why the
ARQs optimizer can't have a boolean flag telling it it can optimize
things for a CONSTRUCT?

Best,
-- 
Elie

Re: puzzling performance issue

Posted by Rob Vesse <rv...@dotnetrdf.org>.

Not really

This pattern of unconnected BGPs has legitimate use cases. A common one is doing similarity calculations where you use unconnected BGPs to create every possible combination of results and then use BIND and/or FILTER to compute some metric and use that to filter/rank the combinations.

As others pointed out semantically the evaluation of the two query forms yields very different intermediate results. It's only the presence of the post-processing CONSTRUCT stage that happens to strip out the duplicates/unusable results. Any optimizations MUST preserve the overall semantics of the query to ensure they yield the same results, so no you couldn't apply an optimization in this case. In this specific case CONSTRUCT happens as a post-processing step as part of producing the final query results, it's not actually within the portion of query evaluation that is subject to ARQs optimizer.

You could write a query analyzer that would highlight these kinds of potential issues but ultimately it comes down to the query author understanding what they are trying to achieve and making the appropriate choice for their use cases. As I said at the start this pattern has legitimate use cases even if it has performance implications.

Rob

On 07/10/2021, 12:57, "Élie Roux" <el...@telecom-bretagne.eu> wrote:

> Overall, it whether the WHERE answer is 16*26*2636 rows (all one BGP) or
> 16+26+2636 rows (union).

Yes, I understand better now, thanks!

Do you think there might be some optimization at some point for that
case? I suspect this is very common in SPARQL queries out there...

Best,
--
Elie

Re: puzzling performance issue

Posted by Élie Roux <el...@telecom-bretagne.eu>.

> Overall, it whether the WHERE answer is 16*26*2636 rows (all one BGP) or
> 16+26+2636 rows (union).

Yes, I understand better now, thanks!

Do you think there might be some optimization at some point for that
case? I suspect this is very common in SPARQL queries out there...

Best,
-- 
Elie

Re: puzzling performance issue

Posted by Andy Seaborne <an...@apache.org>.


On 07/10/2021 12:30, Élie Roux wrote:
>> if you take this expression
>>
>> WHERE
>> {
>>   {
>>      bdr:MW23703_1183 ?instp ?insto . # 200ms alone
>>   } union {
>>      bdr:MW23703_1183 :hasTitle ?t . ?t ?tp ?to . #245ms alone
>> } union {
>>    bdr:MW23703_1183 :partOf+ ?ancestor . ?ancestor :hasPart
>> ?ancestorPart . # 200ms alone
>>   }
>> }
>>
>> separate it into three distinct queries and aggregate the results with respective (count(*) as ?count) select bindings, what numbers result?
> 
> first bgp: 16
> second: 26
> third: 2636 (same with count unique)

The patterns are not connected.

Ovaerll, it whether the WHERE answer is 16*26*2636 rows (all one BGP) or 
16+26+2636 rows (union).

     Andy

> 
> Best,
>

Re: puzzling performance issue

Posted by Élie Roux <el...@telecom-bretagne.eu>.

> if you take this expression
>
> WHERE
> {
>  {
>     bdr:MW23703_1183 ?instp ?insto . # 200ms alone
>  } union {
>     bdr:MW23703_1183 :hasTitle ?t . ?t ?tp ?to . #245ms alone
> } union {
>   bdr:MW23703_1183 :partOf+ ?ancestor . ?ancestor :hasPart
> ?ancestorPart . # 200ms alone
>  }
> }
>
> separate it into three distinct queries and aggregate the results with respective (count(*) as ?count) select bindings, what numbers result?

first bgp: 16
second: 26
third: 2636 (same with count unique)

Best,
-- 
Elie

Re: puzzling performance issue

Posted by Élie Roux <el...@telecom-bretagne.eu>.

Thanks a lot for your very informative answer Richard, it's really
helpful to know when writing queries!

It seems this is a case where some optimizations might be implemented?
(I'm afraid this isn't something I could contribute though, sorry)

Best,
-- 
Elie

Re: puzzling performance issue

Posted by Richard Cyganiak <ri...@cyganiak.de>.

Queries of the form

    CONSTRUCT {...} WHERE {...}

are evaluated with a three-stage pipeline. First, the query

    SELECT * WHERE {...}

is executed. Second, the CONSTRUCT template is applied to each result row (producing no triple for any triple pattern that has a variable without value in that row). Third, any duplicate triples produced in step 2 are removed.

If you run both versions of your query in the SELECT * WHERE {...} form, you will see that the version without UNION produces a much larger intermediate result. It is only due to the duplicate removal in step 3 that you get the same final result from both versions. This larger intermediate result explains why the version without UNION is much slower.

Hope that helps,
Richard

> On 6 Oct 2021, at 21:59, Élie Roux <el...@telecom-bretagne.eu> wrote:
> 
> After long hours of anxiety, I discovered that using unions as in
> 
> CONSTRUCT
> {
>   bdr:MW23703_1183 ?instp ?insto .
>   ?t ?tp ?to .
>   ?ancestor :hasPart ?ancestorPart .
> }
> WHERE
> {
>  {
>     bdr:MW23703_1183 ?instp ?insto . # 200ms alone
>  } union {
>     bdr:MW23703_1183 :hasTitle ?t . ?t ?tp ?to . #245ms alone
> } union {
>   bdr:MW23703_1183 :partOf+ ?ancestor . ?ancestor :hasPart
> ?ancestorPart . # 200ms alone
>  }
> }
> is much faster than using the bgp in the same group... is there any
> reason for that?
> 
> Best,
> -- 
> Elie

Re: puzzling performance issue

Posted by Élie Roux <el...@telecom-bretagne.eu>.

After long hours of anxiety, I discovered that using unions as in

CONSTRUCT
{
   bdr:MW23703_1183 ?instp ?insto .
   ?t ?tp ?to .
   ?ancestor :hasPart ?ancestorPart .
}
WHERE
{
  {
     bdr:MW23703_1183 ?instp ?insto . # 200ms alone
  } union {
     bdr:MW23703_1183 :hasTitle ?t . ?t ?tp ?to . #245ms alone
 } union {
   bdr:MW23703_1183 :partOf+ ?ancestor . ?ancestor :hasPart
?ancestorPart . # 200ms alone
  }
}
is much faster than using the bgp in the same group... is there any
reason for that?

Best,
-- 
Elie