Posted to users@jena.apache.org by Adam K <ko...@gmail.com> on 2021/02/22 13:18:51 UTC

Re: Optimization of query plan for pipe operator

Hi all, I ran two simple, equivalent queries that show a big performance
difference on a large dataset:


   1. First, matching two alternative predicates using the pipe operator:

      SELECT (count(*) as ?total) WHERE {
        { ?s <http://someURI1> | <http://someURI1> ?o . }
      }

      This one is very slow, and the query plan shows the following matching
      pattern:

      (path ?subject (alt  <http://someURI1>  <http://someURI2> ) ?object)))))

   2. If I use the UNION operator instead of the pipe, the query becomes fast:

      SELECT (count(*) as ?total) WHERE {
        { ?s <http://someURI1> ?o . }
        UNION
        { ?s <http://someURI2> ?o . }
      }

      The query plan here is different and shows a UNION of two BGP matches:

      (union (bgp (triple ?s <http://someURI1> ?o )) (bgp (triple ?s <http://someURI2> ?o ))))))


The documentation at
https://jena.apache.org/documentation/query/property_paths.html says:

   1. "Paths are “simple” if they involve only operators / (sequence), ^
   (reverse, unary or binary) and the form {n}, for some single integer n."
   2. "A path is “complex”  if it involves one or more of the operators
   *,?, + and {}."

These statements do not define the implications of |. It should act like a
union, but the query plan is different. Is this a bug or a feature? Is there
a general recommendation to use UNION instead of the pipe?

Thanks for help!

Re: Optimization of query plan for pipe operator

Posted by Andy Seaborne <an...@apache.org>.

Adam,

While the queries are functionally the same, they are executed in
different ways.

TDB uses NodeIds (64-bit ids) to identify RDF Terms (class Node). The
mapping between the two is the node table.  Nodes are retrieved
"on-demand" if results or another part of the query need them.

Count is handled specially in TDB. When counting, the actual node form
(URI, literal etc.) isn't rebuilt, because all that is needed is the count.
Just the internal NodeId is enough.
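
As a rough illustration of why counting can skip the node table, here is a minimal Python sketch. It is a toy model, not TDB's code: the `NodeTable` class, its `lookups` counter, and the helper functions are all hypothetical names invented for this example, and terms are plain strings rather than real RDF nodes.

```python
class NodeTable:
    """Toy node table: maps small integer ids to RDF terms (plain strings here)."""

    def __init__(self):
        self._nodes = []   # id -> term
        self._ids = {}     # term -> id
        self.lookups = 0   # how many id -> term retrievals were needed

    def id_for(self, term):
        """Return the id for a term, allocating one the first time it is seen."""
        if term not in self._ids:
            self._ids[term] = len(self._nodes)
            self._nodes.append(term)
        return self._ids[term]

    def node_for(self, node_id):
        """Rebuild the term for an id; on a real disk store this step costs I/O."""
        self.lookups += 1
        return self._nodes[node_id]


def load(table, triples):
    """Store triples as tuples of ids only."""
    return [tuple(table.id_for(term) for term in triple) for triple in triples]


def count_by_predicate(id_triples, predicate_id):
    """Count matches by comparing ids; no term is ever rebuilt."""
    return sum(1 for _, p, _ in id_triples if p == predicate_id)


def match_by_predicate(table, id_triples, predicate_id):
    """Return matches as terms, which needs one retrieval per triple position."""
    return [tuple(table.node_for(i) for i in triple)
            for triple in id_triples if triple[1] == predicate_id]


table = NodeTable()
store = load(table, [
    ("ex:a", "http://someURI1", "ex:x"),
    ("ex:b", "http://someURI1", "ex:y"),
    ("ex:c", "http://someURI2", "ex:z"),
])
p1 = table.id_for("http://someURI1")

total = count_by_predicate(store, p1)        # works on ids only
lookups_after_count = table.lookups          # still 0 at this point
rows = match_by_predicate(table, store, p1)  # forces term retrieval
print(total, lookups_after_count, table.lookups)  # prints: 2 0 6
```

In this toy, the count touches no terms at all, while returning the same two matches as terms costs six retrievals. That gap is the kind of cost that shows up when each retrieval is a disk read.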

{ ?s <http://someURI1> ?o . }
UNION
{ ?s <http://someURI2> ?o . }

is two TDB-level patterns - the matching is done and results involve 
NodeIds but not Nodes.

In

{ ?s <http://someURI1> | <http://someURI2> ?o .}

(you wrote "<http://someURI1> | <http://someURI1>" -- same URI)

ARQ passes the expression to the path evaluator which works on nodes.
There is some rewrite of paths (e.g. "/") but by default that does not 
include "|".
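
To make the idea of such a rewrite concrete, here is a small Python sketch that turns an `alt` path into a union of single-triple BGPs, mirroring the two plans shown in the original question. This is not ARQ's optimizer or its API: the nested-tuple encoding of the algebra and the function name `rewrite_alt` are assumptions made up for this example, and it only handles the case where every alternative is a plain IRI.

```python
def rewrite_alt(op):
    """Rewrite (path s (alt p1 p2 ...) o) into (union (bgp (triple s p1 o)) ...).

    Any other operator, or an alt with a non-IRI alternative, is returned unchanged.
    """
    if isinstance(op, tuple) and len(op) == 4 and op[0] == "path":
        _, subject, path, obj = op
        is_alt = isinstance(path, tuple) and len(path) > 1 and path[0] == "alt"
        if is_alt and all(isinstance(p, str) for p in path[1:]):
            branches = tuple(("bgp", ("triple", subject, p, obj)) for p in path[1:])
            return ("union",) + branches
    return op


plan = ("path", "?s", ("alt", "<http://someURI1>", "<http://someURI2>"), "?o")
rewritten = rewrite_alt(plan)
print(rewritten)
```

Until a rewrite like this is applied automatically, writing the UNION by hand in the query has the same effect.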

This only makes a pronounced difference because of the count; otherwise I
don't think you'd see a difference.  The time is going on retrieving
nodes, and it's an HDD.

And if you run a warm-up query first, which will fill the node cache, the
speed should be closer to the 1.1s case.
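
The warm-up effect can be sketched in a few lines of Python. Again this is only a toy under stated assumptions: `CachedNodeTable`, `disk_reads`, and `run_query` are hypothetical names, the "disk" is a dict, and the cache here is unbounded, whereas a real node cache has a size limit.

```python
class CachedNodeTable:
    """Toy node table with a cache in front of a slow backing store."""

    def __init__(self, on_disk):
        self._on_disk = on_disk  # id -> term, standing in for the disk
        self._cache = {}
        self.disk_reads = 0      # retrievals that had to go to the backing store

    def node_for(self, node_id):
        """Return the term for an id, reading from 'disk' only on a cache miss."""
        if node_id not in self._cache:
            self.disk_reads += 1
            self._cache[node_id] = self._on_disk[node_id]
        return self._cache[node_id]


def run_query(node_table, node_ids):
    """Pretend query: rebuild the term for every id in the result."""
    return [node_table.node_for(i) for i in node_ids]


cached = CachedNodeTable({0: "ex:a", 1: "ex:b", 2: "ex:c"})

run_query(cached, [0, 1, 2, 0])               # warm-up run fills the cache
cold_reads = cached.disk_reads                # 3 distinct ids were read
run_query(cached, [0, 1, 2, 0])               # same query again
warm_reads = cached.disk_reads - cold_reads   # nothing left to read
print(cold_reads, warm_reads)  # prints: 3 0
```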

     Andy

On 22/02/2021 15:09, Adam K wrote:
> Hi Andy,
> Thanks for the response. It's tested on Jena 3.10.0 and 3.17.0 on HDD TDB1 - UNION query counts 63966 results in 1.1s while pipe query finished with timeout after 2 minutes. Whole dataset has 1322457 triples.
> Thanks,
> 

Re: Optimization of query plan for pipe operator

Posted by Adam K <ko...@gmail.com>.

Hi Andy,
Thanks for the response. I tested on Jena 3.10.0 and 3.17.0 with TDB1 on an HDD. The UNION query counts 63966 results in 1.1s, while the pipe query timed out after 2 minutes. The whole dataset has 1322457 triples.
Thanks,

On 2021/02/22 14:01:57, Andy Seaborne <an...@apache.org> wrote: 
> Hi Adam,
> 
> It would be useful to also know:
> 
>      which version of Jena this is
>      What the storage is - in-memory, or TDB
>          TDB1 or TDB2?
>          If TDB: What the hardware is disk or SSD?
>      What the times actually are and what the count result is?
> 
> Count is handled specially in TDB and maybe that interacts with the "|" 
> usage.
> 
>      Andy
> 

Re: Optimization of query plan for pipe operator

Posted by Andy Seaborne <an...@apache.org>.

Hi Adam,

It would be useful to also know:

     Which version of Jena this is
     What the storage is: in-memory or TDB
         TDB1 or TDB2?
         If TDB: is the hardware disk or SSD?
     What the times actually are and what the count result is

Count is handled specially in TDB and maybe that interacts with the "|" 
usage.

     Andy
