Posted to users@jena.apache.org by Zen 98052 <z9...@outlook.com> on 2016/06/09 15:28:08 UTC

not performant query

Hi,

I have the SPARQL query below, which doesn't seem efficient.

When running it, I noticed that Jena calls execute(OpBGP opBGP, QueryIterator ...) a great many times.

I have my own implementation of that function (overriding the one in the base class OpExecutor), which makes calls to our back-end storage.

From the qparse output (attached below), it looks like the culprit is that the query has BGPs inside the FILTER, which would explain the behavior I am seeing.


Is there a better way to rewrite the query below so that it achieves the same result more efficiently?


Thanks,

Z



/// SPARQL QUERY:


PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX raw: <http://v/raw#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX v: <http://b/dir/>

SELECT (COUNT(?x) AS ?count) WHERE
{
  ?x rdf:type v:Person .
  {
    SELECT ?x WHERE
    {
      ?x v:hasSnapshot ?snapshot .
      ?snapshot rdf:type v:DS .
      ?snapshot v:mdId ?id .
      VALUES ?id { 'b01.xml' 'f5f.xml' }
      MINUS
      {
        ?x v:hasSnapshot ?snapshot .
        ?snapshot rdf:type v:DS .
        ?snapshot v:mdId ?id .
        VALUES ?id { 'def.xml' '191.xml' }
      }
    }
  }
  ?x ?p ?o .
  OPTIONAL
  {
    ?o ?x ?y .
    ?o rdf:type ?type.
    FILTER NOT EXISTS
    {
      { ?o rdf:type v:Dynamic }
      UNION
      { ?o rdf:type v:Static }
    }
  }
}



/// OUTPUT FROM running "qparse --explain --print=op -v":


(prefix ((raw: <http://v/raw#>)
         (rdfs: <http://www.w3.org/2000/01/rdf-schema#>)
         (rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>)
         (owl: <http://www.w3.org/2002/07/owl#>)
         (v: <http://b/dir/>))
  (project (?count)
    (extend ((?count ?.0))
      (group () ((?.0 (count ?x)))
        (conditional
          (sequence
            (join
              (bgp (triple ?x rdf:type v:Person))
              (project (?x)
                (minus
                  (sequence
                    (table (vars ?/id)
                      (row [?/id "b01.xml"])
                      (row [?/id "f5f.xml"])
                    )
                    (bgp
                      (triple ?x v:hasSnapshot ?/snapshot)
                      (triple ?/snapshot rdf:type v:DS)
                      (triple ?/snapshot v:mdId ?/id)
                    ))
                  (sequence
                    (table (vars ?/id)
                      (row [?/id "def.xml"])
                      (row [?/id "191.xml"])
                    )
                    (bgp
                      (triple ?x v:hasSnapshot ?/snapshot)
                      (triple ?/snapshot rdf:type v:DS)
                      (triple ?/snapshot v:mdId ?/id)
                    )))))
            (bgp (triple ?x ?p ?o)))
          (sequence
            (filter (notexists
                       (union
                         (bgp (triple ?o rdf:type v:Dynamic))
                         (bgp (triple ?o rdf:type v:Static))))
              (bgp (triple ?o ?x ?y)))
            (bgp (triple ?o rdf:type ?type))))))))



Re: not performant query

Posted by Zen 98052 <z9...@outlook.com>.
Thanks Andy! Answers inline ...


________________________________
From: Andy Seaborne <an...@apache.org>
Sent: Friday, June 10, 2016 11:06 AM
To: users@jena.apache.org
Subject: Re: not performant query

On 09/06/16 16:28, Zen 98052 wrote:
> Hi,
>
> I have a Sparql query below, which doesn't seem efficient.
>
> I noticed when running it, Jena calls execute(OpBGP opBGP,
> QueryIterator ...) so many times.

The default execution strategy - i.e. for in-memory use - is just that -
a default.

If your storage layer has different characteristics, e.g. there is a
certain amount of overhead to go and get data, then the default execution
strategy may be the wrong one.  That's the job of the optimizer and of
OpExecutor.

What does your storage layer look like?

[Z] We use Accumulo as the storage, based on https://wiki.apache.org/incubator/RyaProposal.
Basically, there are three tables, SPO, POS, and OSP, and the lookup goes to one of them depending on the BGP.
The serialized triple, e.g. SPO (delimited by null characters), is stored as the key, so we can just set scan ranges to fetch all rows that match the pattern efficiently.
Therefore, each time Jena calls my callback for a BGP (the execute function with the OpBGP argument), it submits a request to the store and iterates over all the returned rows.
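
A minimal Java sketch of the lookup described above, under stated assumptions. The class TripleIndexChooser, its methods, and the exact key layout are hypothetical; none of this is taken from Rya or Accumulo. It only illustrates how the bound positions of a triple pattern could select one of the SPO, POS, or OSP tables and produce a null-delimited key prefix for a range scan.

```java
/** Hypothetical sketch: pick an index table and build a scan prefix for one triple pattern. */
final class TripleIndexChooser {

    enum Index { SPO, POS, OSP }

    /** Choose the table whose key order starts with the bound terms; null means "variable". */
    static Index choose(String s, String p, String o) {
        if (s != null) {
            // (s), (s,p), (s,p,o) are prefixes of SPO; (s,o) is the prefix (o,s) of OSP.
            return (p == null && o != null) ? Index.OSP : Index.SPO;
        }
        if (p != null) {
            return Index.POS; // (p) or (p,o)
        }
        if (o != null) {
            return Index.OSP; // (o)
        }
        return Index.SPO;     // nothing bound: full scan of any table
    }

    /** Concatenate the leading bound terms in the chosen table's order, delimited by '\0'. */
    static String scanPrefix(String s, String p, String o) {
        String[] ordered;
        switch (choose(s, p, o)) {
            case POS: ordered = new String[] { p, o, s }; break;
            case OSP: ordered = new String[] { o, s, p }; break;
            default:  ordered = new String[] { s, p, o }; break;
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < ordered.length && ordered[i] != null; i++) {
            sb.append(ordered[i]);
            // Assumed layout: terms are separated by '\0', with no delimiter after the third term.
            if (i < 2) {
                sb.append('\0');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Pattern "?x rdf:type v:Person": predicate and object are bound.
        System.out.println(choose(null, "rdf:type", "v:Person")); // POS
    }
}
```

With a layout like this, every triple pattern in a BGP costs one range scan per incoming binding, which is why the number of execute(OpBGP, ...) calls matters so much for this kind of store.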

> I have my own implementation in that function (overrides base class
> OpExecutor), which it'll make call to our back-end storage.
>
> From qparse output (attached below), it looks like the culprit is
> because the query has BGPs inside the FILTER, which explains the
> behavior I am seeing.

Possibly - there are several points where costs may arise.

 > ?o rdf:type ?type.
> FILTER NOT EXISTS
>     {
>       { ?o rdf:type v:Dynamic }
>       UNION
>       { ?o rdf:type v:Static }
>     }

FILTER NOT EXISTS {} can usually be written as MINUS or, in this case, as an
expression FILTER on ?type, as you have already fetched the rdf:type.

FILTER ( ?type != v:Dynamic && ?type != v:Static )

[Z] There's a bug in the query: the '?o rdf:type ?type' pattern shouldn't be there, so I can't follow your suggestion, but it is still a useful tip for me.

The (sequence) is flowing results one-by-one into the next step.
Depending on the storage, it may be better to switch that rewrite off
and use the built-in hash join - or do your own (parallel hash join maybe?)

Do you implement solving BGPs in your store and not relying on the
iterative solver that is used by default?

[Z] Yes. What other execution strategies does Jena provide (besides the default one)? Also, are there any existing samples?

> Is there a better way to re-write the query below to achieve same
> result, but more efficient (and lead to better performance)?

If you could give some details of the store it would help.  It's hard to
make many suggestions because it is all about the details.

        Andy

>
>
> Thanks,
>
> Z
>


Re: not performant query

Posted by Andy Seaborne <an...@apache.org>.
On 09/06/16 16:28, Zen 98052 wrote:
> Hi,
>
> I have a Sparql query below, which doesn't seem efficient.
>
> I noticed when running it, Jena calls execute(OpBGP opBGP,
> QueryIterator ...) so many times.

The default execution strategy - i.e. for in-memory use - is just that -
a default.

If your storage layer has different characteristics, e.g. there is a
certain amount of overhead to go and get data, then the default execution
strategy may be the wrong one.  That's the job of the optimizer and of
OpExecutor.

What does your storage layer look like?

> I have my own implementation in that function (overrides base class
> OpExecutor), which it'll make call to our back-end storage.
>
> From qparse output (attached below), it looks like the culprit is
> because the query has BGPs inside the FILTER, which explains the
> behavior I am seeing.

Possibly - there are several points where costs may arise.

 > ?o rdf:type ?type.
> FILTER NOT EXISTS
>     {
>       { ?o rdf:type v:Dynamic }
>       UNION
>       { ?o rdf:type v:Static }
>     }

FILTER NOT EXISTS {} can usually be written as MINUS or, in this case, as an
expression FILTER on ?type, as you have already fetched the rdf:type.

FILTER ( ?type != v:Dynamic && ?type != v:Static )
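
To make the MINUS alternative concrete, here is a rough Java sketch of the anti-join idea behind it. It assumes a custom executor that fetches the excluded subjects once (for example, everything typed v:Dynamic or v:Static) and then filters candidates in memory, instead of issuing one store lookup per candidate row. AntiJoinSketch and its method are hypothetical names and not part of Jena. Note also that MINUS and NOT EXISTS are not equivalent in every query, so this is only an illustration of the evaluation strategy.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: evaluate an exclusion as a single in-memory anti-join. */
final class AntiJoinSketch {

    /** Keep the candidates that do not appear in the excluded set, preserving order. */
    static List<String> antiJoin(List<String> candidates, Set<String> excluded) {
        List<String> kept = new ArrayList<>();
        for (String candidate : candidates) {
            // One hash lookup per row, no round trip to the back-end store.
            if (!excluded.contains(candidate)) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Assumed to come from one scan per excluded type, done once up front.
        Set<String> excluded = new HashSet<>(Arrays.asList("o2"));
        System.out.println(antiJoin(Arrays.asList("o1", "o2", "o3"), excluded)); // [o1, o3]
    }
}
```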



The (sequence) is flowing results one-by-one into the next step.
Depending on the storage, it may be better to switch that rewrite off
and use the built-in hash join - or do your own (parallel hash join maybe?)
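
For readers unfamiliar with the difference, the following is a minimal Java sketch of a hash join over solution rows, written for illustration only. JoinSketch, row and hashJoin are hypothetical names, and this is not Jena's implementation. It assumes that both sides bind the join variable and that it is the only variable they share.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a hash join on one shared variable. */
final class JoinSketch {

    /** Build a row from alternating variable/value arguments, e.g. row("x", "a", "o", "b"). */
    static Map<String, String> row(String... kv) {
        Map<String, String> r = new HashMap<>();
        for (int i = 0; i + 1 < kv.length; i += 2) {
            r.put(kv[i], kv[i + 1]);
        }
        return r;
    }

    /** Join left and right rows that agree on the value of the given variable. */
    static List<Map<String, String>> hashJoin(List<Map<String, String>> left,
                                              List<Map<String, String>> right,
                                              String var) {
        // Build phase: the right side is fetched once and indexed by the join value.
        Map<String, List<Map<String, String>>> table = new HashMap<>();
        for (Map<String, String> r : right) {
            table.computeIfAbsent(r.get(var), k -> new ArrayList<>()).add(r);
        }
        // Probe phase: each left row costs one hash lookup instead of one store request.
        List<Map<String, String>> out = new ArrayList<>();
        for (Map<String, String> l : left) {
            List<Map<String, String>> matches =
                table.getOrDefault(l.get(var), Collections.<Map<String, String>>emptyList());
            for (Map<String, String> r : matches) {
                Map<String, String> merged = new HashMap<>(l);
                merged.putAll(r);
                out.add(merged);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map<String, String>> left = new ArrayList<>();
        left.add(row("x", "alice"));
        left.add(row("x", "bob"));
        List<Map<String, String>> right = new ArrayList<>();
        right.add(row("x", "alice", "o", "snapshot1"));
        System.out.println(hashJoin(left, right, "x").size()); // 1
    }
}
```

The trade-off is that the right-hand side is evaluated without the bindings from the left, so it may return many more rows than the one-by-one substitution would; whether that is a win depends on the per-request overhead of the store.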

Do you implement solving BGPs in your store and not relying on the 
iterative solver that is used by default?

> Is there a better way to re-write the query below to achieve same
> result, but more efficient (and lead to better performance)?

If you could give some details of the store it would help.  It's hard to 
make many suggestions because it is all about the details.

	Andy

>
>
> Thanks,
>
> Z
>

