Posted to users@jena.apache.org by Élie Roux <el...@telecom-bretagne.eu> on 2020/01/21 18:24:39 UTC

super slow filter

Dear all,

I have a (relatively large) dataset in Fuseki (default optimization
settings) for which I have attached the relevant triples. The following
query takes around 160ms (according to the Fuseki logs):

construct {
    ?res ?resp ?reso .
}
where {
  {
    bdr:G844 ?rel ?res .
    ?res a bdo:Place .
    ?res ?resp ?reso .
  }
}

It's not great (at all), but I can live with it. The problem is that if
I add this filter:

FILTER (?resp = skos:altLabel || ?resp = skos:prefLabel || ?resp =
skos:placeEvent || ?resp = bdo:placeLat || ?resp = bdo:placeLong ||
?resp = bdo:placeType || ?resp = bdo:placeLocatedIn || ?resp =
owl:sameAs || ?resp = tmp:entityScore)

which despite its length is quite simple, then the Fuseki logs
indicate 1200ms!! I reproduced it at least 20 times.

Attached are the complete slow and fast queries. Is there an obvious
error in my query, or is it a performance issue in Fuseki? If so, I can
report it on JIRA.

Best,
-- 
Elie

Re: super slow filter

Posted by Élie Roux <el...@telecom-bretagne.eu>.
> What’s “relatively large”? 100 ms doesn’t sound that bad.

That depends on how the system is used... In my initial query I have a
union of 6 subqueries that look a bit like this one, which brings the
query time to 6s. If I only ran a few queries a day that wouldn't
matter, but our system is used in production (SPARQL is used behind the
scenes for all the requests on our website), and in that kind of
context 100ms is huge and 6s is just unacceptable.

> Re. syntax, I think you could shorten the query using FILTER (?resp IN

Thanks, it's nicer!

(In case anyone wonders, it has no impact on performance)

Best,
-- 
Elie

Re: super slow filter

Posted by Martynas Jusevičius <ma...@atomgraph.com>.
What’s “relatively large”? 100 ms doesn’t sound that bad.

Re. syntax, I think you could shorten the query using FILTER (?resp IN
(...))
https://www.w3.org/TR/sparql11-query/#func-in
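
For example, the original filter could then be written as something
like this (just a sketch, reusing the prefixes from the original query):

FILTER (?resp IN (skos:altLabel, skos:prefLabel, skos:placeEvent,
    bdo:placeLat, bdo:placeLong, bdo:placeType, bdo:placeLocatedIn,
    owl:sameAs, tmp:entityScore))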

On Tue, 21 Jan 2020 at 20.03, Élie Roux <el...@telecom-bretagne.eu>
wrote:

> P.S.: some triples were missing for the dataset to work; here's an
> updated version. I've noticed that the performance is directly
> proportional to the number of tests in the FILTER: it's about 100ms per
> comparison... that seems a little excessive...
>
> Best,
> --
> Elie
>

Re: super slow filter

Posted by Élie Roux <el...@telecom-bretagne.eu>.
P.S.: some triples were missing for the dataset to work; here's an
updated version. I've noticed that the performance is directly
proportional to the number of tests in the FILTER: it's about 100ms per
comparison... that seems a little excessive...

Best,
-- 
Elie

Re: super slow filter

Posted by Élie Roux <el...@telecom-bretagne.eu>.
> Which describes the various options for optimization.

I have seen the various options, but I have not seen what the default is
when there is no .opt file in the directory (which is my case, and seems
to be the initial state). Or did I miss it?

Best,
-- 
Elie

Re: super slow filter

Posted by Rob Vesse <rv...@dotnetrdf.org>.
See https://jena.apache.org/documentation/tdb/optimizer.html#running-tdbstats

Which describes the various options for optimization.

Rob

On 22/01/2020, 09:32, "Élie Roux" <el...@telecom-bretagne.eu> wrote:

    Thanks for your answers! I'm trying to understand why tdbquery doesn't
    return any results, but in the meantime:
    
    > Apart from using stats.opt, with the option to manually tune the rules,
    > you have the option to use none.opt to stop any reordering of triple
    > patterns in bgps. That allows you to write the triple patterns in
    > optimal order for your data (which you have in this case).
    
    Actually I'm always careful to write my queries in the correct order,
    assuming that the optimizer does not reorder my BGP patterns. I
    realize I had assumed that this was the default behavior, but I'm not
    sure... I don't have any .opt file in my database directory and I
    can't find any info on what the default behavior is (at least not on
    https://jena.apache.org/documentation/tdb/optimizer.html). Is it
    documented somewhere? I think I'll open a JIRA ticket about that.
    
    Best,
    -- 
    Elie
    





Re: super slow filter

Posted by Élie Roux <el...@telecom-bretagne.eu>.
Thanks for your answers! I'm trying to understand why tdbquery doesn't
return any results, but in the meantime:

> Apart from using stats.opt, with the option to manually tune the rules,
> you have the option to use none.opt to stop any reordering of triple
> patterns in bgps. That allows you to write the triple patterns in
> optimal order for your data (which you have in this case).

Actually I'm always careful to write my queries in the correct order,
assuming that the optimizer does not reorder my BGP patterns. I
realize I had assumed that this was the default behavior, but I'm not
sure... I don't have any .opt file in my database directory and I
can't find any info on what the default behavior is (at least not on
https://jena.apache.org/documentation/tdb/optimizer.html). Is it
documented somewhere? I think I'll open a JIRA ticket about that.

Best,
-- 
Elie

Re: super slow filter

Posted by Dave Reynolds <da...@gmail.com>.
On 21/01/2020 20:37, Élie Roux wrote:

> which I believe might result in a penalty... although frankly, I still
> can't understand how a very basic bgp like
> 
>   21         (bgp
>   22           (triple bdr:G844 ?rel ?res)
>   23           (triple ?res rdf:type :Place)
>   24           (triple ?res skos:prefLabel ?reso)
> 
> can take 100ms. Is there a way to tune the optimization level or
> features, either per query or at the Fuseki level?

As Lorenz says, do you have a stats.opt file?

A possible explanation is that you might be using fixed.opt instead of 
stats.opt (or have some really out of date stats file).

With fixed.opt the optimizer will reorder based on the more grounded 
triples. In your case this is the second pattern in that block:

     (triple ?res rdf:type :Place)

If there are a lot of :Places compared to the number of properties of
your particular place bdr:G844, then this isn't optimal.
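
For illustration only, a rough sketch of the join order fixed.opt would
likely pick here (not the actual plan):

     ?res a :Place .        # most grounded pattern, evaluated first
     bdr:G844 ?rel ?res .
     ?res ?resp ?reso .

whereas the order as written starts from the single place bdr:G844,
which presumably has far fewer related resources.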

Apart from using stats.opt, with the option to manually tune the rules, 
you have the option to use none.opt to stop any reordering of triple 
patterns in bgps. That allows you to write the triple patterns in 
optimal order for your data (which you have in this case).

Dave

Re: super slow filter

Posted by Élie Roux <el...@telecom-bretagne.eu>.
> Try changing the query to put in a no-op that stops expansion.
>
> SELECT * {
>      :G844 ?rel ?res .
>      ?res a :Place .
>      ?res ?resp ?reso .
>      BIND(1 AS ?X)
>      FILTER (?resp = skos:altLabel || ?resp = skos:prefLabel || ?resp =
> skos:placeEvent || ?resp = bdo:placeLat || ?resp = bdo:placeLong ||
> ?resp = bdo:placeType || ?resp = bdo:placeLocatedIn || ?resp =
> owl:sameAs || ?resp = tmp:entityScore)
> }

Oh, that's a very useful trick, I wasn't aware of it. Thanks a lot!

> BTW
> IN is not always the same as FILTER-||. It is here but IN uses
> "sameTerm", not "="

Thanks for that too!

Best,
-- 
Elie

Re: super slow filter

Posted by Andy Seaborne <an...@apache.org>.
Try changing the query to put in a no-op that stops expansion.

SELECT * {
     :G844 ?rel ?res .
     ?res a :Place .
     ?res ?resp ?reso .
     BIND(1 AS ?X)
     FILTER (?resp = skos:altLabel || ?resp = skos:prefLabel || ?resp =
skos:placeEvent || ?resp = bdo:placeLat || ?resp = bdo:placeLong ||
?resp = bdo:placeType || ?resp = bdo:placeLocatedIn || ?resp =
owl:sameAs || ?resp = tmp:entityScore)
}
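
The same no-op can be dropped into the CONSTRUCT form of the original
query, e.g. (a sketch):

construct {
    ?res ?resp ?reso .
}
where {
    bdr:G844 ?rel ?res .
    ?res a bdo:Place .
    ?res ?resp ?reso .
    BIND(1 AS ?X)
    FILTER (?resp = skos:altLabel || ?resp = skos:prefLabel || ?resp =
skos:placeEvent || ?resp = bdo:placeLat || ?resp = bdo:placeLong ||
?resp = bdo:placeType || ?resp = bdo:placeLocatedIn || ?resp =
owl:sameAs || ?resp = tmp:entityScore)
}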

BTW, IN is not always the same as FILTER-||. It is here, but IN uses
"sameTerm", not "=".

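To illustrate the difference with a made-up example: if ?v is bound to
"1"^^xsd:decimal, then

FILTER (?v = 1)            # true:  "=" compares numeric values
FILTER (sameTerm(?v, 1))   # false: different datatype, so not the same RDF term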



Re: super slow filter

Posted by Élie Roux <el...@telecom-bretagne.eu>.
>      * documenting that fixed.opt is the default when there is no file
>      * documenting that --tdb should be preferred over --loc in most cases
>     in tdbquery
>
> These you can do yourself: find the relevant part of the website, hit the Improve this Page button at the top, and follow the instructions.

Ah yes, thanks! I've updated the page with the default behavior. For
the second one I was thinking of the help/error messages of the
command-line tool.

> This already exists and is done by setting context symbols to true/false as desired for a given optimisation (I think for the CLI it's something like --set <symbol>=false to disable a given optimisation). However, I don't think this is well documented; you can find the symbol values in the source code - https://github.com/apache/jena/blob/92788c44255569a7c62d915b1e59a7d340917065/jena-arq/src/main/java/org/apache/jena/query/ARQ.java#L323
>
> So for example you might do --set arq:optFilterPlacement=false

I think it would rather be optFilterImplicitJoin (transforming the
filter into a union), but I'm not entirely sure... Anyway, what I'm
looking for is a way to do that in the regular SPARQL queries I send
to Fuseki (on a query-by-query basis).

Best,
-- 
Elie

Re: super slow filter

Posted by Rob Vesse <rv...@dotnetrdf.org>.
Comments inline:

On 22/01/2020, 10:27, "Élie Roux" <el...@telecom-bretagne.eu> wrote:

    Thanks a lot, after some investigation, here are a few results:
    
    - the problem was that I had no .opt file and that the default
    behavior was fixed.opt (or so it seems); when I added a none.opt (or a
    stats.opt), the performance went from 1200ms to 250ms (for the version
    with the big filter)
    - the version with VALUES went down to 150ms using none.opt or
    stats.opt, which is really cool
    - the version with the big filter went down to 150ms when I turned off
    optimizations (the big union takes more time than a simple filter)
    
    I'll open the following JIRA issues:
     * documenting that fixed.opt is the default when there is no file
     * documenting that --tdb should be preferred over --loc in most cases
    in tdbquery

These you can do yourself: find the relevant part of the website, hit the Improve this Page button at the top, and follow the instructions.

     * feature request: ability to turn off some (or all) optimizations
    for a query, a bit like
    https://wiki.blazegraph.com/wiki/index.php/QueryHints

This already exists and is done by setting context symbols to true/false as desired for a given optimisation (I think for the CLI it's something like --set <symbol>=false to disable a given optimisation). However, I don't think this is well documented; you can find the symbol values in the source code - https://github.com/apache/jena/blob/92788c44255569a7c62d915b1e59a7d340917065/jena-arq/src/main/java/org/apache/jena/query/ARQ.java#L323

So for example you might do --set arq:optFilterPlacement=false 
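
Combined with the tdbquery invocation mentioned earlier in the thread,
that might look something like (a sketch, untested):

tdbquery --set arq:optFilterPlacement=false --loc=PATH_TO_YOUR_DB "YOUR QUERY HERE"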

Rob
    
    I'm satisfied with the current state.
    
    Best,
    -- 
    Elie
    





Re: super slow filter

Posted by Élie Roux <el...@telecom-bretagne.eu>.
Thanks a lot, after some investigation, here are a few results:

- the problem was that I had no .opt file and that the default
behavior was fixed.opt (or so it seems); when I added a none.opt (or a
stats.opt), the performance went from 1200ms to 250ms (for the version
with the big filter)
- the version with VALUES went down to 150ms using none.opt or
stats.opt, which is really cool
- the version with the big filter went down to 150ms when I turned off
optimizations (the big union takes more time than a simple filter)

I'll open the following JIRA issues:
 * documenting that fixed.opt is the default when there is no file
 * documenting that --tdb should be preferred over --loc in most cases
in tdbquery
 * feature request: ability to turn off some (or all) optimizations
for a query, a bit like
https://wiki.blazegraph.com/wiki/index.php/QueryHints

I'm satisfied with the current state.

Best,
-- 
Elie

Re: super slow filter

Posted by Lorenz Buehmann <bu...@informatik.uni-leipzig.de>.
If you use TDB as the backend, could you run

tdbquery --explain --loc=PATH_TO_YOUR_DB "YOUR QUERY HERE"

and share the output here?

Also, does a stats file exist for your dataset?

On 21.01.20 21:37, Élie Roux wrote:
> I'm starting to see what's going on: it seems that the optimization
> (according to sparql.org) gives
>
>  13     (disjunction
>  14       (assign ((?resp skos:altLabel))
>  15         (bgp
>  16           (triple bdr:G844 ?rel ?res)
>  17           (triple ?res rdf:type :Place)
>  18           (triple ?res skos:altLabel ?reso)
>  19         ))
>  20       (assign ((?resp skos:prefLabel))
>  21         (bgp
>  22           (triple bdr:G844 ?rel ?res)
>  23           (triple ?res rdf:type :Place)
>  24           (triple ?res skos:prefLabel ?reso)
>  25         ))
> [...] etc.
>
> instead of
>
>  13     (filter (in ?resp skos:altLabel skos:prefLabel skos:placeEvent
> :placeLat :placeLong :placeType :placeLocatedIn owl:sameAs
> tmp:entityScore)
>  14       (bgp
>  15         (triple bdr:G844 ?rel ?res)
>  16         (triple ?res rdf:type :Place)
>  17         (triple ?res ?resp ?reso)
>  18       ))))
>
>
> which I believe might result in a penalty... although frankly, I still
> can't understand how a very basic bgp like
>
>  21         (bgp
>  22           (triple bdr:G844 ?rel ?res)
>  23           (triple ?res rdf:type :Place)
>  24           (triple ?res skos:prefLabel ?reso)
>
> can take 100ms. Is there a way to tune the optimization level or
> features, either per query or at the Fuseki level?
>
> Best,

Re: super slow filter

Posted by Élie Roux <el...@telecom-bretagne.eu>.
I'm starting to see what's going on: it seems that the optimization
(according to sparql.org) gives

 13     (disjunction
 14       (assign ((?resp skos:altLabel))
 15         (bgp
 16           (triple bdr:G844 ?rel ?res)
 17           (triple ?res rdf:type :Place)
 18           (triple ?res skos:altLabel ?reso)
 19         ))
 20       (assign ((?resp skos:prefLabel))
 21         (bgp
 22           (triple bdr:G844 ?rel ?res)
 23           (triple ?res rdf:type :Place)
 24           (triple ?res skos:prefLabel ?reso)
 25         ))
[...] etc.

instead of

 13     (filter (in ?resp skos:altLabel skos:prefLabel skos:placeEvent
:placeLat :placeLong :placeType :placeLocatedIn owl:sameAs
tmp:entityScore)
 14       (bgp
 15         (triple bdr:G844 ?rel ?res)
 16         (triple ?res rdf:type :Place)
 17         (triple ?res ?resp ?reso)
 18       ))))


which I believe might result in a penalty... although frankly, I still
can't understand how a very basic bgp like

 21         (bgp
 22           (triple bdr:G844 ?rel ?res)
 23           (triple ?res rdf:type :Place)
 24           (triple ?res skos:prefLabel ?reso)

can take 100ms. Is there a way to tune the optimization level or
features, either per query or at the Fuseki level?

Best,
-- 
Elie

Re: super slow filter

Posted by Élie Roux <el...@telecom-bretagne.eu>.
> Have you tried using VALUES instead of FILTER ?

I have to say I was expecting it to give different results (in terms
of output), but you're right:

construct {
    ?res ?resp ?reso .
}
where {
  {
    bdr:G844 ?rel ?res .
    ?res a bdo:Place .
    VALUES ?resp { skos:altLabel skos:prefLabel skos:placeEvent
bdo:placeLat bdo:placeLong bdo:placeType bdo:placeLocatedIn owl:sameAs
tmp:entityScore }
    ?res ?resp ?reso .
  }
}

gives the same results. Still in 1.2s unfortunately.

Best,
-- 
Elie

Re: super slow filter

Posted by Thomas Francart <th...@sparna.fr>.
Hello

Have you tried using VALUES instead of FILTER ?

Thomas

On Tue, 21 Jan 2020 at 19:24, Élie Roux <el...@telecom-bretagne.eu>
wrote:

> Dear all,
>
> I have a (relatively large) dataset in Fuseki (default optimization
> settings) for which I have attached the relevant triples. The following
> query takes around 160ms (according to the Fuseki logs):
>
> construct {
>     ?res ?resp ?reso .
> }
> where {
>   {
>     bdr:G844 ?rel ?res .
>     ?res a bdo:Place .
>     ?res ?resp ?reso .
>   }
> }
>
> It's not great (at all), but I can live with it. The problem is that if
> I add this filter:
>
> FILTER (?resp = skos:altLabel || ?resp = skos:prefLabel || ?resp =
> skos:placeEvent || ?resp = bdo:placeLat || ?resp = bdo:placeLong ||
> ?resp = bdo:placeType || ?resp = bdo:placeLocatedIn || ?resp =
> owl:sameAs || ?resp = tmp:entityScore)
>
> which despite its length is quite simple, then the Fuseki logs
> indicate 1200ms!! I reproduced it at least 20 times.
>
> Attached are the complete slow and fast queries. Is there an obvious
> error in my query, or is it a performance issue in Fuseki? If so, I can
> report it on JIRA.
>
> Best,
> --
> Elie
>
-- 

*Thomas Francart* - *SPARNA*
Web of *data* | *Information* architecture | Access to *knowledge*
blog: blog.sparna.fr, site: sparna.fr, linkedin:
fr.linkedin.com/in/thomasfrancart
tel: +33 (0)6.71.11.25.97, skype: francartthomas