You are viewing a plain text version of this content. The canonical link for it is here.

Posted to users@jena.apache.org by Andrii Berezovskyi <an...@kth.se> on 2022/02/09 16:42:52 UTC

Query performance with an ORDER BY

Hello,

We need to do paging in our app and as we don’t have a single property on which we can do a WHERE cutoff, we use an OFFSET/LIMIT. The OFFSET works fairly fast on the small values, however when we add an ORDER BY clause (as per SPARQL 1.1 spec as OFFSET is not guaranteed to make sense otherwise), queries become quite slow (5s vs <<1s).

The query we run:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
DESCRIBE ?s
WHERE
{
                GRAPH <urn:x-arq:DefaultGraph>
                {
                                SELECT DISTINCT ?s
                                WHERE
                                { ?s ?p ?o ;
                                  rdf:type <someType>
                                }
                                ORDER BY ASC(?s)
                                OFFSET 100
                                LIMIT 21
                }
}

An “optimization” below works (<<1s eval time) but obviously will not guarantee that the inner LIMIT is always applied in the same manner and the paging beyond ### items won’t work:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
DESCRIBE ?s
WHERE
{
                GRAPH <urn:x-arq:DefaultGraph>
                {
                                SELECT DISTINCT ?s
                                WHERE
                                { ?s ?p ?o ;
                                  rdf:type <someType>
                                }
                                LIMIT 1000
                }
}
ORDER BY ASC(?s)
OFFSET 100
LIMIT 21

We’ve read that the order is mostly stable even without the ORDER [1]. Is there a better way to do this? We were assuming that sorting on the subject URI will result into some optimization but it does not seem to be the case. Is it possible to add some way to sort on NodeId?

Also, there is a problem with OFFSET/LIMIT approach even without an order as the OFFSET grows [2]. What is the recommended approach to a stable paging that would scale well? In SQL, seek method [3] is considered appropriate for most DBs. I tried replicating that, adding a random int ID to every resource and using FILTER(?ord > ###) to do the paging, using the max value from a page as an argument for the next. However, this again works fast only if the ORDER BY clause is missing, which seems to be essential to get a stable sorting (but at least allows to add a custom sort order without requiring an ORDER BY clause if one wishes to live a dangerous life). What is the best way to do stable paging in Jena/SPARQL?

Thanks in advance,
Andrew

[1]: https://markmail.org/search/?q=offset+order+list%3Aorg.apache.incubator.jena-users#query:offset%20order%20list%3Aorg.apache.incubator.jena-users+page:1+mid:fik5kllpnd4sm3tl+state:results
[2]: https://markmail.org/search/?q=offset+list%3Aorg.apache.incubator.jena-users#query:offset%20list%3Aorg.apache.incubator.jena-users+page:1+mid:uhzxkaxstbzfetns+state:results
[3]: https://blog.jooq.org/faster-sql-paging-with-jooq-using-the-seek-method/

Re: Query performance with an ORDER BY

Posted by Andy Seaborne <an...@apache.org>.

Hi Andrew,

On 09/02/2022 16:42, Andrii Berezovskyi wrote:
> Hello,
> 
> We need to do paging in our app and as we don’t have a single property on which we can do a WHERE cutoff, we use an OFFSET/LIMIT. The OFFSET works fairly fast on the small values, however when we add an ORDER BY clause (as per SPARQL 1.1 spec as OFFSET is not guaranteed to make sense otherwise), queries become quite slow (5s vs <<1s).

The thing about ORDER BY is that it has to scan the whole of the 
possible results to know which the lowest or highest element is. So it 
solves the issue (predictable slices) but in an expensive way.

> The query we run:
> 
> PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
> DESCRIBE ?s
> WHERE
> {
>                  GRAPH <urn:x-arq:DefaultGraph>
>                  {
>                                  SELECT DISTINCT ?s
>                                  WHERE
>                                  { ?s ?p ?o ;
>                                    rdf:type <someType>
>                                  }

There is no need to have the ?s ?p ?o : { ?s rdf:type <someType> } is 
enough. DISTINCT is not needed.  ?p ?o are not in the projection.

>                                  ORDER BY ASC(?s)
>                                  OFFSET 100
>                                  LIMIT 21
>                  }
> }
> 
> An “optimization” below works (<<1s eval time) but obviously will not guarantee that the inner LIMIT is always applied in the same manner and the paging beyond ### items won’t work:
> 
> PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
> DESCRIBE ?s
> WHERE
> {
>                  GRAPH <urn:x-arq:DefaultGraph>
>                  {
>                                  SELECT DISTINCT ?s
>                                  WHERE
>                                  { ?s ?p ?o ;
>                                    rdf:type <someType>
>                                  }
>                                  LIMIT 1000
>                  }
> }
> ORDER BY ASC(?s)
> OFFSET 100
> LIMIT 21
> 
> We’ve read that the order is mostly stable even without the ORDER [1]. Is there a better way to do this? We were assuming that sorting on the subject URI will result into some optimization but it does not seem to be the case.

There is an optimization applies (TopN) which applies if but the thing 
about ORDER BY is that it has to scan the whole of the possible results 
to know which the lowest or highest element is. But it is only applied 
for up to top 1000 (the end of offset-limit).  It is a little more 
expensive so when the amount to sort is large enough, it switch back to 
normal sorting. Context setting ARQ.topNSortingThreshold. 999 optimizes, 
1000 does not.

In the absence of updates, then

SELECT ?s { ?s rdf:type <someType>  } OFFSET / LIMIT

will slice the data from TDB in Jena.

> Is it possible to add some way to sort on NodeId?

No - not from outside.

> Also, there is a problem with OFFSET/LIMIT approach even without an order as the OFFSET grows [2]. What is the recommended approach to a stable paging that would scale well? In SQL, seek method [3] is considered appropriate for most DBs. I tried replicating that, adding a random int ID to every resource and using FILTER(?ord > ###) to do the paging, using the max value from a page as an argument for the next. However, this again works fast only if the ORDER BY clause is missing, which seems to be essential to get a stable sorting (but at least allows to add a custom sort order without requiring an ORDER BY clause if one wishes to live a dangerous life). What is the best way to do stable paging in Jena/SPARQL?

Suggestion if you have some control over the client-side or you are 
running with a loacl database (Sorry - I can't rmember what your setup 
is, Fuseki or local).

DESCRIBE complicates the situation.

If there are no bnodes, DESCRIBE is like SELECT * { ?s ?p ?o }.

For many queries (not sorting), the results stream end-to-end.  The 
client low level can issue one

SELECT * { ?s rdf:type <sometype> . ?s ?p ?o }.

and, client-side, groups the results into DESCRIBE or DESCRIBE like chunks.

This relies on how ARQ will execute the query.

It also assumes the paging will consume all the query results, not look 
at a few pages then drop the rest.  If local, ending the query early is 
fine. Remote, it messes with the HTTP connection; a half-way solution is 
query for a few pages of results at a time, rather than one.

====

Testing the water here:

Would a non-standard solution work ?

Either syntax in the query, or (Fuseki) a special protocol to fetch 
pages from a single query.

This would be a Jena-only.

     Andy

> 
> Thanks in advance,
> Andrew
> 
> [1]: https://markmail.org/search/?q=offset+order+list%3Aorg.apache.incubator.jena-users#query:offset%20order%20list%3Aorg.apache.incubator.jena-users+page:1+mid:fik5kllpnd4sm3tl+state:results
> [2]: https://markmail.org/search/?q=offset+list%3Aorg.apache.incubator.jena-users#query:offset%20list%3Aorg.apache.incubator.jena-users+page:1+mid:uhzxkaxstbzfetns+state:results
> [3]: https://blog.jooq.org/faster-sql-paging-with-jooq-using-the-seek-method/