You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@jena.apache.org by Paolo Castagna <ca...@googlemail.com> on 2011/08/25 22:59:47 UTC

On SPARQL queries with DISTINCT + ORDER BY + LIMIT

Hi,
I was about to create a new JIRA issue (an improvement) to 'optimise' SPARQL
queries with DISTINCT + ORDER BY + LIMIT. However, while I was writing it
I convinced myself it's not really necessary. Here is why.

In JENA-89 we implemented a QueryIterTopN using a PriorityQueue to improve
the scalability of ORDER BY + LIMIT queries avoiding a total sort.
In JENA-90 we want to reduce the amount of memory used by QueryIterDistinct
replacing an OpDistinct with an OpReduced for DISTINCT + ORDER BY queries
avoiding to keep an in-memory data structure of all the already seen bindings.

What can we do about DISTINCT + ORDER BY + LIMIT queries?

We could provide a new QueryIterTopNDistinct which adds to a PriorityQueue
if and only if a binding is not already there. So, this can be viewed as a
further improvement of JENA-89.

However, I am not convinced anymore that this is really useful or a good
idea, since we want to use QueryIterTopN (i.e. heap) for relatively small
N in our LIMIT N clause.

If N is large, the optimisation described in JENA-90 kicks in and the
slicing is cheap. If N is small, JENA-89 kicks in and the DISTINCT over
a small number of results is cheap.

Therefore we do not need to do anything special for DISTINCT + ORDER BY +
LIMIT.

It's better, as Andy suggested, to invest on 'clever' caching and merge
joins in TDB. There's not yet a JIRA issue for merge joins in TDB.

Paolo

Re: On SPARQL queries with DISTINCT + ORDER BY + LIMIT

Posted by Paolo Castagna <ca...@googlemail.com>.

Paolo Castagna wrote:
> Hi,
> I was about to create a new JIRA issue (an improvement) to 'optimise' SPARQL
> queries with DISTINCT + ORDER BY + LIMIT. However, while I was writing it
> I convinced myself it's not really necessary. Here is why.
> 
> In JENA-89 we implemented a QueryIterTopN using a PriorityQueue to improve
> the scalability of ORDER BY + LIMIT queries avoiding a total sort.
> In JENA-90 we want to reduce the amount of memory used by QueryIterDistinct
> replacing an OpDistinct with an OpReduced for DISTINCT + ORDER BY queries
> avoiding to keep an in-memory data structure of all the already seen bindings.
> 
> What can we do about DISTINCT + ORDER BY + LIMIT queries?
> 
> We could provide a new QueryIterTopNDistinct which adds to a PriorityQueue
> if and only if a binding is not already there. So, this can be viewed as a
> further improvement of JENA-89.
> 
> However, I am not convinced anymore that this is really useful or a good
> idea, since we want to use QueryIterTopN (i.e. heap) for relatively small
> N in our LIMIT N clause.
> 
> If N is large, the optimisation described in JENA-90 kicks in and the
> slicing is cheap. If N is small, JENA-89 kicks in and the DISTINCT over
> a small number of results is cheap.
> 
> Therefore we do not need to do anything special for DISTINCT + ORDER BY +
> LIMIT.

I was wrong. :-)

Now, I better understand how the two optimizations (in JENA-89 and JENA-90)
interact each others and I think there is another small step left to do.

I described the situation in JENA-108:
https://issues.apache.org/jira/browse/JENA-108

Maybe, I could have done everything in one shot... but, since on this I am
learning while doing, a small step at the time seems the best thing.

Paolo

> It's better, as Andy suggested, to invest on 'clever' caching and merge
> joins in TDB. There's not yet a JIRA issue for merge joins in TDB.
> 
> Paolo