You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@jena.apache.org by Paolo Castagna <ca...@googlemail.com> on 2011/08/25 22:59:47 UTC
On SPARQL queries with DISTINCT + ORDER BY + LIMIT
Hi,
I was about to create a new JIRA issue (an improvement) to 'optimise' SPARQL
queries with DISTINCT + ORDER BY + LIMIT. However, while I was writing it
I convinced myself it's not really necessary. Here is why.
In JENA-89 we implemented a QueryIterTopN using a PriorityQueue to improve
the scalability of ORDER BY + LIMIT queries avoiding a total sort.
In JENA-90 we want to reduce the amount of memory used by QueryIterDistinct
replacing an OpDistinct with an OpReduced for DISTINCT + ORDER BY queries
avoiding to keep an in-memory data structure of all the already seen bindings.
What can we do about DISTINCT + ORDER BY + LIMIT queries?
We could provide a new QueryIterTopNDistinct which adds to a PriorityQueue
if and only if a binding is not already there. So, this can be viewed as a
further improvement of JENA-89.
However, I am not convinced anymore that this is really useful or a good
idea, since we want to use QueryIterTopN (i.e. heap) for relatively small
N in our LIMIT N clause.
If N is large, the optimisation described in JENA-90 kicks in and the
slicing is cheap. If N is small, JENA-89 kicks in and the DISTINCT over
a small number of results is cheap.
Therefore we do not need to do anything special for DISTINCT + ORDER BY +
LIMIT.
It's better, as Andy suggested, to invest on 'clever' caching and merge
joins in TDB. There's not yet a JIRA issue for merge joins in TDB.
Paolo
Re: On SPARQL queries with DISTINCT + ORDER BY + LIMIT
Posted by Paolo Castagna <ca...@googlemail.com>.
Paolo Castagna wrote:
> Hi,
> I was about to create a new JIRA issue (an improvement) to 'optimise' SPARQL
> queries with DISTINCT + ORDER BY + LIMIT. However, while I was writing it
> I convinced myself it's not really necessary. Here is why.
>
> In JENA-89 we implemented a QueryIterTopN using a PriorityQueue to improve
> the scalability of ORDER BY + LIMIT queries avoiding a total sort.
> In JENA-90 we want to reduce the amount of memory used by QueryIterDistinct
> replacing an OpDistinct with an OpReduced for DISTINCT + ORDER BY queries
> avoiding to keep an in-memory data structure of all the already seen bindings.
>
> What can we do about DISTINCT + ORDER BY + LIMIT queries?
>
> We could provide a new QueryIterTopNDistinct which adds to a PriorityQueue
> if and only if a binding is not already there. So, this can be viewed as a
> further improvement of JENA-89.
>
> However, I am not convinced anymore that this is really useful or a good
> idea, since we want to use QueryIterTopN (i.e. heap) for relatively small
> N in our LIMIT N clause.
>
> If N is large, the optimisation described in JENA-90 kicks in and the
> slicing is cheap. If N is small, JENA-89 kicks in and the DISTINCT over
> a small number of results is cheap.
>
> Therefore we do not need to do anything special for DISTINCT + ORDER BY +
> LIMIT.
I was wrong. :-)
Now, I better understand how the two optimizations (in JENA-89 and JENA-90)
interact each others and I think there is another small step left to do.
I described the situation in JENA-108:
https://issues.apache.org/jira/browse/JENA-108
Maybe, I could have done everything in one shot... but, since on this I am
learning while doing, a small step at the time seems the best thing.
Paolo
> It's better, as Andy suggested, to invest on 'clever' caching and merge
> joins in TDB. There's not yet a JIRA issue for merge joins in TDB.
>
> Paolo