Posted to dev@sling.apache.org by Ian Boston <ie...@tfd.co.uk> on 2009/06/15 18:52:50 UTC

Queries with large result sets and sorting.

Hi,

I want to perform a query where the full result set could be millions
of items. That set needs to be sorted by the lastModified attribute on
the node, and I only want to see a small number of items (e.g. 100)
after a particular date.

If I do this, will there be scalability issues, or is sorting on a
date field optimized in the query engine?

Thanks
Ian

Re: Queries with large result sets and sorting.

Posted by Ian Boston <ie...@tfd.co.uk>.
Will do, thanks for the pointer.
Ian

On 16 Jun 2009, at 16:08, Felix Meschberger wrote:

> Hi Ian,
>
> You might want to ask this question on the Jackrabbit dev list,
> where the JCR query implementation specialists are also lurking.
>
> Regards
> Felix


Re: Queries with large result sets and sorting.

Posted by Felix Meschberger <fm...@gmail.com>.
Hi Ian,

You might want to ask this question on the Jackrabbit dev list,
where the JCR query implementation specialists are also lurking.

Regards
Felix



Re: Queries with large result sets and sorting.

Posted by Ian Boston <ie...@tfd.co.uk>.
Yes, it was that thread that triggered the concern.

"
Obviously, a stupid example, but, unfortunately, not really to much to  
do about it except not sorting on large property fields...if you need  
to sort on the title of 200.000 docs...you better sort on the  
short_title (which I would prefer to be an index only property defined  
in indexing_configuration, but I think people have different opinions  
on this) "

Does that mean that sorting huge numbers of documents on *small*
fields has a similar problem? Unless there is a pre-query to estimate
the number of hits (using the Jackrabbit getTotalCount() methods,
IIRC), it's going to be impossible to avoid submitting queries that
could result in a sort on huge numbers of results.

I had thought that, in the special case of sorting by date, the
solution would be to split the search up into chunks to avoid a
massive sort:

e.g. search for items in the last hour and, if there are not enough
items, search the hour before that, extending the range each time;

or perform a sequence of unsorted pre-queries on a date range to
determine the range required to return a set large enough to sort.
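
The first chunking idea above can be sketched roughly as follows. This
is only an illustration in Python, not Jackrabbit code: fetch_range is
a hypothetical stand-in for an unsorted repository query restricted to
a lastModified range, and the item shape, the window sizes and all
names are assumptions made for the example.

```python
from datetime import datetime, timedelta
from typing import Callable, List, Tuple

# Hypothetical item shape: (node path, lastModified).
Item = Tuple[str, datetime]


def newest_items(
    fetch_range: Callable[[datetime, datetime], List[Item]],
    now: datetime,
    wanted: int = 100,
    first_window: timedelta = timedelta(hours=1),
    max_window: timedelta = timedelta(days=3650),
) -> List[Item]:
    """Return the `wanted` newest items without sorting the full set.

    fetch_range(start, end) is assumed to return, unsorted, every item
    with start <= lastModified <= end.
    """
    window = first_window
    while True:
        hits = fetch_range(now - window, now)
        # Stop once the window holds enough items, or once it cannot
        # grow any further.
        if len(hits) >= wanted or window >= max_window:
            break
        window = min(window * 2, max_window)
    # Only the candidates from the final window are sorted, newest first.
    hits.sort(key=lambda item: item[1], reverse=True)
    return hits[:wanted]


# Small in-memory demonstration: one fake node every 10 minutes.
now = datetime(2009, 6, 15, 18, 0, 0)
store = [(f"/content/node{i}", now - timedelta(minutes=10 * i))
         for i in range(1000)]


def fetch_from_store(start: datetime, end: datetime) -> List[Item]:
    return [item for item in store if start <= item[1] <= end]


top = newest_items(fetch_from_store, now, wanted=100)
print(len(top), top[0][0], top[-1][0])
```

Everything inside the final window is newer than everything outside
it, so sorting just that window gives the same first `wanted` items as
sorting the full set. The cost is a few extra range queries, and a
dense window can still return far more than `wanted` candidates. The
second idea would presumably differ only in using a count-style
pre-query to pick the window before issuing one sorted query.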

On the basis that lastModified is a long, am I worrying unnecessarily?

Ian


On 15 Jun 2009, at 17:59, Marc Speck wrote:

> I've just read the thread http://markmail.org/message/wnn2bfwzwx2hn6v4 .
> Maybe it helps,
> Marc


Re: Queries with large result sets and sorting.

Posted by Marc Speck <ma...@gmail.com>.
I've just read the thread http://markmail.org/message/wnn2bfwzwx2hn6v4 .
Maybe it helps,
Marc


