You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by Aleksey <bi...@gmail.com> on 2013/05/07 04:15:19 UTC

TermRangeQuery performance oddness

Hi guys,

If I run 2 term range queries:

new TermRangeQuery("title", new BytesRef("A"), null, true, true);
and
new TermRangeQuery("title", new BytesRef("Z"), null, true, true);

The one that starts with "Z" is several times faster (I make 1000
queries in a loop to measure). I understand that the first one has
much larger hit number, but if the query is bounded to 50 results, why
does that matter?
At first I thought that it grabs all hits and sorts them, but then it
doesn't seem to make any difference whether or not I pass sort by
"title" parameter to the searcher. Results are either sorted or kind
of random, but speed is the same.
Why is that?

Thank you in advance,

Aleksey

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: TermRangeQuery performance oddness

Posted by Aleksey <bi...@gmail.com>.

Thank you for the detailed answer. I'll look into the FieldCacheRangeFilter.
Is there another way of getting a page of results starting with some
term that does not have similar issue? Basically I want to implement
pagination of docs sorted by title, next page starting from last doc
of the previous, and while, typically, getting the first page is
fastest, in my case it ends up being slowest. Is it possible to
disable counting hits and just return top 50 results that qualify the
query?

Also, if the hit count is a direct influence on the query speed, how
come some other queries that span a larger or similar set are faster?
Say, a TermQuery for a term that happens to be the same for majority
of docs and spans more docs than the above termrange.

Aleksey




On Tue, May 7, 2013 at 12:13 AM, Uwe Schindler <uw...@thetaphi.de> wrote:
> Hi,
>
> The problem is by design: Lucene is an inverted index, so lookups can only be done by single terms and find the documents related to every single term. To execute a range, the query first have to position the terms enum on the first term and then iterate over all *terms* in the index (not documents) until the last term is reached. If the number of terms in the field is large (because you have many distinct values), this takes some time. For every term in the enumeration that matches the range, Lucene has to look up all matching documents in the posting list and report them as hits (using a bitset). The latter (looking up the posting lists involves lots of work), so ranges with thousands of terms will get slow.
>
> So the time depends: How many terms are in your term dictionary between the lower bound and the higher bound of your range, not really the size of the index (although this is quite often directly related).
>
> If you want faster range queries, use maybe NumericRangeQuery, because this has some optimizations on the cost of a large index size. But if you are stuck with text, you may also review FieldCacheRangeFilter (which only works for untokenized fields, but I assume from your example "title" is not tokenized).
>
> The order of results of a range query is in "index order", because there is no TF-IDF ranking involved (all hits have the same score of 1). Index order means the order in which they were indexed.
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
>
>> -----Original Message-----
>> From: Aleksey [mailto:bittercold@gmail.com]
>> Sent: Tuesday, May 07, 2013 4:15 AM
>> To: java-user@lucene.apache.org
>> Subject: TermRangeQuery performance oddness
>>
>> Hi guys,
>>
>> If I run 2 term range queries:
>>
>> new TermRangeQuery("title", new BytesRef("A"), null, true, true); and new
>> TermRangeQuery("title", new BytesRef("Z"), null, true, true);
>>
>> The one that starts with "Z" is several times faster (I make 1000 queries in a
>> loop to measure). I understand that the first one has much larger hit number,
>> but if the query is bounded to 50 results, why does that matter?
>> At first I thought that it grabs all hits and sorts them, but then it doesn't seem
>> to make any difference whether or not I pass sort by "title" parameter to the
>> searcher. Results are either sorted or kind of random, but speed is the same.
>> Why is that?
>>
>> Thank you in advance,
>>
>> Aleksey
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

RE: TermRangeQuery performance oddness

Posted by Uwe Schindler <uw...@thetaphi.de>.

Hi,

The problem is by design: Lucene is an inverted index, so lookups can only be done by single terms and find the documents related to every single term. To execute a range, the query first have to position the terms enum on the first term and then iterate over all *terms* in the index (not documents) until the last term is reached. If the number of terms in the field is large (because you have many distinct values), this takes some time. For every term in the enumeration that matches the range, Lucene has to look up all matching documents in the posting list and report them as hits (using a bitset). The latter (looking up the posting lists involves lots of work), so ranges with thousands of terms will get slow.

So the time depends: How many terms are in your term dictionary between the lower bound and the higher bound of your range, not really the size of the index (although this is quite often directly related).

If you want faster range queries, use maybe NumericRangeQuery, because this has some optimizations on the cost of a large index size. But if you are stuck with text, you may also review FieldCacheRangeFilter (which only works for untokenized fields, but I assume from your example "title" is not tokenized).

The order of results of a range query is in "index order", because there is no TF-IDF ranking involved (all hits have the same score of 1). Index order means the order in which they were indexed.

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Aleksey [mailto:bittercold@gmail.com]
> Sent: Tuesday, May 07, 2013 4:15 AM
> To: java-user@lucene.apache.org
> Subject: TermRangeQuery performance oddness
> 
> Hi guys,
> 
> If I run 2 term range queries:
> 
> new TermRangeQuery("title", new BytesRef("A"), null, true, true); and new
> TermRangeQuery("title", new BytesRef("Z"), null, true, true);
> 
> The one that starts with "Z" is several times faster (I make 1000 queries in a
> loop to measure). I understand that the first one has much larger hit number,
> but if the query is bounded to 50 results, why does that matter?
> At first I thought that it grabs all hits and sorts them, but then it doesn't seem
> to make any difference whether or not I pass sort by "title" parameter to the
> searcher. Results are either sorted or kind of random, but speed is the same.
> Why is that?
> 
> Thank you in advance,
> 
> Aleksey
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org