Posted to java-user@lucene.apache.org by Marios Skounakis <ms...@exis.com.gr> on 2006/03/29 12:09:53 UTC

Paging results

  Hi all,

I have the following issue (I am giving a quantified example so we can 
talk more concretely).

My documents have a docId field, stored as a keyword field.

I am executing searches that return between 2000 and 10000 documents and 
sorting the results by relevance (or sometimes alphabetically).

In every query, I need to discard some of the results based on their 
docId. I have a list of the docIds that need to be discarded in an 
array. The size of the list is usually about 100.

I generally perform paging, so I don't need to display all the results, 
but only those of the current page (e.g. results 100 to 120)

Currently, after getting the Hits object from the Searcher, I loop over 
the documents, retrieve their docId, and see if it is in the discard 
list. If not, I put the document in a validResult collection. When I 
have read enough valid results to be able to show the current page, I 
stop. E.g., in the above example, I would read until I had 120 valid 
results.
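
Roughly, the loop looks like this (names are illustrative, not my 
actual code):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.lucene.document.Document;
import org.apache.lucene.search.Hits;

// discardedIds is my String[] of ~100 ids, hits comes from the Searcher
Set discarded = new HashSet(Arrays.asList(discardedIds));
List validResults = new ArrayList();
int lastNeeded = 120;  // last result index needed for the current page

for (int i = 0; i < hits.length() && validResults.size() < lastNeeded; i++) {
    Document doc = hits.doc(i);                  // reads the stored document
    if (!discarded.contains(doc.get("docId")))   // skip ids in the discard list
        validResults.add(doc);
}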

The problem with this implementation is that I have to read the docId 
for the results that precede the current page, in order to determine if 
they are in the invalid list. So, when showing the pages near the last 
page, I have to read the entire result list. I don't like this because 
this means that the computation required to display a page is not 
constant but depends on the page's position. Anyway, what do the experts 
think? Is this implementation prohibitively expensive?

(As a sidenote, when calling hits.doc(i), does Lucene retrieve the whole 
document, or just a pointer to it, retrieving the data only when doing 
hits.doc(i).getField...?)

An alternative would be to extend the query to exclude the ids in the 
discard list. How would adding 100 exclusion clauses to the query impact 
the query's performance? Are there any studies on search speed in 
relation to the number of query clauses?

Is there another way to handle this issue?

Many thanks in advance,
Marios Skounakis



Re: Paging results

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Mar 29, 2006, at 5:09 AM, Marios Skounakis wrote:
> I am executing searches that return between 2000 and 10000  
> documents and sorting the results by relevance (or sometimes  
> alphabetically).
>
> In every query, I need to discard some of the results based on  
> their docId. I have a list of the docIds that need to be discarded  
> in an array. The size of the list is usually about 100.
>
> I generally perform paging, so I don't need to display all the  
> results, but only those of the current page (e.g. results 100 to 120)
>
> Currently, after getting the Hits object from the Searcher, I loop  
> over the documents, retrieve their docId, and see if it is in the  
> discard list. If not, I put the document in a validResult  
> collection. When I have read enough valid results to be able to  
> show the current page, I stop. E.g., in the above example, I would  
> read until I had 120 valid results.
>
> The problem with this implementation is that I have to read the  
> docId for the results that precede the current page, in order to  
> determine if they are in the invalid list. So, when showing the  
> pages near the last page, I have to read the entire result list. I  
> don't like this because this means that the computation required to  
> display a page is not constant but depends on the page's position.  
> Anyway, what do the experts think? Is this implementation  
> prohibitively expensive?

Yes, that implementation is going to be prohibitively expensive.  The  
recommended way to apply a search constraint like the one you describe  
is to use a Filter.  You could write a custom Filter that takes your  
discarded docId array, turns off the bits for those documents in the  
BitSet, and turns on all the rest.  Then use the search(Query,  
Filter) method signature for searching.
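
Something like this sketch (the class name is made up, and it assumes  
the stored "docId" keyword field from your mail; written against the  
Lucene 1.9 Filter API, where bits(IndexReader) returns a BitSet):

import java.io.IOException;
import java.util.BitSet;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.search.Filter;

public class DiscardIdsFilter extends Filter {

    private final String[] discardedIds;

    public DiscardIdsFilter(String[] discardedIds) {
        this.discardedIds = discardedIds;
    }

    public BitSet bits(IndexReader reader) throws IOException {
        BitSet bits = new BitSet(reader.maxDoc());
        bits.set(0, reader.maxDoc());        // turn on every document...
        for (int i = 0; i < discardedIds.length; i++) {
            TermDocs td = reader.termDocs(new Term("docId", discardedIds[i]));
            try {
                while (td.next()) {
                    bits.clear(td.doc());    // ...then turn off each discarded one
                }
            } finally {
                td.close();
            }
        }
        return bits;
    }
}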

> (As a sidenote, when calling hits.doc(i), does Lucene retrieve the  
> whole document, or just a pointer to it, retrieving the data only  
> when doing hits.doc(i).getField...?)

It retrieves the whole document currently.

> An alternative would be to extend the query to exclude the ids in  
> the discard list. How would adding 100 exclusion clauses to the  
> query impact the query's performance? Are there any studies on  
> search speed in relation to the number of query clauses?

Adding them to the query would certainly work, but a Filter is a  
cleaner and faster way for your scenario.
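
For example, combining the query, the filter above, and your sort in  
one call (variable names are placeholders):

Hits hits = searcher.search(query, new DiscardIdsFilter(discardedIds),
                            Sort.RELEVANCE);

Pass a field sort (e.g. new Sort("title"), for a hypothetical title  
field) instead of Sort.RELEVANCE for your alphabetical case.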

	Erik




Re: Paging results

Posted by Volodymyr Bychkoviak <vb...@i-hypergrid.com>.
Hi

Marios Skounakis wrote:
>  Hi all,
>
> I have the following issue (I am giving a quantified example so we can 
> talk more concretely)
>
> My documents have a docId field, stored as a keyword field.
>
> I am executing searches that return between 2000 and 10000 documents 
> and sorting the results by relevance (or sometimes alphabetically).
>
> In every query, I need to discard some of the results based on their 
> docId. I have a list of the docIds that need to be discarded in an 
> array. The size of the list is usually about 100.
>
> I generally perform paging, so I don't need to display all the 
> results, but only those of the current page (e.g. results 100 to 120)
>
> Currently, after getting the Hits object from the Searcher, I loop 
> over the documents, retrieve their docId, and see if it is in the 
> discard list. If not, I put the document in a validResult collection. 
> When I have read enough valid results to be able to show the current 
> page, I stop. E.g., in the above example, I would read until I had 120 
> valid results.
>
> The problem with this implementation is that I have to read the docId 
> for the results that precede the current page, in order to determine 
> if they are in the invalid list. So, when showing the pages near the 
> last page, I have to read the entire result list. I don't like this 
> because this means that the computation required to display a page is 
> not constant but depends on the page's position. Anyway, what do the 
> experts think? Is this implementation prohibitively expensive?
>
> (As a sidenote, when calling hits.doc(i), does Lucene retrieve the 
> whole document, or just a pointer to it, retrieving the data only 
> when doing hits.doc(i).getField...?)
Yes, Lucene does retrieve the whole document on this call.
>
> An alternative would be to extend the query to exclude the ids in the 
> discard list. How would adding 100 exclusion clauses to the query 
> impact the query's performance? Are there any studies on search speed 
> in relation to the number of query clauses?
That's the preferred way of doing such things. Adding 100 exclusion 
clauses to the query will be much less expensive than retrieving 
documents and checking their field.
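
For example, something like this (userQuery and ids are placeholder 
names; BooleanClause.Occur is the Lucene 1.9 API — on older versions 
use add(query, false, true) for prohibited clauses):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

BooleanQuery combined = new BooleanQuery();
combined.add(userQuery, BooleanClause.Occur.MUST);  // the original query
for (int i = 0; i < ids.length; i++) {
    // each discarded docId becomes a prohibited clause
    combined.add(new TermQuery(new Term("docId", ids[i])),
                 BooleanClause.Occur.MUST_NOT);
}

Note that ~100 clauses is well below BooleanQuery's default 
maxClauseCount of 1024, so you won't hit TooManyClauses.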
>
> Is there another way to handle this issue?
>
> Many thanks in advance,
> Marios Skounakis

-- 
regards,
Volodymyr Bychkoviak