Posted to java-user@lucene.apache.org by Michael Sokolov <ms...@safaribooksonline.com> on 2013/03/01 13:41:18 UTC

Re: TopDocCollector vs TopScoreDocCollector (semantics changed in 4.0, not backward compatible)

On 2/28/2013 5:05 PM, Uwe Schindler wrote:
> ...  Collector instead of HitCollector (like your ancient Lucene from 2.4), you have to respect the new semantics, which are *different* from those of the old HitCollector. Collector works with low-level atomic readers (also in Lucene 3.x). The calls to the "collect(int)" method do *not* use global document IDs, so using an IndexReader from outside does not work and will never work - PERIOD. The document IDs are only *relative* to the atomic reader that was passed to the collector by setNextReader() before a sequence of collect() calls. To make global docIds out of them, you may use readerContext.docBase, but this is slower than using the low-level atomic reader.
>
Uwe, thanks for this lucid explanation!  I wonder if you wouldn't mind 
elaborating a bit on the slowdown you refer to from using docBase to 
absolutize docIDs.  I have a use case where I need to pass control to my 
caller, allowing them to *pull* results - so I don't know how many I 
will need.  In the case where documents are returned in docID order, 
the code is actually pretty straightforward: I iterate over the atomic 
readers and pull results from each in turn.  Are you saying that is 
slower because it prevents multi-threading, or is there some other reason?
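
A minimal sketch of the pull-style iteration described above, written as a stand-alone model rather than against the Lucene API. SegmentHits and PullingDocIdIterator are hypothetical names: the first stands in for the matches of one atomic reader together with its docBase, the second lets a caller pull global docIDs one at a time without knowing in advance how many will be needed.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical stand-in for one atomic reader's matches: local doc IDs in
// increasing order, plus that segment's docBase. Not a Lucene class.
final class SegmentHits {
    final int docBase;
    final int[] localDocs;

    SegmentHits(int docBase, int[] localDocs) {
        this.docBase = docBase;
        this.localDocs = localDocs;
    }
}

// Lets the caller pull global doc IDs one at a time, segment after segment.
final class PullingDocIdIterator implements Iterator<Integer> {
    private final SegmentHits[] segments;
    private int seg = 0; // index of the current segment
    private int pos = 0; // position inside the current segment's hits

    PullingDocIdIterator(SegmentHits... segments) {
        this.segments = segments;
    }

    @Override
    public boolean hasNext() {
        // Skip segments that are exhausted or have no hits at all.
        while (seg < segments.length && pos >= segments[seg].localDocs.length) {
            seg++;
            pos = 0;
        }
        return seg < segments.length;
    }

    @Override
    public Integer next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        SegmentHits s = segments[seg];
        return s.docBase + s.localDocs[pos++]; // local -> global is one addition
    }
}
```

In this model the global IDs come out in increasing order as long as the segments are passed in docBase order and each segment's local IDs are already sorted.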

-Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: TopDocCollector vs TopScoreDocCollector (semantics changed in 4.0, not backward compatible)

Posted by Michael Sokolov <ms...@safaribooksonline.com>.

On 03/01/2013 07:56 AM, Uwe Schindler wrote:
> The slowdown does not come from making the doc IDs absolute (that is just an addition). It appears when you retrieve the stored fields from the top-level reader, because the composite top-level reader has to do a binary search in the reader tree to find the correct reader. This answer was related to the code pasted by the user who asked this question.
Thanks - that's added a new nugget to my understanding :)

-Mike

RE: TopDocCollector vs TopScoreDocCollector (semantics changed in 4.0, not backward compatible)

Posted by Uwe Schindler <uw...@thetaphi.de>.

The slowdown does not come from making the doc IDs absolute (that is just an addition). It appears when you retrieve the stored fields from the top-level reader, because the composite top-level reader has to do a binary search in the reader tree to find the correct reader. This answer was related to the code pasted by the user who asked this question.
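
To illustrate the lookup cost, here is a stand-alone model of the binary search a composite reader needs in order to map a global doc ID back to a sub-reader. This is a sketch, not Lucene's own code: the class DocStarts and the method subIndex are names chosen for illustration, and it assumes docStarts holds the docBase of each sub-reader in increasing order, starting at 0.

```java
// Stand-alone model of the lookup; not taken from Lucene.
final class DocStarts {

    // docStarts[i] is the docBase of sub-reader i (increasing, docStarts[0] == 0).
    // Returns the index of the last sub-reader whose docBase is <= globalDoc,
    // which also skips over empty sub-readers that share a docBase.
    static int subIndex(int globalDoc, int[] docStarts) {
        int lo = 0;
        int hi = docStarts.length - 1;
        while (lo < hi) {
            int mid = (lo + hi + 1) >>> 1; // upper middle, so the loop always makes progress
            if (docStarts[mid] <= globalDoc) {
                lo = mid;
            } else {
                hi = mid - 1;
            }
        }
        return lo;
    }

    public static void main(String[] args) {
        int[] docStarts = {0, 5, 12};
        int globalDoc = 7;
        int sub = subIndex(globalDoc, docStarts);
        int localDoc = globalDoc - docStarts[sub]; // undo the docBase addition
        System.out.println("sub-reader " + sub + ", local doc " + localDoc);
    }
}
```

Going from local to global is a single addition, while going back costs a search that is logarithmic in the number of sub-readers for every document looked up.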

If you need top-level doc IDs because you present the global doc IDs to the user (this is how TopScoreDocCollector works, for example), you can of course add the doc base. But inside the collector it makes absolutely no sense to transform the local, relative doc IDs into absolute ones just to call a method on a top-level reader that then has to do the opposite with a binary search. In that case, use the AtomicReader directly. If you also access FieldCache, working with absolute doc IDs additionally wastes megabytes of memory and causes FieldCache insanity.
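
The pattern can be sketched with a small stand-alone model. Nothing here is the Lucene API: SegmentModel is a hypothetical stand-in for an atomic reader with one stored value per local doc ID, and TitleCollector only mirrors the call order described in this thread (setNextReader once per segment, then collect for each hit). The collector reads from the current segment with the local ID and adds docBase only for the IDs it reports to the caller.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an atomic reader: stored values indexed by local doc ID.
final class SegmentModel {
    final int docBase;
    final String[] titles;

    SegmentModel(int docBase, String... titles) {
        this.docBase = docBase;
        this.titles = titles;
    }
}

// Model of a collector that stays on the per-segment level.
final class TitleCollector {
    final List<Integer> globalDocs = new ArrayList<>();
    final List<String> titles = new ArrayList<>();
    private SegmentModel current;

    // Called once before the hits of each segment are delivered.
    void setNextReader(SegmentModel segment) {
        current = segment;
    }

    // localDoc is relative to the segment passed to the last setNextReader call.
    void collect(int localDoc) {
        titles.add(current.titles[localDoc]);       // direct access, no search needed
        globalDocs.add(current.docBase + localDoc); // global ID only for the caller
    }
}
```

A possible use of the model:

```java
final class TitleCollectorDemo {
    public static void main(String[] args) {
        SegmentModel first = new SegmentModel(0, "a", "b", "c");
        SegmentModel second = new SegmentModel(3, "d", "e");
        TitleCollector collector = new TitleCollector();
        collector.setNextReader(first);
        collector.collect(1);
        collector.setNextReader(second);
        collector.collect(0);
        collector.collect(1);
        System.out.println(collector.globalDocs + " " + collector.titles);
    }
}
```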

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de
