Posted to java-user@lucene.apache.org by mittals <so...@morganstanley.com> on 2009/02/02 12:54:02 UTC

How to extract Document object after the search?

As per the Lucene documentation:
"For good search performance, implementations of this method should not call
Searcher.doc(int) or IndexReader.document(int) on every document number
encountered. Doing so can slow searches by an order of magnitude or more."

My question is: what other way is there to get the Document object while
avoiding this performance bottleneck?

-- 
View this message in context: http://www.nabble.com/How-to-extract-Document-object-after-the-search--tp21788361p21788361.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: How to extract Document object after the search?

Posted by Ganesh <em...@yahoo.co.in>.

Searcher.doc(int) or IndexReader.document(int) will give you the Document
object, and to my knowledge this is the only way available. However, it is
not advisable to query all documents (MatchAllDocsQuery) and load every
Document object. When you do fetch documents to display results, load only
the required fields by using Searcher.doc(int, FieldSelector).

Regards
Ganesh



Re: How to extract Document object after the search?

Posted by Erick Erickson <er...@gmail.com>.

Here's a writeup I did a couple of years ago that might help...

http://wiki.apache.org/lucene-java/FieldSelectorPerformance?highlight=(fieldselector)

Best
Erick


Re: How to extract Document object after the search?

Posted by Ian Lea <ia...@gmail.com>.

> I have not seen much time difference between loading a single field and
> loading all the fields of a document.

That's fine - sometimes it helps, sometimes it doesn't.  Depends on
the structure of your documents, maybe your hardware, maybe more.  And
sometimes a small difference, over many documents, can be worth
having.

> After a search, Lucene caches the documents in memory. Is there any way
> to configure the number of documents cached in memory?

Umm.  No, I don't believe that Lucene does explicit document caching.
Your OS may well cache the data files, which can make a significant
difference.  See also all the recommendations elsewhere about sharing
and warming searchers.

> What is the benefit of using FieldSelectorResult.LOAD and
> FieldSelectorResult.LAZY_LOAD?

If you have a document with, say, 2 small fields and 100 large fields
and in some particular circumstance you only want the 2 small ones,
using a FieldSelector like SetBasedFieldSelector, as Uwe suggested,
can help by telling Lucene not to load the 100 large fields unless you
explicitly ask for them.  Which you won't in this scenario.
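
To make the idea concrete, here is a small Lucene-free model of eager
versus lazy field loading. It is only a sketch: LazyFieldsSketch, its
methods and the storage function are made up for illustration and are
not Lucene API. The storage function stands in for reading a stored
field from the index. Eager fields are read up front, lazy fields on
first access, and unselected fields are never read.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical model of a document whose fields are loaded selectively. */
class LazyFieldsSketch {
    private final Map<String, String> values = new HashMap<>();
    private final Set<String> lazyFields;
    // Assumption: stands in for an expensive read of one stored field.
    private final Function<String, String> storage;
    private int reads = 0;

    LazyFieldsSketch(Set<String> eagerFields, Set<String> lazyFields,
                     Function<String, String> storage) {
        this.lazyFields = lazyFields;
        this.storage = storage;
        // Eager fields are read immediately, like an unconditional load.
        for (String name : eagerFields) {
            values.put(name, read(name));
        }
    }

    private String read(String name) {
        reads++;
        return storage.apply(name);
    }

    /** Returns the field value, reading a lazy field on first access only. */
    String get(String name) {
        if (!values.containsKey(name) && lazyFields.contains(name)) {
            values.put(name, read(name));
        }
        return values.get(name); // null for fields that were never selected
    }

    int reads() {
        return reads;
    }
}

class LazyFieldsDemo {
    public static void main(String[] args) {
        Set<String> eager = new HashSet<>(Arrays.asList("id", "title"));
        Set<String> lazy = new HashSet<>(Arrays.asList("body"));
        LazyFieldsSketch doc =
                new LazyFieldsSketch(eager, lazy, name -> "<" + name + ">");

        System.out.println(doc.reads());           // prints 2
        System.out.println(doc.get("title"));      // prints <title>
        System.out.println(doc.get("body"));       // prints <body>
        System.out.println(doc.reads());           // prints 3
        System.out.println(doc.get("attachment")); // prints null
    }
}
```

In this model the large "attachment" field costs nothing, and "body"
costs a read only if something actually asks for it.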


If you google for something like "lucene lazy loading" you'll find
lots more info.


--
Ian.


RE: How to extract Document object after the search?

Posted by mittals <so...@morganstanley.com>.

Hi,

I have not seen much time difference between loading a single field and
loading all the fields of a document.

After a search, Lucene caches the documents in memory. Is there any way
to configure the number of documents cached in memory?

What is the benefit of using FieldSelectorResult.LOAD and
FieldSelectorResult.LAZY_LOAD?

Regards,
Sourabh



RE: How to extract Document object after the search?

Posted by Uwe Schindler <uw...@thetaphi.de>.

Hi,

You should generally not load all fields for all documents in the
HitCollector loop. If you really need them (because you want to do some
analysis on the whole result set after the search), you should do the
following:

- Retrieve only those document fields you really need (using a
FieldSelector like SetBasedFieldSelector).
- Do some buffering in the HitCollector: allocate an int array for the
collected doc ids with a size of, say, 16,000. For each collect() call, add
the document id to the array. When the array is full, and again at the end
of collecting, call a flush method. This method sorts the array by id
(because less seeking is needed when the ids are in increasing order) and
then calls document(id) for each entry in bulk. This is faster. In older
versions of Lucene the sorting may not be needed, but you really should do
it (the newer search API may not return documents in doc order).
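
The buffering scheme can be sketched without any Lucene dependency, as
below. This is only an illustration under stated assumptions:
BufferedDocIdCollector is a made-up name, and the IntConsumer loader
stands in for whatever actually fetches the document (for example a
document(id) call with a FieldSelector). The demo uses a capacity of 4
instead of 16,000 to keep the output short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.IntConsumer;

/** Hypothetical collector that buffers doc ids and loads them in sorted batches. */
class BufferedDocIdCollector {
    private final int[] buffer;
    // Assumption: stands in for the real document fetch.
    private final IntConsumer loader;
    private int size = 0;

    BufferedDocIdCollector(int capacity, IntConsumer loader) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.buffer = new int[capacity];
        this.loader = loader;
    }

    /** Called once per hit: records the id only, no document access. */
    void collect(int docId) {
        buffer[size++] = docId;
        if (size == buffer.length) {
            flush();
        }
    }

    /** Sorts the buffered ids and loads them in increasing order. */
    void flush() {
        Arrays.sort(buffer, 0, size);
        for (int i = 0; i < size; i++) {
            loader.accept(buffer[i]);
        }
        size = 0;
    }
}

class BufferedDocIdDemo {
    public static void main(String[] args) {
        List<Integer> loaded = new ArrayList<>();
        BufferedDocIdCollector collector =
                new BufferedDocIdCollector(4, loaded::add);
        for (int id : new int[] {42, 7, 19, 3, 88, 5}) {
            collector.collect(id);
        }
        collector.flush(); // flush the partially filled buffer at the end
        System.out.println(loaded); // prints [3, 7, 19, 42, 5, 88]
    }
}
```

Note that ids are sorted within each batch, not across batches, so a
larger buffer means fewer backward seeks.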

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de



Re: How to extract Document object after the search?

Posted by Ian Lea <ia...@gmail.com>.

Hi

That quote is from the javadoc for
HitCollector/TopDocCollector.collect().  You missed out the bit saying
"This is called in an inner search loop".

If, as your subject implies, you want to get at the Document object
AFTER the search, those methods are fine.  Just don't use them for any
more documents than you need, and not inside the collect() method.
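
As a rough, Lucene-free illustration of that split: the inner loop
records only doc ids and scores, keeps the best few, and documents are
fetched afterwards for just those hits. TopHitsSketch and its methods
are made-up names for this sketch, and the println stands in for the
real document fetch. In practice TopDocCollector already does this kind
of bookkeeping for you.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical collector that keeps the best `limit` hits and never touches documents. */
class TopHitsSketch {
    /** A hit as seen in the inner loop: an id and a score, nothing more. */
    static final class Hit {
        final int docId;
        final float score;

        Hit(int docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    private static final Comparator<Hit> BY_SCORE =
            Comparator.comparingDouble((Hit h) -> h.score);

    private final int limit;
    // Min-heap on score, so the weakest retained hit is evicted first.
    private final PriorityQueue<Hit> heap = new PriorityQueue<>(BY_SCORE);

    TopHitsSketch(int limit) {
        this.limit = limit;
    }

    /** Cheap per-hit work only. */
    void collect(int docId, float score) {
        heap.add(new Hit(docId, score));
        if (heap.size() > limit) {
            heap.poll();
        }
    }

    /** After the search: ids of the retained hits, best score first. */
    List<Integer> topDocIds() {
        List<Hit> hits = new ArrayList<>(heap);
        hits.sort(BY_SCORE.reversed());
        List<Integer> ids = new ArrayList<>();
        for (Hit hit : hits) {
            ids.add(hit.docId);
        }
        return ids;
    }
}

class TopHitsDemo {
    public static void main(String[] args) {
        TopHitsSketch top = new TopHitsSketch(2);
        float[] scores = {0.2f, 0.9f, 0.5f, 0.7f};
        for (int docId = 0; docId < scores.length; docId++) {
            top.collect(docId, scores[docId]);
        }
        // Only now, after the search, would documents be fetched.
        for (int docId : top.topDocIds()) {
            System.out.println("load document " + docId);
        }
    }
}
```

Four hits are collected here, but only two documents would ever be
loaded.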

If you really need to get at document data inside the inner search
loop I think you'll have to accept the performance hit or look into
advanced stuff like payloads.

--
Ian.
