Posted to java-user@lucene.apache.org by Kirk Roberts <ki...@languagecomputer.com> on 2007/04/17 20:48:43 UTC

MultiSearcher vs MultiReader

I've been on this list long enough to have a vast repository of 
information about using a MultiSearcher versus an IndexSearcher that 
works on a MultiReader.  However, after looking through several hundred 
list postings, I could not find what I was looking for.  So if there is 
a posting or thread (or website, for that matter) that answers all my 
questions, please direct me there and, if so, I apologize in advance.

From what I have read, people seem to suggest using an IndexSearcher 
initialized with a MultiReader over the use of a MultiSearcher.  This 
begs 2 questions:
1. Under what conditions is a MultiSearcher necessary/optimal?
2. Why doesn't the MultiReader implement the rather nice methods that 
the MultiSearcher has (I'm thinking specifically of subSearcher(int) and 
subDoc(int))?

Right now my application uses the MultiSearcher because I need to know 
the original (single) index the document came from as well as the Lucene 
document ID within that index (so I don't have to hold on to the entire
Lucene Document).  I can calculate both if I get the size of all the 
individual readers, and assume that the ID a MultiReader returns equals 
the original IndexReader's given ID plus the size of every IndexReader 
that precedes it (e.g., if I have 3 indexes each of size 10, the Document 
that the MultiReader gives an ID of 23 to will be the Document with ID 3 in 
the 3rd IndexReader, since IDs start at 0).  This might be safe (and indeed might be exactly 
what the MultiSearcher does), but I haven't dug into the Lucene code 
much to find out.
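
For illustration, the arithmetic described above can be sketched in plain Java with no Lucene dependency. The class and method names are hypothetical. The sketch assumes doc IDs are zero-based and that each size is the sub-reader's maxDoc(), which also counts deleted documents, rather than numDocs(). With three indexes of size 10, a global ID of 23 maps to the third reader (index 2) and local ID 3.

```java
// Sketch only: maps a MultiReader-style global doc ID to (reader index, local doc ID).
// Assumption: sizes[i] is the maxDoc() of the i-th sub-reader, in MultiReader order.
final class DocIdOffsets {

    /** Returns {readerIndex, localDocId} for a zero-based global doc ID. */
    static int[] resolve(int[] sizes, int globalDocId) {
        if (globalDocId < 0) {
            throw new IllegalArgumentException("negative doc ID: " + globalDocId);
        }
        int remaining = globalDocId;
        for (int i = 0; i < sizes.length; i++) {
            if (remaining < sizes[i]) {
                // The document lives in reader i; remaining is its ID within that reader.
                return new int[] { i, remaining };
            }
            // Skip past every document held by reader i.
            remaining -= sizes[i];
        }
        throw new IllegalArgumentException("doc ID out of range: " + globalDocId);
    }

    public static void main(String[] args) {
        int[] sizes = { 10, 10, 10 };
        int[] hit = resolve(sizes, 23);
        System.out.println("reader " + hit[0] + ", local doc " + hit[1]);
        // prints "reader 2, local doc 3"
    }
}
```

The linear scan is fine for a handful of readers; the comparison must be a strict "less than" so that a global ID equal to a reader's size falls into the next reader.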

Currently I'm leaning toward switching to a MultiReader, as performance is 
more important than having to write some extra code.  I'm just a little 
confused as to 1) why the MultiSearcher exists (as I can't seem to find 
decent documentation) and 2) how safe my above algorithm is (or if I've 
just completely missed existing functionality that does this).

Links to any detailed documentation would be greatly appreciated,

Kirk

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: MultiSearcher vs MultiReader

Posted by Grant Ingersoll <gs...@apache.org>.

You are correct, sir!  I failed Lucene History 101 :-) And I failed  
my fundamental rule for discussing a search library, which is to do a  
search first to see if the answer already exists!

At any rate, here's some history on it:
http://www.gossamer-threads.com/lists/lucene/java-dev/22104

Nothing like a trip down memory lane.  Sorry for the diversion.


On Apr 18, 2007, at 6:41 PM, Yonik Seeley wrote:

> On 4/18/07, Grant Ingersoll <gs...@apache.org> wrote:
>> At any rate, MultiSearcher has been around a lot longer (2001 versus
>> 2004, or at least that is what the changelog seems to indicate)
>
> That doesn't sound right... MultiReader is more fundamental as it's needed
> to read any multi-segment index.
>
> Looking back at the history of IndexReader, it looks like there was a
> class called SegmentsReader (note the plural).  I bet it was moved
> to MultiReader and that is why the history is truncated.
>
> -Yonik

--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ




Re: MultiSearcher vs MultiReader

Posted by Yonik Seeley <yo...@apache.org>.

On 4/18/07, Grant Ingersoll <gs...@apache.org> wrote:
> At any rate, MultiSearcher has been around a lot longer (2001 versus
> 2004, or at least that is what the changelog seems to indicate)

That doesn't sound right... MultiReader is more fundamental as it's needed
to read any multi-segment index.

Looking back at the history of IndexReader, it looks like there was a
class called SegmentsReader (note the plural).  I bet it was moved
to MultiReader and that is why the history is truncated.

-Yonik


Re: MultiSearcher vs MultiReader

Posted by Grant Ingersoll <gs...@apache.org>.

On Apr 20, 2007, at 3:08 PM, Kirk Roberts wrote:

> Grant Ingersoll wrote:
>> I will try to take a crack at these, but not sure I know exactly  
>> what you are looking for, so maybe others can chime in too.
>> At any rate, MultiSearcher has been around a lot longer (2001
>> versus 2004, or at least that is what the changelog seems to
>> indicate) and it works over Searchables, including
>> RemoteSearchable.  So you could use it to combine results from
>> remote searches as well.  MultiReader can only work over
>> IndexReaders, and I am not aware of any way that it can do remote
>> index reading, so there are different viable use cases for the two.
>
>
> Really I just want to know the fastest mechanism for searching.
> Since I don't use a RemoteSearchable, it sounds like using an
> IndexSearcher on a MultiReader is the way to go.
>
>>> 2. Why doesn't the MultiReader implement the rather nice methods  
>>> that the MultiSearcher has (I'm thinking specifically of  
>>> subSearcher(int) and subDoc(int))?
>> I suppose subDoc might make sense, but subSearcher does not for a  
>> Reader.  Perhaps the private readerIndex() method on MultiReader  
>> is something you are interested in?  Is that getting at what you  
>> want?  Maybe you can submit a patch that makes readerIndex public  
>> if that is what you are interested in?
>
> Obviously the methods would have to be appropriately named :).  It  
> sounds like some development work will have to be done on this  
> then.  I have no problem doing it myself and submitting a patch (I  
> can pick up this discussion on the developer list when I have  
> time), but for now is it safe to assume that if I have the number  
> of documents per IndexReader and the order of the readers that I  
> can calculate the "real" IndexReader and the "real" docid for that  
> sub-IndexReader?  I realize I might not be very clear, so let's see
> if I can restate my example more clearly in pseudo-code (apologies
> in advance):

Doesn't MultiReader do this already in the readerIndex() method?  It
has to figure out which IndexReader holds the document in order to
retrieve it in the first place.  This is done in readerIndex(int),
unless I am still not understanding what you mean :-).  Your code
below looks a lot like what is in readerIndex() though, right?
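
As a rough illustration of that kind of lookup, here is a sketch in plain Java. It is not copied from the Lucene source, and the class and field names are hypothetical. It assumes the start offset of each sub-reader is precomputed from its maxDoc(), and that the containing reader is found by binary search over those offsets.

```java
// Sketch only: a lookup in the spirit of the readerIndex(int) method discussed above.
// Assumption: maxDocs[i] is the maxDoc() of the i-th sub-reader, in MultiReader order.
final class ReaderStarts {
    // starts[i] is the global ID of the first document in reader i;
    // the extra last entry holds the total number of document slots.
    private final int[] starts;

    ReaderStarts(int[] maxDocs) {
        starts = new int[maxDocs.length + 1];
        for (int i = 0; i < maxDocs.length; i++) {
            starts[i + 1] = starts[i] + maxDocs[i];
        }
    }

    /** Index of the sub-reader that contains the given global doc ID. */
    int readerIndex(int globalDocId) {
        if (globalDocId < 0 || globalDocId >= starts[starts.length - 1]) {
            throw new IllegalArgumentException("doc ID out of range: " + globalDocId);
        }
        int lo = 0;
        int hi = starts.length - 2; // index of the last real reader
        while (lo < hi) {
            int mid = (lo + hi + 1) >>> 1;
            if (starts[mid] <= globalDocId) {
                lo = mid;     // reader mid starts at or before the ID
            } else {
                hi = mid - 1; // reader mid starts after the ID
            }
        }
        return lo; // the last reader whose start is <= globalDocId
    }

    /** Doc ID relative to the sub-reader that contains it. */
    int localDocId(int globalDocId) {
        return globalDocId - starts[readerIndex(globalDocId)];
    }

    public static void main(String[] args) {
        ReaderStarts rs = new ReaderStarts(new int[] { 100, 50, 75 });
        System.out.println("reader " + rs.readerIndex(120) + ", local doc " + rs.localDocId(120));
        // prints "reader 1, local doc 20"
    }
}
```

Searching for the last reader whose start is less than or equal to the ID means an empty sub-reader is skipped automatically, since the reader after it shares the same start offset.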

>
> IndexReader r1 (size = 100 documents)
> IndexReader r2 (size = 50 documents)
> IndexReader r3 (size = 75 documents)
>
> IndexReader[] readers = new IndexReader[] { r1, r2, r3 };
> MultiReader mr = new MultiReader(readers);
>
> // get docid in MultiReader
> int docid = magicFindDocumentFunc(mr);
>
> for (IndexReader r : readers) {
>   // doc IDs are zero-based, and maxDoc() (unlike numDocs())
>   // also counts deleted documents
>   if (docid >= r.maxDoc()) {
>     docid -= r.maxDoc();
>   }
>   else {
>     // r is the IndexReader that holds the desired Document
>     // docid's current value is the Lucene ID of that Document within r
>     break;
>   }
> }
>
> I know I can get the Document straight from the MultiReader but in  
> my case I need to know which exact IndexReader object the Document  
> is really coming from.
>
> Thanks in advance for any help,
> Kirk





Re: MultiSearcher vs MultiReader

Posted by Kirk Roberts <ki...@languagecomputer.com>.

Grant Ingersoll wrote:
> I will try to take a crack at these, but not sure I know exactly what 
> you are looking for, so maybe others can chime in too.
> 
> At any rate, MultiSearcher has been around a lot longer (2001 versus 
> 2004, or at least that is what the changelog seems to indicate) and it 
> works over Searchables, including RemoteSearchable.  So you could use it 
> to combine results from remote searches as well.  MultiReader can only 
> work over IndexReaders, and I am not aware of any way that it can do 
> remote index reading, so there are different viable use cases for the two.

Really I just want to know the fastest mechanism for searching.  Since I 
don't use a RemoteSearchable, it sounds like using an IndexSearcher on a 
MultiReader is the way to go.

>> 2. Why doesn't the MultiReader implement the rather nice methods that 
>> the MultiSearcher has (I'm thinking specifically of subSearcher(int) 
>> and subDoc(int))?
> 
> I suppose subDoc might make sense, but subSearcher does not for a 
> Reader.  Perhaps the private readerIndex() method on MultiReader is 
> something you are interested in?  Is that getting at what you want?  
> Maybe you can submit a patch that makes readerIndex public if that is 
> what you are interested in?

Obviously the methods would have to be appropriately named :).  It 
sounds like some development work will have to be done on this then.  I 
have no problem doing it myself and submitting a patch (I can pick up 
this discussion on the developer list when I have time), but for now is 
it safe to assume that if I have the number of documents per IndexReader 
and the order of the readers that I can calculate the "real" IndexReader 
and the "real" docid for that sub-IndexReader?  I realize I might not be 
very clear, so let's see if I can restate my example more clearly in 
pseudo-code (apologies in advance):

IndexReader r1 (size = 100 documents)
IndexReader r2 (size = 50 documents)
IndexReader r3 (size = 75 documents)

IndexReader[] readers = new IndexReader[] { r1, r2, r3 };
MultiReader mr = new MultiReader(readers);

// get docid in MultiReader
int docid = magicFindDocumentFunc(mr);

for (IndexReader r : readers) {
   // doc IDs are zero-based, and maxDoc() (unlike numDocs())
   // also counts deleted documents
   if (docid >= r.maxDoc()) {
     docid -= r.maxDoc();
   }
   else {
     // r is the IndexReader that holds the desired Document
     // docid's current value is the Lucene ID of that Document within r
     break;
   }
}

I know I can get the Document straight from the MultiReader but in my 
case I need to know which exact IndexReader object the Document is 
really coming from.

Thanks in advance for any help,
Kirk


Re: MultiSearcher vs MultiReader

Posted by Grant Ingersoll <gs...@apache.org>.

I will try to take a crack at these, but not sure I know exactly what  
you are looking for, so maybe others can chime in too.

At any rate, MultiSearcher has been around a lot longer (2001 versus
2004, or at least that is what the changelog seems to indicate) and
it works over Searchables, including RemoteSearchable.  So you could
use it to combine results from remote searches as well.  MultiReader
can only work over IndexReaders, and I am not aware of any way that it
can do remote index reading, so there are different viable use cases
for the two.

On Apr 17, 2007, at 2:48 PM, Kirk Roberts wrote:

> I've been on this list long enough to have a vast repository of  
> information about using a MultiSearcher versus an IndexSearcher  
> that works on a MultiReader.  However, after looking through  
> several hundred list postings, I could not find what I was looking  
> for.  So if there is a posting or thread (or website, for that  
> matter) that answers all my questions, please direct me there and,  
> if so, I apologize in advance.
>
> From what I have read, people seem to suggest using an  
> IndexSearcher initialized with a MultiReader over the use of a  
> MultiSearcher.  This begs 2 questions:
> 1. Under what conditions is a MultiSearcher necessary/optimal?
> 2. Why doesn't the MultiReader implement the rather nice methods
> that the MultiSearcher has (I'm thinking specifically of
> subSearcher(int) and subDoc(int))?
>

I suppose subDoc might make sense, but subSearcher does not for a  
Reader.  Perhaps the private readerIndex() method on MultiReader is  
something you are interested in?  Is that getting at what you want?   
Maybe you can submit a patch that makes readerIndex public if that is  
what you are interested in?

Hope this helps,
Grant
