Posted to java-user@lucene.apache.org by Kirk Roberts <ki...@languagecomputer.com> on 2007/04/17 20:48:43 UTC
MultiSearcher vs MultiReader
I've been on this list long enough to have a vast repository of
information about using a MultiSearcher versus an IndexSearcher that
works on a MultiReader. However, after looking through several hundred
list postings, I could not find what I was looking for. So if there is
a posting or thread (or website, for that matter) that answers all my
questions, please direct me there and, if so, I apologize in advance.
From what I have read, people seem to suggest using an IndexSearcher
initialized with a MultiReader over the use of a MultiSearcher. This
raises two questions:
1. Under what conditions is a MultiSearcher necessary/optimal?
2. Why doesn't the MultiReader implement the rather nice methods that
the MultiSearcher has (I'm thinking specifically of subSearcher(int) and
subDoc(int))?
Right now my application uses the MultiSearcher because I need to know
the original (single) index the document came from as well as the Lucene
document ID within that index (so I don't have to hold on to the entire
Lucene Document). I can calculate both if I get the size of all the
individual readers, and assume that the ID a MultiReader returns equals
the original IndexReader's given ID plus the size of every IndexReader
that precedes it (e.g., if I have 3 indexes each of size 10, the Document
that the MultiReader gives an ID of 23 to will be the Document with ID 3 in
the 3rd IndexReader). This might be safe (and indeed might be exactly
what the MultiSearcher does), but I haven't dug into the Lucene code
much to find out.
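The arithmetic described above can be checked in isolation. The following is a minimal sketch, assuming doc IDs are 0-based and that each sub-reader occupies a contiguous block of IDs as large as its maxDoc(). The class and method names (DocIdOffsets, toGlobal) are hypothetical and not part of Lucene, and the sizes array merely stands in for the values the real readers would report.

```java
/** Hypothetical helper, not part of Lucene: composes a combined doc ID from sub-reader sizes. */
final class DocIdOffsets {
    /**
     * Returns the combined ID of document localId in the sub-reader at readerIndex.
     * sizes[i] stands in for what readers[i].maxDoc() would report (an assumption).
     */
    static int toGlobal(int[] sizes, int readerIndex, int localId) {
        if (localId < 0 || localId >= sizes[readerIndex]) {
            throw new IllegalArgumentException("localId out of range: " + localId);
        }
        int offset = 0;
        for (int i = 0; i < readerIndex; i++) {
            offset += sizes[i]; // every preceding reader shifts the ID by its size
        }
        return offset + localId;
    }

    public static void main(String[] args) {
        int[] sizes = {10, 10, 10};
        // Third reader (index 2), local ID 3: 10 + 10 + 3
        System.out.println(toGlobal(sizes, 2, 3)); // prints 23
    }
}
```

Because IDs start at 0, the combined ID 23 corresponds to local ID 3, that is, the fourth document of the third reader.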
Currently I'm leaning toward switching to a MultiReader, as performance is
more important than having to write some extra code. I'm just a little
confused as to 1) why the MultiSearcher exists (as I can't seem to find
decent documentation) and 2) how safe my above algorithm is (or if I've
just completely missed existing functionality that does this).
Links to any detailed documentation would be greatly appreciated,
Kirk
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: MultiSearcher vs MultiReader
Posted by Grant Ingersoll <gs...@apache.org>.
You are correct, sir! I failed Lucene History 101 :-) And I failed
my fundamental rule for discussing a search library, which is to do a
search first to see if the answer already exists!
At any rate, here's some history on it:
http://www.gossamer-threads.com/lists/lucene/java-dev/22104
Nothing like a trip down memory lane. Sorry for the diversion.
On Apr 18, 2007, at 6:41 PM, Yonik Seeley wrote:
> On 4/18/07, Grant Ingersoll <gs...@apache.org> wrote:
>> At any rate, MultiSearcher has been around a lot longer (2001 versus
>> 2004, or at least that is what the changelog seems to indicate)
>
> That doesn't sound right... MultiReader is more fundamental as it's
> needed to read any multi-segment index.
>
> Looking back at the history of IndexReader, it looks like there was a
> class called SegmentsReader (note the plural). I bet it was moved
> to MultiReader and that is why the history is truncated.
>
> -Yonik
>
--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org
Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ
Re: MultiSearcher vs MultiReader
Posted by Yonik Seeley <yo...@apache.org>.
On 4/18/07, Grant Ingersoll <gs...@apache.org> wrote:
> At any rate, MultiSearcher has been around a lot longer (2001 versus
> 2004, or at least that is what the changelog seems to indicate)
That doesn't sound right... MultiReader is more fundamental as it's needed
to read any multi-segment index.
Looking back at the history of IndexReader, it looks like there was a
class called SegmentsReader (note the plural). I bet it was moved
to MultiReader and that is why the history is truncated.
-Yonik
Re: MultiSearcher vs MultiReader
Posted by Grant Ingersoll <gs...@apache.org>.
On Apr 20, 2007, at 3:08 PM, Kirk Roberts wrote:
> Grant Ingersoll wrote:
>> I will try to take a crack at these, but not sure I know exactly
>> what you are looking for, so maybe others can chime in too.
>> At any rate, MultiSearcher has been around a lot longer (2001
>> versus 2004, or at least that is what the changelog seems to
>> indicate) and it works over Searchables, including
>> RemoteSearchable. So you could use it to combine results from
>> remote searches as well. MultiReader can only work over
>> IndexReaders and I am not aware of any way that it can do remote
>> index reading, so there are different viable use cases for the two.
>
>
> Really I just want to know the fastest mechanism for searching.
> Since I don't use a RemoteSearchable, it sounds like using an
> IndexSearcher on a MultiReader is the way to go.
>
>>> 2. Why doesn't the MultiReader implement the rather nice methods
>>> that the MultiSearcher has (I'm thinking specifically of
>>> subSearcher(int) and subDoc(int))?
>> I suppose subDoc might make sense, but subSearcher does not for a
>> Reader. Perhaps the private readerIndex() method on MultiReader
>> is something you are interested in? Is that getting at what you
>> want? Maybe you can submit a patch that makes readerIndex public
>> if that is what you are interested in?
>
> Obviously the methods would have to be appropriately named :). It
> sounds like some development work will have to be done on this
> then. I have no problem doing it myself and submitting a patch (I
> can pick up this discussion on the developer list when I have
> time), but for now is it safe to assume that if I have the number
> of documents per IndexReader and the order of the readers that I
> can calculate the "real" IndexReader and the "real" docid for that
> sub-IndexReader? I realize I might not be very clear, so let's see
> if I can re-state my example more clearly in pseudo-code (apologies
> in advance):
Doesn't MultiReader do this already in the readerIndex() method? It
has to figure out which IndexReader the document is in in order to
retrieve it in the first place. This is done in readerIndex(int).
Unless I still am not understanding what you mean :-). Your code
below looks a lot like what is in readerIndex() though, right?
>
> IndexReader r1 (size = 100 documents)
> IndexReader r2 (size = 50 documents)
> IndexReader r3 (size = 75 documents)
>
> IndexReader[] readers = new IndexReader[] { r1, r2, r3 };
> MultiReader mr = new MultiReader(readers);
>
> // get docid in MultiReader
> int docid = magicFindDocumentFunc(mr);
>
> for (IndexReader r : readers) {
>     // maxDoc() rather than numDocs(): deleted documents still occupy doc IDs
>     if (docid >= r.maxDoc()) {
>         docid -= r.maxDoc();
>     }
>     else {
>         // r is the IndexReader that holds the desired Document;
>         // docid's current value is the Lucene ID of that Document within r
>         break;
>     }
> }
>
> I know I can get the Document straight from the MultiReader but in
> my case I need to know which exact IndexReader object the Document
> is really coming from.
>
> Thanks in advance for any help,
> Kirk
>
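The lookup discussed above, finding which sub-reader holds a combined doc ID, can also be done with a binary search over the readers' starting offsets instead of a linear scan. What follows is a minimal sketch of that idea. It is not the MultiReader.readerIndex() source, the class ReaderLocator and its methods are hypothetical, and the sizes array stands in for the maxDoc() values of the real readers.

```java
/** Hypothetical sketch, not Lucene code: locates the sub-reader for a combined doc ID. */
final class ReaderLocator {
    // starts[i] is the first combined ID of reader i; the last entry is the total size.
    private final int[] starts;

    ReaderLocator(int[] sizes) {
        starts = new int[sizes.length + 1];
        for (int i = 0; i < sizes.length; i++) {
            starts[i + 1] = starts[i] + sizes[i];
        }
    }

    /** Index of the sub-reader that holds the combined doc ID. */
    int readerIndex(int docId) {
        if (docId < 0 || docId >= starts[starts.length - 1]) {
            throw new IndexOutOfBoundsException("docId " + docId);
        }
        int lo = 0;
        int hi = starts.length - 2; // index of the last real reader
        while (lo < hi) {
            int mid = (lo + hi + 1) >>> 1;
            if (starts[mid] <= docId) {
                lo = mid;     // reader mid starts at or before docId
            } else {
                hi = mid - 1; // reader mid starts after docId
            }
        }
        return lo; // last reader whose start is <= docId, so empty readers are skipped
    }

    /** Doc ID relative to the sub-reader that holds it. */
    int localId(int docId) {
        return docId - starts[readerIndex(docId)];
    }
}
```

With readers of size 100, 50 and 75, new ReaderLocator(new int[] {100, 50, 75}) maps combined ID 100 to reader index 1 and local ID 0. Precomputing the offsets once keeps each lookup logarithmic in the number of readers, which only matters when there are many of them.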
--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org
Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ
Re: MultiSearcher vs MultiReader
Posted by Kirk Roberts <ki...@languagecomputer.com>.
Grant Ingersoll wrote:
> I will try to take a crack at these, but not sure I know exactly what
> you are looking for, so maybe others can chime in too.
>
> At any rate, MultiSearcher has been around a lot longer (2001 versus
> 2004, or at least that is what the changelog seems to indicate) and it
> works over Searchables, including RemoteSearchable. So you could use it
> to combine results from remote searches as well. MultiReader can only
> work over IndexReaders and I am not aware of any way that it can do
> remote index reading, so there are different viable use cases for the two.
Really I just want to know the fastest mechanism for searching. Since I
don't use a RemoteSearchable, it sounds like using an IndexSearcher on a
MultiReader is the way to go.
>> 2. Why doesn't the MultiReader implement the rather nice methods that
>> the MultiSearcher has (I'm thinking specifically of subSearcher(int)
>> and subDoc(int))?
>
> I suppose subDoc might make sense, but subSearcher does not for a
> Reader. Perhaps the private readerIndex() method on MultiReader is
> something you are interested in? Is that getting at what you want?
> Maybe you can submit a patch that makes readerIndex public if that is
> what you are interested in?
Obviously the methods would have to be appropriately named :). It
sounds like some development work will have to be done on this then. I
have no problem doing it myself and submitting a patch (I can pick up
this discussion on the developer list when I have time), but for now is
it safe to assume that if I have the number of documents per IndexReader
and the order of the readers that I can calculate the "real" IndexReader
and the "real" docid for that sub-IndexReader? I realize I might not be
very clear, so let's see if I can re-state my example more clearly in
pseudo-code (apologies in advance):
IndexReader r1 (size = 100 documents)
IndexReader r2 (size = 50 documents)
IndexReader r3 (size = 75 documents)
IndexReader[] readers = new IndexReader[] { r1, r2, r3 };
MultiReader mr = new MultiReader(readers);
// get docid in MultiReader
int docid = magicFindDocumentFunc(mr);
for (IndexReader r : readers) {
    // maxDoc() rather than numDocs(): deleted documents still occupy doc IDs
    if (docid >= r.maxDoc()) {
        docid -= r.maxDoc();
    }
    else {
        // r is the IndexReader that holds the desired Document;
        // docid's current value is the Lucene ID of that Document within r
        break;
    }
}
I know I can get the Document straight from the MultiReader but in my
case I need to know which exact IndexReader object the Document is
really coming from.
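Two details matter when turning this loop into real code. The comparison has to be >= rather than >, because local IDs run from 0 to size - 1, and the sizes should come from maxDoc() rather than numDocs(), because deleted documents keep their IDs until they are merged away. Below is a stand-alone sketch of the loop under those assumptions. SubDocResolver and resolve are hypothetical names, and the plain int array stands in for the maxDoc() values of the real readers so that the example runs without Lucene.

```java
/** Hypothetical stand-alone version of the loop; sizes[i] stands in for readers[i].maxDoc(). */
final class SubDocResolver {
    /** Returns {readerIndex, localDocId} for a combined doc ID. */
    static int[] resolve(int[] sizes, int docId) {
        if (docId < 0) {
            throw new IllegalArgumentException("negative docId: " + docId);
        }
        int remaining = docId;
        for (int i = 0; i < sizes.length; i++) {
            if (remaining >= sizes[i]) {
                remaining -= sizes[i]; // the document lies beyond reader i
            } else {
                return new int[] {i, remaining}; // found: stop scanning
            }
        }
        throw new IllegalArgumentException("docId beyond the last reader: " + docId);
    }

    public static void main(String[] args) {
        int[] hit = resolve(new int[] {100, 50, 75}, 100);
        // Combined ID 100 is the first document of the second reader.
        System.out.println(hit[0] + " " + hit[1]); // prints "1 0"
    }
}
```

The boundary case is the one the > comparison gets wrong: combined ID 100 would be left in the first reader with local ID 100, which is one past its last valid ID.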
Thanks in advance for any help,
Kirk
Re: MultiSearcher vs MultiReader
Posted by Grant Ingersoll <gs...@apache.org>.
I will try to take a crack at these, but not sure I know exactly what
you are looking for, so maybe others can chime in too.
At any rate, MultiSearcher has been around a lot longer (2001 versus
2004, or at least that is what the changelog seems to indicate) and
it works over Searchables, including RemoteSearchable. So you could
use it to combine results from remote searches as well. MultiReader
can only work over IndexReaders and I am not aware of any way that it
can do remote index reading, so there are different viable use cases
for the two.
On Apr 17, 2007, at 2:48 PM, Kirk Roberts wrote:
> I've been on this list long enough to have a vast repository of
> information about using a MultiSearcher versus an IndexSearcher
> that works on a MultiReader. However, after looking through
> several hundred list postings, I could not find what I was looking
> for. So if there is a posting or thread (or website, for that
> matter) that answers all my questions, please direct me there and,
> if so, I apologize in advance.
>
> From what I have read, people seem to suggest using an
> IndexSearcher initialized with a MultiReader over the use of a
> MultiSearcher. This raises two questions:
> 1. Under what conditions is a MultiSearcher necessary/optimal?
> 2. Why doesn't the MultiReader implement the rather nice methods
> that the MultiSearcher has (I'm thinking specifically of
> subSearcher(int) and subDoc(int))?
>
I suppose subDoc might make sense, but subSearcher does not for a
Reader. Perhaps the private readerIndex() method on MultiReader is
something you are interested in? Is that getting at what you want?
Maybe you can submit a patch that makes readerIndex public if that is
what you are interested in?
Hope this helps,
Grant