Posted to java-user@lucene.apache.org by Alexander Devine <al...@gmail.com> on 2011/10/25 21:24:42 UTC

Possible to do an indexorder sort over a MultiSearcher?

Hi all,

I'm trying to provide a way to efficiently allow a client to page over
all of the documents in multiple Lucene indexes that I'm querying with a
MultiSearcher (~1-2 million docs). Unfortunately, I can't use the standard
paging algorithm of getting TopDocs to the last record needed and then
skipping all of the preceding pages because the queries get extremely slow
and memory usage becomes prohibitive as the client requests higher and
higher page numbers.

Thus, my workaround for this was to run a search using an indexorder sort
(that is, sort by document ID), and then the client could page over the
results by running a query that says "get me all the documents where the doc
ID is greater than the last doc ID of the previous page". This way the
client only ever asks for a TopDocs the size of a single page, but the
client can still run forward to eventually get all the documents in the
index.
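
A minimal sketch of the paging scheme described above, in plain Java with no Lucene dependency. The sorted array stands in for the hits an index-order sort would return, and nextPage is a hypothetical stand-in for running the query with the "doc ID greater than the last one" restriction; neither is Lucene API.

```java
import java.util.ArrayList;
import java.util.List;

class DocIdPaging {
    // Hypothetical stand-in for "run the query restricted to doc IDs greater
    // than lastDocId, sorted in index order, and ask for pageSize hits".
    // matchingDocIds must be sorted ascending, as an index-order sort returns them.
    static List<Integer> nextPage(int[] matchingDocIds, int lastDocId, int pageSize) {
        List<Integer> page = new ArrayList<>();
        for (int docId : matchingDocIds) {
            if (docId > lastDocId) {
                page.add(docId);
                if (page.size() == pageSize) {
                    break; // never hold more than one page of hits
                }
            }
        }
        return page;
    }

    public static void main(String[] args) {
        int[] matches = {0, 2, 3, 7, 8, 11, 15};
        int last = -1; // doc IDs start at 0, so -1 means "from the beginning"
        List<Integer> page;
        while (!(page = nextPage(matches, last, 3)).isEmpty()) {
            System.out.println(page);
            last = page.get(page.size() - 1); // cursor the client sends back
        }
    }
}
```

The client only has to remember the last doc ID it saw, so the cost of a request does not grow with the page number.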

While this works when searching over a single IndexReader, it fails when
using a MultiSearcher for 2 reasons:
1. Sorting by docId doesn't really work in a MultiSearcher because of the
way the searcher munges the IDs. For example, if there are 2 indexes each
with 3 docs #1, #2 and #3, the MultiSearcher will return results that look
like "1,4,2,5,3,6".
2. The "MinimumDocIdQuery" I wrote only works when you pass it the ORIGINAL
doc ID that is local to the index reader, not the one that was munged by
the MultiSearcher.
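
To illustrate the ID munging in point 1, here is a rough model of how a searcher over several indexes could assign combined doc IDs by offsetting each sub-index, and how a combined ID could be mapped back to a local one. This is a sketch in plain Java under that assumption, not Lucene's code; the class and method names are made up, and it uses 0-based IDs where the example above counts from 1.

```java
class DocIdMapping {
    // starts[i] is the first combined doc ID of sub-index i;
    // the final entry is the total number of documents.
    static int[] starts(int[] maxDocs) {
        int[] starts = new int[maxDocs.length + 1];
        for (int i = 0; i < maxDocs.length; i++) {
            starts[i + 1] = starts[i] + maxDocs[i];
        }
        return starts;
    }

    // Which sub-index a combined doc ID belongs to (linear scan for clarity).
    static int subIndex(int[] starts, int combinedDoc) {
        if (combinedDoc < 0) {
            throw new IllegalArgumentException("negative doc ID: " + combinedDoc);
        }
        for (int i = 0; i < starts.length - 1; i++) {
            if (combinedDoc < starts[i + 1]) {
                return i;
            }
        }
        throw new IllegalArgumentException("doc ID out of range: " + combinedDoc);
    }

    // The doc ID as seen by the sub-index itself.
    static int localDoc(int[] starts, int combinedDoc) {
        return combinedDoc - starts[subIndex(starts, combinedDoc)];
    }

    public static void main(String[] args) {
        int[] starts = starts(new int[] {3, 3}); // two indexes, three docs each
        for (int doc = 0; doc < 6; doc++) {
            System.out.println("combined " + doc + " -> index "
                    + subIndex(starts, doc) + ", local " + localDoc(starts, doc));
        }
    }
}
```

With a mapping like this, a page cursor would need to carry both the sub-index and the local ID, since local IDs repeat across indexes.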

Does anyone have any advice to work around this? I was thinking if I could
somehow get the "local" document ID back from the MultiSearcher that would
work, as I could return that with my search results (and sorted by that ID
things would look good, e.g. "1, 1, 2, 2, 3, 3"). If anyone has some advice
on how to better solve my original problem, that is, being able to run over
all of the documents in a potentially very large index using time- and
memory-efficient paging, that would also be appreciated.

Thanks,
Alex

Re: Possible to do an indexorder sort over a MultiSearcher?

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi,

Additionally, since the latest 3.x version (not sure if it's already in 3.4), IndexSearcher has a new searchAfter method that allows deep paging. MultiSearcher is deprecated, so searchAfter is not supported there; use a MultiReader with an IndexSearcher instead.
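
To show the idea behind this kind of deep paging, here is a rough model in plain Java. It is not Lucene's implementation and does not call Lucene; Hit, BEST_FIRST and this searchAfter are hypothetical names for illustration. The assumption is that hits are ranked by score with ties broken by doc ID, and that the last hit of the previous page serves as the cursor.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class SearchAfterSketch {
    // Hypothetical hit type: a doc ID plus its score.
    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    // Best first: higher score wins, ties are broken by lower doc ID.
    static final Comparator<Hit> BEST_FIRST = (a, b) -> {
        int c = Float.compare(b.score, a.score);
        return c != 0 ? c : Integer.compare(a.doc, b.doc);
    };

    // Returns the n best hits that rank strictly after the cursor hit.
    // Pass null as the cursor to get the first page.
    static List<Hit> searchAfter(Hit after, List<Hit> allHits, int n) {
        // Worst-first heap: the head is the weakest kept hit, so it is cheap to evict.
        PriorityQueue<Hit> heap = new PriorityQueue<>(BEST_FIRST.reversed());
        for (Hit h : allHits) {
            if (after != null && BEST_FIRST.compare(h, after) <= 0) {
                continue; // ranked at or before the cursor, so already returned
            }
            heap.add(h);
            if (heap.size() > n) {
                heap.poll(); // keep at most n hits in memory
            }
        }
        List<Hit> page = new ArrayList<>(heap);
        page.sort(BEST_FIRST);
        return page;
    }
}
```

Every page still scans all matching hits, but the queue never holds more than n of them, which is the memory problem the original question describes.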

Uwe
--
Uwe Schindler
H.-H.-Meier-Allee 63, 28213 Bremen
http://www.thetaphi.de



Uwe Schindler <uw...@thetaphi.de> wrote:

Hi,

MultiReader is the way to go. MultiSearcher is broken and therefore deprecated. See javadocs since Lucene 3.1.

Uwe
--
Uwe Schindler
H.-H.-Meier-Allee 63, 28213 Bremen
http://www.thetaphi.de



Re: Possible to do an indexorder sort over a MultiSearcher?

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi,

MultiReader is the way to go. MultiSearcher is broken and therefore deprecated. See javadocs since Lucene 3.1.

Uwe
--
Uwe Schindler
H.-H.-Meier-Allee 63, 28213 Bremen
http://www.thetaphi.de


