Posted to java-user@lucene.apache.org by Joe Shaw <jo...@ximian.com> on 2007/03/07 18:04:46 UTC

Re: Using ParallelReader over large immutable index and small updatable index

Hi,

On Tue, 2007-03-06 at 15:34 -0500, Andy Liu wrote: 
> Is there a working solution out there that would let me use ParallelReader
> to search over a large, immutable index and a smaller, auxiliary index that
> is updated frequently?  Currently, from my understanding, the ParallelReader
> fails when one of the indexes is updated because the document IDs get out
> of sync.  Using ParallelReader in this way is attractive for me because it
> would allow me to quickly make updates to only the fields that change.

This is the idea, but from following the archives for the better part
of a year and running various searches over them, I'm not sure anyone
is actually using ParallelReader successfully in practice.

If I'm wrong about that, somebody please speak up!
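
For reference, the setup being discussed looks roughly like this: a
ParallelReader wrapping one IndexReader per index, with a searcher on
top.  This is only an untested sketch, and the paths are made up:

  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.ParallelReader;
  import org.apache.lucene.search.IndexSearcher;

  // Both readers must list their documents in exactly the same order:
  // doc N in the big index has to line up with doc N in the small one.
  ParallelReader parallel = new ParallelReader();
  parallel.add(IndexReader.open("/path/to/big-immutable-index"));
  parallel.add(IndexReader.open("/path/to/small-mutable-index"));
  IndexSearcher searcher = new IndexSearcher(parallel);

Once the small index is updated and its document numbers shift, that
alignment is gone, which is exactly the failure Andy describes.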

> The alternative is to use one index.  However, an update would require me to
> delete the entire document (which is quite large in my application) and
> reinsert it after making updates.  This requires a lot more I/O and is a lot
> slower, and I'd like to avoid this if possible.
> 
> I can think of other alternatives, but all involve storing data and/or
> bitsets in memory, which is not very scalable.  I need to be able to handle
> millions of documents.

I'm in a very similar situation, and we've taken the latter route.  We
get a bitset for our primary (immutable) index and one for our
secondary (mutable) index, and merge the results based on a unique ID
stored in the matching document in each index.  This isn't super fast
because we have to instantiate a Document to get at that ID, but
because the mutable index contains a lot less information than the
immutable one, it isn't too bad.  We then use TermDocs to jump to the
document with that ID in the primary index and set its bit.
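
In case a concrete shape helps, that mapping step looks roughly like
the sketch below.  This is untested and simplified; "uid" stands in
for whatever unique ID field you store in both indexes:

  import java.io.IOException;
  import java.util.BitSet;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.index.TermDocs;
  import org.apache.lucene.search.Hits;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.Query;

  // For each hit in the small mutable index, load its stored uid,
  // look that uid up in the big immutable index with TermDocs, and
  // set the matching bit in a BitSet sized to the primary reader.
  BitSet mapSecondaryHitsToPrimary(IndexReader primary,
                                   IndexSearcher secondary,
                                   Query query) throws IOException {
      BitSet bits = new BitSet(primary.maxDoc());
      Hits hits = secondary.search(query);
      for (int i = 0; i < hits.length(); i++) {
          Document doc = hits.doc(i);   // the slow part: materializing the Document
          String uid = doc.get("uid");
          TermDocs td = primary.termDocs(new Term("uid", uid));
          try {
              if (td.next()) {
                  bits.set(td.doc());   // mark the matching doc in the primary index
              }
          } finally {
              td.close();
          }
      }
      return bits;
  }

The bitset you get from the query against the primary index can then
be combined with this one however your merge logic requires.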

It probably doesn't scale to millions of matches, but it scales pretty
well to tens of thousands.  I'd suggest breaking your data down into
smaller indexes if you can and running this process across each of
them.  That way it'll take less memory and you can return the matches
in batches per-index.

Joe

