Posted to java-user@lucene.apache.org by Andy Liu <an...@gmail.com> on 2007/03/06 21:34:00 UTC

Using ParallelReader over large immutable index and small updatable index

Is there a working solution out there that would let me use ParallelReader
to search over a large, immutable index and a smaller, auxiliary index that
is updated frequently?  Currently, from my understanding, ParallelReader
fails when one of the indexes is updated because the document IDs get out
of sync.  Using ParallelReader in this way is attractive to me because it
would allow me to quickly make updates to only the fields that change.
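
To sketch what I mean (Lucene 2.1-style API; the index paths and class
wrapper here are just for illustration):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.ParallelReader;
import org.apache.lucene.search.IndexSearcher;

public class ParallelSearchSketch {
    public static void main(String[] args) throws Exception {
        // One large index that never changes, one small index that does.
        ParallelReader reader = new ParallelReader();
        reader.add(IndexReader.open("/indexes/big-immutable"));
        reader.add(IndexReader.open("/indexes/small-updatable"));

        // Search across fields from both indexes as if they were one.
        IndexSearcher searcher = new IndexSearcher(reader);
        // ... run queries here ...
        searcher.close();
        reader.close();

        // This only works while both indexes assign identical docIds
        // to the same logical document.
    }
}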

The alternative is to use one index.  However, an update would require me
to delete the entire document (which is quite large in my application) and
reinsert it after making updates.  This requires a lot more I/O and is a
lot slower, and I'd like to avoid it if possible.

I can think of other alternatives, but all involve storing data and/or
bitsets in memory, which is not very scalable.  I need to be able to handle
millions of documents.

I'm also open to any solution that doesn't involve ParallelReader and would
help me make quick updates in the most non-disruptive and scalable fashion.
But it just seems that ParallelReader would be perfect for my needs, if I
can get past this issue.

I've seen posts about this issue on the list, but nothing pointing to a
solution.  Can somebody help me out?

Andy

Re: Using ParallelReader over large immutable index and small updatable index

Posted by Joe Shaw <jo...@ximian.com>.
Hi,

On Tue, 2007-03-06 at 15:34 -0500, Andy Liu wrote: 
> Is there a working solution out there that would let me use ParallelReader
> to search over a large, immutable index and a smaller, auxiliary index that
> is updated frequently?  Currently, from my understanding, ParallelReader
> fails when one of the indexes is updated because the document IDs get out
> of sync.  Using ParallelReader in this way is attractive to me because it
> would allow me to quickly make updates to only the fields that change.

This is the idea, but from lurking on this list for the better part of a
year and doing various searches of the archives, I'm not sure anyone is
actually using ParallelReader successfully in practice.

If I'm wrong about that, somebody please speak up!

> The alternative is to use one index.  However, an update would require me
> to delete the entire document (which is quite large in my application) and
> reinsert it after making updates.  This requires a lot more I/O and is a
> lot slower, and I'd like to avoid it if possible.
>
> I can think of other alternatives, but all involve storing data and/or
> bitsets in memory, which is not very scalable.  I need to be able to
> handle millions of documents.

I'm in a very similar situation, and we've taken the latter route.  We
get a bitset for our primary (immutable) index and our secondary
(mutable) one, and merge the results based on a unique ID stored in the
matching document in each index.  This isn't super fast, because we have
to instantiate a Document to get at this ID, but because the mutable
index contains a lot less information than the immutable one, it isn't
too bad.  We then use TermDocs to jump to that ID in the primary index
and set its bit.
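
In sketch form, the merge looks roughly like this (not our actual code;
the "uid" field name and method shape are made up):

import java.io.IOException;
import java.util.BitSet;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

public class BitsetMergeSketch {
    // For each match in the small mutable index, look up the document
    // with the same unique ID in the big immutable index and set its bit.
    public static BitSet mergeIntoPrimary(IndexReader primary,
                                          IndexReader secondary,
                                          BitSet secondaryMatches)
            throws IOException {
        BitSet result = new BitSet(primary.maxDoc());
        for (int i = secondaryMatches.nextSetBit(0); i >= 0;
                 i = secondaryMatches.nextSetBit(i + 1)) {
            // Instantiating the Document just to read the ID is the
            // slow part mentioned above.
            Document doc = secondary.document(i);
            TermDocs td = primary.termDocs(new Term("uid", doc.get("uid")));
            if (td.next()) {
                result.set(td.doc());
            }
            td.close();
        }
        return result;
    }
}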

It probably doesn't scale to millions of matches, but it scales pretty
well to tens of thousands.  I'd suggest breaking your data down into
smaller indexes if you can, and running this process across each of
them.  That way it'll take less memory, and you can return the matches
in batches per index.

Joe


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Using ParallelReader over large immutable index and small updatable index

Posted by Andy Liu <an...@gmail.com>.
From my understanding, MultiSearcher is used to combine two indexes that
have the same fields but different documents, while ParallelReader is used
to combine two indexes that have the same documents but different fields.
I'm trying to do the latter.  Is my understanding correct?  For example,
what I'm trying to do is have one immutable index with these fields:

field1
field2
field3

and my "update" index that has one field

field4

Both indexes have the same documents, and the docIds are synchronized.
This allows me to execute searches like:

+field1:foo +field4:bar
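
Programmatically, that combined search would look something like this
(a sketch, assuming the ParallelReader setup described above):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class CrossIndexQuerySketch {
    public static BooleanQuery build() {
        // field1 lives in the immutable index, field4 in the update
        // index; the ParallelReader makes them look like one document.
        BooleanQuery q = new BooleanQuery();
        q.add(new TermQuery(new Term("field1", "foo")), BooleanClause.Occur.MUST);
        q.add(new TermQuery(new Term("field4", "bar")), BooleanClause.Occur.MUST);
        return q;
    }
}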

field4 is a field that would be updated frequently and in as close to real
time as possible.  However, once I update field4, the docIds are no longer
synchronized, and ParallelReader fails.

Andy

On 3/6/07, Alexey Lef <al...@sciquest.com> wrote:
>
> We use MultiSearcher for a similar scenario. This way you can keep the
> Searcher/Reader for the read-only index alive and refresh the small index
> Searcher whenever an update is made. If you have any cached filters, they
> are mapped to a Reader, so the cached filters for the big index will stay
> alive as well. The only (small) problem I have found so far is how
> MultiSearcher handles custom Similarity (see
> https://issues.apache.org/jira/browse/LUCENE-789).
>
> Hope this helps,
>
> Alexey
>
> -----Original Message-----
> From: Andy Liu [mailto:andyliu1227@gmail.com]
> Sent: Tuesday, March 06, 2007 3:34 PM
> To: java-user@lucene.apache.org
> Subject: Using ParallelReader over large immutable index and small updatable index
>
> Is there a working solution out there that would let me use ParallelReader
> to search over a large, immutable index and a smaller, auxiliary index that
> is updated frequently?  Currently, from my understanding, ParallelReader
> fails when one of the indexes is updated because the document IDs get out
> of sync.  Using ParallelReader in this way is attractive to me because it
> would allow me to quickly make updates to only the fields that change.
>
> The alternative is to use one index.  However, an update would require me
> to delete the entire document (which is quite large in my application) and
> reinsert it after making updates.  This requires a lot more I/O and is a
> lot slower, and I'd like to avoid it if possible.
>
> I can think of other alternatives, but all involve storing data and/or
> bitsets in memory, which is not very scalable.  I need to be able to
> handle millions of documents.
>
> I'm also open to any solution that doesn't involve ParallelReader and would
> help me make quick updates in the most non-disruptive and scalable fashion.
> But it just seems that ParallelReader would be perfect for my needs, if I
> can get past this issue.
>
> I've seen posts about this issue on the list, but nothing pointing to a
> solution.  Can somebody help me out?
>
> Andy

RE: Using ParallelReader over large immutable index and small updatable index

Posted by Alexey Lef <al...@sciquest.com>.
We use MultiSearcher for a similar scenario. This way you can keep the Searcher/Reader for the read-only index alive and refresh the small index Searcher whenever an update is made. If you have any cached filters, they are mapped to a Reader, so the cached filters for the big index will stay alive as well. The only (small) problem I have found so far is how MultiSearcher handles custom Similarity (see https://issues.apache.org/jira/browse/LUCENE-789). 
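
Roughly, the arrangement looks like this (just a sketch; the paths and
refresh strategy are simplified):

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Searchable;

public class TwoIndexSearchSketch {
    private final IndexSearcher bigSearcher;  // opened once, kept alive
    private IndexSearcher smallSearcher;      // reopened after each update
    private MultiSearcher multi;

    public TwoIndexSearchSketch() throws Exception {
        bigSearcher = new IndexSearcher("/indexes/big");
        smallSearcher = new IndexSearcher("/indexes/small");
        multi = new MultiSearcher(new Searchable[] { bigSearcher, smallSearcher });
    }

    // Cached filters are keyed on the big index's reader, which is
    // untouched here, so they stay valid across refreshes.  Note we
    // don't close the old MultiSearcher: that would close the big
    // searcher too.
    public void refreshSmallIndex() throws Exception {
        smallSearcher.close();
        smallSearcher = new IndexSearcher("/indexes/small");
        multi = new MultiSearcher(new Searchable[] { bigSearcher, smallSearcher });
    }

    public MultiSearcher getSearcher() {
        return multi;
    }
}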

Hope this helps,

Alexey 

-----Original Message-----
From: Andy Liu [mailto:andyliu1227@gmail.com] 
Sent: Tuesday, March 06, 2007 3:34 PM
To: java-user@lucene.apache.org
Subject: Using ParallelReader over large immutable index and small updatable index

Is there a working solution out there that would let me use ParallelReader
to search over a large, immutable index and a smaller, auxiliary index that
is updated frequently?  Currently, from my understanding, ParallelReader
fails when one of the indexes is updated because the document IDs get out
of sync.  Using ParallelReader in this way is attractive to me because it
would allow me to quickly make updates to only the fields that change.

The alternative is to use one index.  However, an update would require me
to delete the entire document (which is quite large in my application) and
reinsert it after making updates.  This requires a lot more I/O and is a
lot slower, and I'd like to avoid it if possible.

I can think of other alternatives, but all involve storing data and/or
bitsets in memory, which is not very scalable.  I need to be able to
handle millions of documents.

I'm also open to any solution that doesn't involve ParallelReader and would
help me make quick updates in the most non-disruptive and scalable fashion.
But it just seems that ParallelReader would be perfect for my needs, if I
can get past this issue.

I've seen posts about this issue on the list, but nothing pointing to a
solution.  Can somebody help me out?

Andy

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org