Posted to dev@lucene.apache.org by Grant Ingersoll <gs...@apache.org> on 2009/09/22 20:47:17 UTC

IndexWriter.getReader() was Re: How to leverage the LogMergePolicy "calibrateSizeByDeletes" patch in Solr ?

Slight divergence from the topic...

On Sep 22, 2009, at 10:48 AM, Michael McCandless wrote:

> John are you using IndexWriter.setMergedSegmentWarmer, so that a newly
> merged segment is warmed before it's "put into production" (returned
> by getReader)?
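
(For reference, the hookup Michael is describing is roughly this -- a sketch against
the 2.9 API, with the warm-up body left as a placeholder:)

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;

    class WarmerSetup {
      // Warm each newly merged segment before getReader() hands it out.
      static void installWarmer(IndexWriter writer) {
        writer.setMergedSegmentWarmer(new IndexWriter.IndexReaderWarmer() {
          public void warm(IndexReader reader) throws IOException {
            // Placeholder: touch whatever real searches need (norms,
            // FieldCache entries, a canned query, ...) so the first
            // live search against this segment isn't the slow one.
            reader.maxDoc();
          }
        });
      }
    }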

One of the pieces I am still missing from all of this is why IW.getReader() isn't now
just the preferred way of getting an IndexReader for all applications other than those
that are completely batch oriented.  Why bother with IndexReader.reopen()?
IW.getReader() is marked as Expert right now, which says to me there are some
tradeoffs or that one needs to be really careful using it, but I don't see the
downside other than some extra resources consumed and the fact that it is brand-new
code; or at least the downside is not documented.
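
To make the comparison concrete, the two refresh patterns I have in mind look roughly
like this (a sketch against the 2.9-era API; error handling omitted):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;

    class RefreshPatterns {
      // Reader-driven refresh: sees only committed changes.
      static IndexReader refreshViaReopen(IndexReader current) throws Exception {
        IndexReader maybeNew = current.reopen();
        if (maybeNew != current) {
          current.close();          // reopen() shares unchanged segments
          current = maybeNew;
        }
        return current;
      }

      // Writer-driven refresh: the returned reader also sees changes the
      // writer has flushed but not yet committed.
      static IndexReader refreshViaWriter(IndexWriter writer, IndexReader current)
          throws Exception {
        IndexReader fresh = writer.getReader();   // read-only NRT reader
        if (current != null) {
          current.close();
        }
        return fresh;
      }
    }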

And yet, at the first SF Meetup, I recall having a discussion with Michael B. about
this approach versus IR.reopen() that left me wondering which one is better, since
Lucene has, in fact, always been about incremental updates (since there are commercial
systems out there that require complete re-indexing), and getting IR.reopen to perform
is just a matter of tuning one's application with regard to reads and writes, versus
having to do all this work in the IndexWriter, which now tightly couples the
IndexReader to the IndexWriter.  Hopefully Michael can refresh my memory on the
conversation, as I may be remembering incorrectly.

-Grant



Re: IndexWriter.getReader() was Re: How to leverage the LogMergePolicy "calibrateSizeByDeletes" patch in Solr ?

Posted by Jason Rutherglen <ja...@gmail.com>.
> which one is better

Better for what? What use case are you thinking of?

The merge reasons were covered well in the previous thread.
Another gain is the carry-over of deletes in RAM.

I'm getting the feeling the Realtime wiki needs a lot of work.
http://wiki.apache.org/lucene-java/NearRealtimeSearch

On Tue, Sep 22, 2009 at 11:47 AM, Grant Ingersoll <gs...@apache.org> wrote:
> Slight divergence from the topic...
>
> On Sep 22, 2009, at 10:48 AM, Michael McCandless wrote:
>
>> John are you using IndexWriter.setMergedSegmentWarmer, so that a newly
>> merged segment is warmed before it's "put into production" (returned
>> by getReader)?
>
> One of the pieces I still am missing from all of this is why isn't
> IW.getReader() now just the preferred way of getting an IndexReader for all
> applications other than those that are completely batch oriented?  Why
> bother with IndexReader.reopen()?  IW.getReader() is marked as Expert right
> now, which says to me there are some tradeoffs or that one needs to be
> really careful using it, but I don't see the downside other than what
> appears to be some extra resources consumed and the fact that it is brand
> new code, or at least the downside is not documented.
>
> And yet, at the first SF Meetup, I recall having a discussion with Michael
> B. about this approach versus IR.reopen() that left me wondering which one
> is better, since, Lucene has, in fact, always been about incremental updates
> (since there are commercial systems out there that require complete
> re-indexing) and that getting IR.reopen to perform is just a matter of
> tuning one's application in regards to reads and writes vs. having to do all
> this work in the IndexWriter that now tightly couples the IndexReader to the
> IndexWriter.  Hopefully Michael can refresh my memory on the conversation,
> as I may be remembering incorrectly.
>
> -Grant
>


Re: IndexWriter.getReader() was Re: How to leverage the LogMergePolicy "calibrateSizeByDeletes" patch in Solr ?

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Tue, Sep 22, 2009 at 3:53 PM, Grant Ingersoll <gs...@apache.org> wrote:

>> But, the returned reader is read-only, so you can't use it to change
>> norms, do deletes, etc.
>
> Yeah, but an IW can do deletes, and if this IR is coupled to it
> anyway...

True, but IW's deletes are still buffered, and you can't delete by doc
ID with IW.
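
In other words, roughly (sketch against the 2.9 API; the "id" field is just an example):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.Directory;

    class DeleteContrast {
      // IndexWriter deletes by Term (or Query), and buffers them until they
      // are flushed/applied -- there is no delete-by-docID here.
      static void deleteViaWriter(IndexWriter writer) throws Exception {
        writer.deleteDocuments(new Term("id", "42"));
      }

      // Delete by internal doc ID needs a non-read-only IndexReader, so it
      // can't be done on the read-only reader that getReader() returns.
      static void deleteViaReader(Directory dir, int docID) throws Exception {
        IndexReader writable = IndexReader.open(dir, false);
        writable.deleteDocument(docID);
        writable.close();   // commits the deletion to the index
      }
    }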

>> But Directory is too low... we could probably get by with a class that
>> holds the SegmentReader cache (currently IndexWriter.ReaderPool), and
>> the "current" segmentInfos.  IW would interact with this class to get
>> the readers it needs, for applying deletes, merging, as well as
>> posting newly flushed but not yet committed segments, and IR would
>> then pull from this class to get the latest segments in the index and
> to check out the readers.
>
> Not sure why Directory, a public, well-known class, is considered too low (I
> thought you would say too high!) versus inner classes that assume an
> IndexWriter.  The reason I chose Directory is that it is the common thing the
> two already share, and it is already a public, well-known class that requires
> no extra understanding by users.  It's a first-class citizen.  By reusing it,
> apps can stay agnostic about where it came from, instead of having to wire in
> all this new stuff to handle ReaderPools, etc., rather than simply reusing the
> Directory stuff.

Sorry, by "low" I meant logically Directory is a lightweight file
access API -- it's at the low level of Lucene's stack.  I'm not sure
we should overload it by storing SegmentInfos, caching SegmentReaders
in it, etc.

Mike



Re: IndexWriter.getReader() was Re: How to leverage the LogMergePolicy "calibrateSizeByDeletes" patch in Solr ?

Posted by Grant Ingersoll <gs...@apache.org>.
On Sep 22, 2009, at 3:44 PM, Michael McCandless wrote:

> On Tue, Sep 22, 2009 at 2:53 PM, Grant Ingersoll  
> <gs...@apache.org> wrote:
>> One of the pieces I still am missing from all of this is why isn't
>> IW.getReader() now just the preferred way of getting an IndexReader
>> for all applications other than those that are completely batch
>> oriented?
>>
>> Why bother with IndexReader.reopen()?
>
> I agree: most apps should simply use getReader, as long as they're
> running in the same JVM as the IndexWriter and they're holding the
> IW open anyway.
>
> But, the returned reader is read-only, so you can't use it to change
> norms, do deletes, etc.

Yeah, but an IW can do deletes, and if this IR is coupled to it
anyway...

>
> The API really shouldn't be marked expert.  I'll go remove that...
>
>> Lucene has, in fact, always been about incremental updates (since
>> there are commercial systems out there that require complete
>> re-indexing)
>
> True, for writing.  But for reading, reopening a reader was very
> costly before 2.9 because the FieldCache entries had to be fully recomputed.
> So, switching to per-segment search/collect in 2.9 was the biggest
> step to reducing NRT reopen latency.
>
>> and that getting IR.reopen to perform is just a matter of tuning
>> one's application in regards to reads and writes vs. having to do
>> all this work in the IndexWriter that now tightly couples the
>> IndexReader to the IndexWriter.
>
> The integration with IndexWriter allows a reader to access segments
> that haven't yet been committed to the index.  This saves fsync()'ing
> the written files, saves writing a new segments_N file, saves flushing
> deletes to disk and then reloading them (we just share the BitVector
> directly in RAM now).  On many OS/filesystems fsync is surprisingly
> costly.
>
> LUCENE-1313, the next step for NRT, further reduces NRT reopen latency
> by allowing the small segments to remain in RAM, so when reopening
> your NRT reader after smallish add/deletes no IO is incurred.
>
> Beyond LUCENE-1313 we've discussed making IndexWriter's RAM buffer
> directly searchable, so you don't pay the cost of flushing a new
> segment when an NRT reader is reopened.
>
> Really we only need to further improve the approach here if the
> existing performance proves inadequate... in my limited testing the
> performance was excellent.
>
> Though, our inability to prioritize IO and control the OS's IO cache
> from Java is likely a far bigger impact on our NRT performance at
> this point than further improvements in our impl.  I'd love to see a
> Directory impl that "emulates" IO prioritization by making merging IO
> wait whenever search IO is live.  I think we need a JNI extension that
> taps into madvise/posix_fadvise, when possible.
>
>> FWIW, I still don't like the coupling of the two.  I think it would
>> be better if IW allowed you to get a Directory (or some other
>> appropriate representation) representing the in memory segment that
>> can then easily be added to an existing Searcher/Reader.  This would
>> at least decouple the two and instead use the common data structure
>> they both already share, i.e. the Directory.  Whether this is doable
>> or not, I am not sure.
>
> I agree the coupling is overkill.
>
> But Directory is too low... we could probably get by with a class that
> holds the SegmentReader cache (currently IndexWriter.ReaderPool), and
> the "current" segmentInfos.  IW would interact with this class to get
> the readers it needs, for applying deletes, merging, as well as
> posting newly flushed but not yet committed segments, and IR would
> then pull from this class to get the latest segments in the index and
> to check out the readers.

Not sure why Directory, a public, well-known class, is considered too low (I
thought you would say too high!) versus inner classes that assume an
IndexWriter.  The reason I chose Directory is that it is the common thing the
two already share, and it is already a public, well-known class that requires
no extra understanding by users.  It's a first-class citizen.  By reusing it,
apps can stay agnostic about where it came from, instead of having to wire in
all this new stuff to handle ReaderPools, etc., rather than simply reusing the
Directory stuff.

-Grant



Re: IndexWriter.getReader() was Re: How to leverage the LogMergePolicy "calibrateSizeByDeletes" patch in Solr ?

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Tue, Sep 22, 2009 at 2:53 PM, Grant Ingersoll <gs...@apache.org> wrote:
> One of the pieces I still am missing from all of this is why isn't
> IW.getReader() now just the preferred way of getting an IndexReader
> for all applications other than those that are completely batch
> oriented?
>
> Why bother with IndexReader.reopen()?

I agree: most apps should simply use getReader, as long as they're
running in the same JVM as the IndexWriter and they're holding the
IW open anyway.

But, the returned reader is read-only, so you can't use it to change
norms, do deletes, etc.

The API really shouldn't be marked expert.  I'll go remove that...

> Lucene has, in fact, always been about incremental updates (since
> there are commercial systems out there that require complete
> re-indexing)

True, for writing.  But for reading, reopening a reader was very
costly before 2.9 because the FieldCache entries had to be fully recomputed.
So, switching to per-segment search/collect in 2.9 was the biggest
step to reducing NRT reopen latency.
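
Roughly, the 2.9 per-segment hook looks like this (sketch; the "price" field is made
up -- the point is that FieldCache entries are now loaded per segment, so unchanged
segments keep theirs across a reopen):

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Collector;
    import org.apache.lucene.search.FieldCache;
    import org.apache.lucene.search.Scorer;

    class PerSegmentCollector extends Collector {
      private int[] prices;   // this segment's FieldCache array
      private int docBase;
      long total;             // example aggregate across all segments

      public void setNextReader(IndexReader reader, int docBase) throws IOException {
        // Called once per segment; only newly seen segments load a cache entry.
        this.prices = FieldCache.DEFAULT.getInts(reader, "price");
        this.docBase = docBase;
      }

      public void setScorer(Scorer scorer) {}

      public void collect(int doc) {
        // 'doc' is segment-relative; docBase + doc is the global doc ID.
        total += prices[doc];
      }

      public boolean acceptsDocsOutOfOrder() {
        return true;
      }
    }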

> and that getting IR.reopen to perform is just a matter of tuning
> one's application in regards to reads and writes vs. having to do
> all this work in the IndexWriter that now tightly couples the
> IndexReader to the IndexWriter.

The integration with IndexWriter allows a reader to access segments
that haven't yet been committed to the index.  This saves fsync()'ing
the written files, saves writing a new segments_N file, saves flushing
deletes to disk and then reloading them (we just share the BitVector
directly in RAM now).  On many OS/filesystems fsync is surprisingly
costly.
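
The end-to-end flow is tiny (sketch; the analyzer and field names are just placeholders):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    class NrtSketch {
      public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir,
            new StandardAnalyzer(Version.LUCENE_29),
            IndexWriter.MaxFieldLength.UNLIMITED);

        Document doc = new Document();
        doc.add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
        writer.addDocument(doc);

        // No commit(), no fsync, no new segments_N file -- yet the reader
        // still sees the uncommitted document.
        IndexReader nrt = writer.getReader();
        System.out.println("numDocs = " + nrt.numDocs());   // prints 1

        nrt.close();
        writer.close();
      }
    }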

LUCENE-1313, the next step for NRT, further reduces NRT reopen latency
by allowing the small segments to remain in RAM, so when reopening
your NRT reader after smallish add/deletes no IO is incurred.

Beyond LUCENE-1313 we've discussed making IndexWriter's RAM buffer
directly searchable, so you don't pay the cost of flushing a new
segment when an NRT reader is reopened.

Really we only need to further improve the approach here if the
existing performance proves inadequate... in my limited testing the
performance was excellent.

Though, our inability to prioritize IO and control the OS's IO cache
from Java is likely a far bigger impact on our NRT performance at
this point than further improvements in our impl.  I'd love to see a
Directory impl that "emulates" IO prioritization by making merging IO
wait whenever search IO is live.  I think we need a JNI extension that
taps into madvise/posix_fadvise, when possible.
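
A hypothetical building block for that emulation could be as simple as a shared gate
that a merge-side Directory/IndexInput wrapper consults before each chunk of IO (pure
sketch -- nothing like this exists today):

    import java.util.concurrent.atomic.AtomicInteger;

    // Hypothetical gate: merge reads poll this before each block of IO and
    // back off while any search IO is in flight.
    class SearchPriorityGate {
      private final AtomicInteger liveSearches = new AtomicInteger();

      void searchIoStarted()  { liveSearches.incrementAndGet(); }
      void searchIoFinished() { liveSearches.decrementAndGet(); }

      // Called from a merge-side Directory/IndexInput wrapper.
      void yieldToSearches() throws InterruptedException {
        while (liveSearches.get() > 0) {
          Thread.sleep(5);   // crude back-off; a real impl would park/notify
        }
      }
    }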

> FWIW, I still don't like the coupling of the two.  I think it would
> be better if IW allowed you to get a Directory (or some other
> appropriate representation) representing the in memory segment that
> can then easily be added to an existing Searcher/Reader.  This would
> at least decouple the two and instead use the common data structure
> they both already share, i.e. the Directory.  Whether this is doable
> or not, I am not sure.

I agree the coupling is overkill.

But Directory is too low... we could probably get by with a class that
holds the SegmentReader cache (currently IndexWriter.ReaderPool), and
the "current" segmentInfos.  IW would interact with this class to get
the readers it needs, for applying deletes, merging, as well as
posting newly flushed but not yet committed segments, and IR would
then pull from this class to get the latest segments in the index and
to check out the readers.

Such a shared "per-segment state" class could also be the basis for
app-specific custom caches to update themselves when new segments are
created, old ones are merged, etc.  Probably this class should break
out SR's core separately.  Hmm.
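
Very roughly, I'm picturing something like this (hypothetical sketch -- none of these
names exist):

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;

    // Hypothetical shared per-segment state, pulled out of IndexWriter.
    interface SegmentState {
      String[] currentSegments();            // committed + flushed-but-uncommitted
      IndexReader checkoutReader(String segment) throws IOException;
      void releaseReader(IndexReader reader) throws IOException;
      void addListener(SegmentListener listener);   // hook for app-level caches
    }

    // Callbacks an app cache could use to update itself incrementally.
    interface SegmentListener {
      void segmentFlushed(String segment);
      void segmentsMerged(String[] merged, String mergedInto);
    }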

Mike



Re: IndexWriter.getReader() was Re: How to leverage the LogMergePolicy "calibrateSizeByDeletes" patch in Solr ?

Posted by Grant Ingersoll <gs...@apache.org>.
On Sep 22, 2009, at 2:47 PM, Grant Ingersoll wrote:
>
>
> And yet, at the first SF Meetup, I recall having a discussion with  
> Michael B. about this approach versus IR.reopen() that left me  
> wondering which one is better, since, Lucene has, in fact, always  
> been about incremental updates (since there are commercial systems  
> out there that require complete re-indexing) and that getting  
> IR.reopen to perform is just a matter of tuning one's application in  
> regards to reads and writes vs. having to do all this work in the  
> IndexWriter that now tightly couples the IndexReader to the  
> IndexWriter.

FWIW, I still don't like the coupling of the two.  I think it would be  
better if IW allowed you to get a Directory (or some other appropriate  
representation) representing the in memory segment that can then  
easily be added to an existing Searcher/Reader.  This would at least  
decouple the two and instead use the common data structure they both  
already share, i.e. the Directory.  Whether this is doable or not, I  
am not sure.
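
To sketch what I mean (hypothetical -- there is no such IW method today; ramSegments
just stands in for whatever Directory a future API might hand back; the point is that
everything downstream of a Directory already works):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.MultiReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.Directory;

    class DecoupledSketch {
      // Combine the already-open committed reader with a reader over the
      // in-memory segments, without touching IndexWriter internals.
      static IndexSearcher combine(IndexReader committed, Directory ramSegments)
          throws Exception {
        IndexReader recent = IndexReader.open(ramSegments, true);   // read-only
        IndexReader merged = new MultiReader(new IndexReader[] { committed, recent });
        return new IndexSearcher(merged);
      }
    }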