Posted to java-user@lucene.apache.org by Michael McCandless <lu...@mikemccandless.com> on 2007/03/19 13:09:37 UTC

improving RAM usage by IndexWriter

Hi,

I've been looking into improving performance of IndexWriter,
specifically how it makes use of RAM to buffer added documents.

I've created a new class (MultiDocumentWriter) that can build a single
segment from many documents at once, more efficiently than the current
single document segment approach.  It buffers terms, freqs and
positions in memory and then periodically flushes them together.

This only affects the creation of an initial segment from added
documents.  I haven't changed anything after that, e.g. how segments
are merged.

The basic ideas are:

  * Write stored fields and term vectors directly to disk (don't
    use up RAM for these).

  * Gather posting lists & term infos in RAM, but periodically do
    in-RAM flushes.  Once RAM is full, flush buffers to disk (and
    merge them later when it's time to make a real segment).

  * When it's time to really build a segment, merge all postings lists
    (RAM and flushed) into the real segment files.

  * Recycle buffers/objects when possible (less stress & time spent on
    GC).
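
A minimal sketch of the buffer / flush / merge flow in the list above,
in plain Java.  This is not the MultiDocumentWriter code from the
patch: the class name, the byte estimates and the in-memory "runs"
(standing in for buffers flushed to disk) are all assumptions made for
illustration.  It tracks doc IDs only (no freqs or positions), does
not recycle buffers, and assumes documents arrive in increasing doc ID
order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: buffer postings in RAM, flush sorted runs, merge at the end. */
class PostingsBufferSketch {
    private final long ramBudgetBytes;   // assumed knob: flush once the estimate reaches this
    private long ramUsedBytes = 0;       // crude estimate, not real memory accounting
    private TreeMap<String, List<Integer>> ram = new TreeMap<>();
    // Each flushed run stands in for a sorted postings file written to disk.
    private final List<TreeMap<String, List<Integer>>> flushedRuns = new ArrayList<>();

    PostingsBufferSketch(long ramBudgetBytes) {
        this.ramBudgetBytes = ramBudgetBytes;
    }

    /** Buffer one document's terms, then flush if the RAM estimate reached the budget. */
    void addDocument(int docId, List<String> terms) {
        for (String term : terms) {
            List<Integer> docs = ram.get(term);
            if (docs == null) {
                docs = new ArrayList<>();
                ram.put(term, docs);
                ramUsedBytes += 2L * term.length();   // rough cost of storing the term
            }
            // Record each doc ID at most once per term.
            if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
                docs.add(docId);
                ramUsedBytes += 4;                    // rough cost of one doc ID
            }
        }
        if (ramUsedBytes >= ramBudgetBytes) {
            flush();
        }
    }

    /** Move the RAM buffer aside as a sorted run and start a fresh buffer. */
    void flush() {
        if (ram.isEmpty()) {
            return;
        }
        flushedRuns.add(ram);
        ram = new TreeMap<>();
        ramUsedBytes = 0;
    }

    int flushedRunCount() {
        return flushedRuns.size();
    }

    /** Merge all flushed runs plus the current RAM buffer into one term -> doc IDs map. */
    TreeMap<String, List<Integer>> buildSegment() {
        TreeMap<String, List<Integer>> segment = new TreeMap<>();
        List<TreeMap<String, List<Integer>>> sources = new ArrayList<>(flushedRuns);
        sources.add(ram);
        // Runs are visited in flush order, so each term's doc IDs stay ascending.
        for (TreeMap<String, List<Integer>> run : sources) {
            for (Map.Entry<String, List<Integer>> e : run.entrySet()) {
                segment.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                       .addAll(e.getValue());
            }
        }
        return segment;
    }

    public static void main(String[] args) {
        PostingsBufferSketch writer = new PostingsBufferSketch(20);
        writer.addDocument(0, Arrays.asList("apple", "pear"));
        writer.addDocument(1, Arrays.asList("apple"));
        writer.addDocument(2, Arrays.asList("pear", "apple"));
        writer.addDocument(3, Arrays.asList("apple"));
        System.out.println(writer.flushedRunCount() + " runs flushed");
        System.out.println(writer.buildSegment());
    }
}
```

The flush check runs only after a whole document has been buffered,
so no document is split across runs and the final merge can simply
concatenate each term's doc IDs in run order.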

I think some of these changes are similar to how KinoSearch builds a
segment.  But, I haven't made any changes to Lucene's file format nor
added requirements for a global fields schema.

With this change you can now tell IndexWriter how much RAM it can use
before flushing, which I think is better than setting max buffered
docs when documents are variable in size.  This is in fact the only
externally visible API change :)
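
To illustrate why a RAM-based trigger behaves better than a document
count when documents vary in size, here is a small sketch comparing
the peak amount buffered under each trigger.  The class and method
names are hypothetical and the "sizes" are made-up byte counts; this
is not the IndexWriter API, only the arithmetic behind the argument.

```java
/** Hypothetical sketch: compare peak buffered bytes under two flush triggers. */
class FlushTriggerSketch {

    /** Peak buffered bytes when flushing after every maxBufferedDocs documents. */
    static long peakBytesByDocCount(long[] docSizes, int maxBufferedDocs) {
        long buffered = 0;
        long peak = 0;
        int docs = 0;
        for (long size : docSizes) {
            buffered += size;
            docs++;
            peak = Math.max(peak, buffered);
            if (docs >= maxBufferedDocs) {   // trigger ignores how big the docs were
                buffered = 0;
                docs = 0;
            }
        }
        return peak;
    }

    /** Peak buffered bytes when flushing as soon as the RAM budget is reached. */
    static long peakBytesByRam(long[] docSizes, long ramBudgetBytes) {
        long buffered = 0;
        long peak = 0;
        for (long size : docSizes) {
            buffered += size;
            peak = Math.max(peak, buffered);
            if (buffered >= ramBudgetBytes) { // trigger tracks actual usage
                buffered = 0;
            }
        }
        return peak;
    }

    public static void main(String[] args) {
        long[] docSizes = {100, 100, 5000, 5000, 100, 100};
        System.out.println("by doc count: " + peakBytesByDocCount(docSizes, 2));
        System.out.println("by RAM:       " + peakBytesByRam(docSizes, 1000));
    }
}
```

With a doc-count trigger the peak depends on whichever run of
documents happens to be largest; with a RAM trigger it never exceeds
the budget plus the size of one document.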

I'm still working through some lingering issues before I can make a
clean patch, but it now passes all unit tests except the disk full
tests (I think we would need to change error semantics on disk full).

I've run some very preliminary performance tests, and this approach
provides a good speedup when equalizing RAM usage for a fair
comparison, especially when the docs are small.  (Note that this
speedup is just for the "indexing" part, and for many Lucene apps I
think other things, e.g. the Analyzer or retrieving docs from the
content source, are the bottleneck.)

This change also makes "commit only on close" mode (autoCommit=false
to IndexWriter) especially efficient because no segment is produced
until you close the IndexWriter, so no normal segment merging takes
place for the entire session.  You can build a massive index having
created only 1 segment at the end.

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: improving RAM usage by IndexWriter

Posted by Michael McCandless <lu...@mikemccandless.com>.
"Chris Hostetter" <ho...@fucit.org> wrote:

> A dirty, broken patch is still better than no patch at all -- worst
> case scenario: nothing happens; typical scenario: you get some eyeballs
> reading your patch even if they can't apply it; best case scenario:
> someone else is really excited by your patch and does a bunch of cool work
> to help you out.
> 
> no matter what happens, you can't lose.

Thanks Hoss, I've done this now: opened LUCENE-843 and attached a
dirty patch with my current state!

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: improving RAM usage by IndexWriter

Posted by Chris Hostetter <ho...@fucit.org>.
: It doesn't apply to SVN head just yet (I need to update the payloads
: commit -- Michael: you returned the favor here!), it's not yet thread friendly
: (though I can do that after first patch), it messes up the merge policy
: when you flush by RAM and not by document count (this is actually already
: an issue in Lucene now and I'm not yet sure how to fix it)...

A dirty, broken patch is still better than no patch at all -- worst
case scenario: nothing happens; typical scenario: you get some eyeballs
reading your patch even if they can't apply it; best case scenario:
someone else is really excited by your patch and does a bunch of cool work
to help you out.

no matter what happens, you can't lose.





-Hoss




Re: improving RAM usage by IndexWriter

Posted by Michael McCandless <lu...@mikemccandless.com>.
"Marvin Humphrey" <ma...@rectangular.com> wrote:

> > I think some of these changes are similar to how KinoSearch builds a
> > segment.
> 
> Yup...  sounds familiar.  ;)
> 
> > I'm still working through some lingering issues before I can make a
> > clean patch,
> 
> Well, where is it?  Don't keep it a secret!

Well, I still have a few things to do.  It's rather rough right now :)

It doesn't apply to SVN head just yet (I need to update the payloads
commit -- Michael: you returned the favor here!), it's not yet thread friendly
(though I can do that after first patch), it messes up the merge policy
when you flush by RAM and not by document count (this is actually already
an issue in Lucene now and I'm not yet sure how to fix it)...

Mike



Re: improving RAM usage by IndexWriter

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Mar 19, 2007, at 5:09 AM, Michael McCandless wrote:

> I think some of these changes are similar to how KinoSearch builds a
> segment.

Yup...  sounds familiar.  ;)

> I'm still working through some lingering issues before I can make a
> clean patch,

Well, where is it?  Don't keep it a secret!

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


