Posted to java-user@lucene.apache.org by SK R <rs...@gmail.com> on 2007/03/23 07:51:04 UTC

MergeFactor and MaxBufferedDocs value should ...?

Hi,
    I've looked at the uses of MergeFactor and MaxBufferedDocs.

    If I set MergeFactor = 100 and MaxBufferedDocs = 250, then the first 100
segments will be merged in RAMDir when 100 docs have arrived. After the 350th
doc is added to the writer, RAMDir will have 2 merged segment files + 50
separate segment files not yet merged together, and these are flushed to FSDir.

    If wrong, please correct me.

    My question is whether we should set MergeFactor & MaxBufferedDocs in a
proportional ratio (i.e. MaxBufferedDocs = n*MergeFactor where n = 1, 2, ...)
to reduce indexing time and get better performance, or whether there is no
need to worry about their relation.


Thanks & Regards
RSK

Re: MergeFactor and MaxBufferedDocs value should ...?

Posted by Erick Erickson <er...@gmail.com>.
I should add that in my situation, the number of documents that
fit in RAM is...er...problematic to determine. My current project
is composed of books, each of which I chose to index as a single
document.

Unfortunately, answering the question "how big is a book" doesn't
help much; they range from 2 pages to over 7,000 pages. So setting
the various indexing parameters, especially maxBufferedDocs, is a
hard balance between efficiency and memory. Will I happen
to get a string of 100 large books? If so, I need to set the limit
to a small number, which will not be terribly efficient for the
"usual" case.

That said, I don't much care about efficiency in this case. I can't
generate the index quickly (20,000+ books), but the factors I've
chosen let me generate it between the time I leave work and the
time I get back in the morning, so I don't really need much more
tweaking.

But this illustrates why I referred to picking factors as a "guess".
With a heterogeneous index where the documents vary widely
in size, picking parameters isn't straightforward. My current
parameters may not work if I index the documents in a different
order than I do currently. I just don't know.

They may not even work on the next set of data, since much of
the data is OCR, and for many books it's pretty trashy and/or
incomplete (imagine the OCR output of a genealogy book that
consists entirely of a stylized tree with the names written
by hand along the branches in many orientations!).

We're promised much better OCR data in the next set of books
we index, which may blow my current indexer out of the water.

Which is why I'm so glad that ramSizeInBytes() has been
added. It seems to me that I can now create a reasonably
generalized way to index heterogeneous documents with
"good enough" efficiency. I'm imagining keeping a few
simple statistics, like the size of each incoming document and
the change in index size as a result of indexing that doc. This
should allow me to figure out a reasonable factor for
predicting how much the *next* addition will increase the index,
and to flush RAM based upon that prediction. With, probably,
quite a large safety margin.

I don't really care if I get every last bit of efficiency in this case. What
I *do* care about is that the indexing run completes, and this
new capability seems to let me ensure that without
penalizing the bulk of my indexing because I have a few edge
cases.
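
Roughly what I have in mind, as an untested sketch against the 2.x API.
The RAM budget, the safety factor, the seed value, and the running-average
bookkeeping are all placeholders I made up; also, IndexWriter.flush() is
only public in newer releases -- on older ones you'd close and re-open
the writer to get the same effect:

import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

public class PredictiveFlusher {
    private static final long RAM_BUDGET = 32 * 1024 * 1024; // placeholder
    private static final double SAFETY = 3.0;    // quite a large margin

    private double growthPerByte = 2.0; // running estimate, arbitrary seed

    // rawSize is the size in bytes of the incoming document's text.
    public void add(IndexWriter writer, Document doc, long rawSize)
            throws IOException {
        // Predict how much this doc will grow the RAM buffer; if the
        // padded prediction would blow the budget, flush first.
        long predicted = (long) (SAFETY * growthPerByte * rawSize);
        if (writer.ramSizeInBytes() + predicted > RAM_BUDGET) {
            writer.flush(); // or close and re-open on older releases
        }
        long before = writer.ramSizeInBytes();
        writer.addDocument(doc);
        // Fold the observed growth into the running estimate.
        double observed = (writer.ramSizeInBytes() - before) / (double) rawSize;
        growthPerByte = 0.9 * growthPerByte + 0.1 * observed;
    }
}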

Anyway, thanks for adding this capability, which I'll probably
use in the pretty near future.

And thanks Michael for your explanation of what these factors
really do. It may have been documented before, but this time
it finally is sticking in my aging brain...

Erick


On 3/23/07, Michael McCandless <lu...@mikemccandless.com> wrote:
>
>
> "Erick Erickson" <er...@gmail.com> wrote:
> > I haven't used it yet, but I've seen several references to
> > IndexWriter.ramSizeInBytes() and using it to control when the writer
> > flushes the RAM. This seems like a more deterministic way of
> > making things efficient than trying various combinations of
> > maxBufferedDocs, MergeFactor, etc., all of which are guesses
> > at best.
>
> I agree this is the most efficient way to flush.  The one caveat is
> this Jira issue:
>
>   http://issues.apache.org/jira/browse/LUCENE-845
>
> which can cause over-merging if you make maxBufferedDocs too large.
>
> I think the rule of thumb to avoid this issue is 1) set
> maxBufferedDocs to be no more than 10X the "typical" number of docs
> you will flush, and then 2) flush by RAM usage.
>
> So for example if when you flush by RAM you typically flush "around"
> 200-300 docs, then setting maxBufferedDocs to eg 1000 is good since
> it's far above 200-300 (so it won't trigger a flush when you didn't
> want it to) but it's also well below 10X your range of docs (so it
> won't tickle the above bug).
>
> Mike
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: MergeFactor and MaxBufferedDocs value should ...?

Posted by Michael McCandless <lu...@mikemccandless.com>.
"Erick Erickson" <er...@gmail.com> wrote:
> I haven't used it yet, but I've seen several references to
> IndexWriter.ramSizeInBytes() and using it to control when the writer
> flushes the RAM. This seems like a more deterministic way of
> making things efficient than trying various combinations of
> maxBufferedDocs, MergeFactor, etc., all of which are guesses
> at best.

I agree this is the most efficient way to flush.  The one caveat is
this Jira issue:

  http://issues.apache.org/jira/browse/LUCENE-845

which can cause over-merging if you make maxBufferedDocs too large.

I think the rule of thumb to avoid this issue is 1) set
maxBufferedDocs to be no more than 10X the "typical" number of docs
you will flush, and then 2) flush by RAM usage.

So for example if when you flush by RAM you typically flush "around"
200-300 docs, then setting maxBufferedDocs to eg 1000 is good since
it's far above 200-300 (so it won't trigger a flush when you didn't
want it to) but it's also well below 10X your range of docs (so it
won't tickle the above bug).
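
In code the rule of thumb comes out something like this (just a sketch;
the 16 MB trigger is an arbitrary example, "docs" stands in for whatever
your document source is, and flush() is only public in newer releases --
before that, closing and re-opening the writer forces the same flush):

import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

public class FlushByRam {
    public static void indexAll(Iterable<Document> docs) throws IOException {
        IndexWriter writer = new IndexWriter(
            FSDirectory.getDirectory("/path/to/index", true),
            new StandardAnalyzer(), true);
        // Ceiling only: well above the ~200-300 docs a RAM flush will
        // typically hold, but below 10X that range (avoids LUCENE-845).
        writer.setMaxBufferedDocs(1000);
        for (Document doc : docs) {
            writer.addDocument(doc);
            if (writer.ramSizeInBytes() > 16 * 1024 * 1024) {
                writer.flush(); // flush by RAM usage, not by doc count
            }
        }
        writer.close();
    }
}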

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: MergeFactor and MaxBufferedDocs value should ...?

Posted by Erick Erickson <er...@gmail.com>.
I haven't used it yet, but I've seen several references to
IndexWriter.ramSizeInBytes() and using it to control when the writer
flushes the RAM. This seems like a more deterministic way of
making things efficient than trying various combinations of
maxBufferedDocs, MergeFactor, etc., all of which are guesses
at best.

I'd be really curious if it works for you...

Erick

On 3/23/07, SK R <rs...@gmail.com> wrote:
>
> Please clarify the following.
>
>      1. When will the segments in RAMDirectory be moved (flushed) into
> FSDirectory?
>
>      2. Segment creation by maxBufferedDocs occurs in RAMDir. Where does
> the merge by MergeFactor happen: in RAMDir or FSDir?
>
> Thanks in Advance
> RSK
>
>
> On 3/23/07, Michael McCandless <lu...@mikemccandless.com> wrote:
> >
> >
> > "SK R" <rs...@gmail.com> wrote:
> > >     If I set MergeFactor = 100 and MaxBufferedDocs = 250, then the
> > > first 100 segments will be merged in RAMDir when 100 docs have
> > > arrived. After the 350th doc is added to the writer, RAMDir will have
> > > 2 merged segment files + 50 separate segment files not yet merged
> > > together, and these are flushed to FSDir.
> > >
> > >     If wrong, please correct me.
> > >
> > >     My question is whether we should set MergeFactor & MaxBufferedDocs
> > > in a proportional ratio (i.e. MaxBufferedDocs = n*MergeFactor where
> > > n = 1, 2, ...) to reduce indexing time and get better performance, or
> > > whether there is no need to worry about their relation.
> >
> > Actually, maxBufferedDocs is how many docs are held in RAM before
> > flushing to a single segment.  So with 250, after adding the 250th doc
> > the writer will write the first segment; after adding the 500th doc,
> > it writes the second segment, etc.
> >
> > Then, mergeFactor says how many segments can be written before a merge
> > takes place.  A mergeFactor of 10 means after writing 10 such
> > segments from above, they will be merged into a single segment with
> > 2500 docs.  After another 2500 docs you'll have 2 such segments.  Then
> > once you've added your 25,000th doc, all of the 2500-doc segments will
> > be merged into a single 25,000-doc segment, etc.
> >
> > To maximize indexing performance you really want maxBufferedDocs to be
> > as large as you can handle (the bigger you make it, the more RAM is
> > required by the writer).
> >
> > I believe (not certain) larger values of mergeFactor will also improve
> > performance since it defers merging as long as possible.  However, the
> > larger you make this, the more segments are allowed to exist in your
> > index, and at some point you will hit file handle limits with your
> > searchers.
> >
> > I don't think these two parameters need to be proportional to one
> > another.  I don't think that will affect performance.
> >
> > Another performance boost is to turn off the compound file format, but
> > this has a severe cost of requiring far more file handles during searching.
> >
> > Mike
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
>

Re: MergeFactor and MaxBufferedDocs value should ...?

Posted by Michael McCandless <lu...@mikemccandless.com>.
"SK R" <rs...@gmail.com> wrote:
>      1. When will the segments in RAMDirectory be moved (flushed) into
> FSDirectory?

This is maxBufferedDocs.  Right now, every added doc creates its own
segment in the RAMDir.  After maxBufferedDocs docs, all of these
single-doc segments are merged and flushed to a single segment in the FSDir.

This is actually not really a very efficient way for IndexWriter to
use RAM.  I'm working on improving this / speeding it up under this
Jira issue:

    http://issues.apache.org/jira/browse/LUCENE-843

But it will be some time before this is stable & released!

>      2. Segment creation by maxBufferedDocs occurs in RAMDir.

Actually, no.  The segments created due to maxBufferedDocs are in FSDir.

> Where does the merge by MergeFactor happen: in RAMDir or FSDir?

This is always in FSDir.
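
You can actually watch this happen with a few lines of code (a sketch
against the 2.x API; the field, the path, and the doc count are made up):

import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

public class WatchRamBuffer {
    public static void main(String[] args) throws IOException {
        IndexWriter writer = new IndexWriter(
            FSDirectory.getDirectory("/tmp/testindex", true),
            new StandardAnalyzer(), true);
        writer.setMaxBufferedDocs(250);
        for (int i = 0; i < 1000; i++) {
            Document doc = new Document();
            doc.add(new Field("body", "some test text " + i,
                              Field.Store.NO, Field.Index.TOKENIZED));
            writer.addDocument(doc);
            // Climbs while docs buffer in the internal RAMDir, then
            // drops back each time the 250th buffered doc triggers the
            // flush to the FSDir.
            System.out.println(i + ": " + writer.ramSizeInBytes());
        }
        writer.close();
    }
}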

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: MergeFactor and MaxBufferedDocs value should ...?

Posted by SK R <rs...@gmail.com>.
Please clarify the following.

     1. When will the segments in RAMDirectory be moved (flushed) into
FSDirectory?

     2. Segment creation by maxBufferedDocs occurs in RAMDir. Where does the
merge by MergeFactor happen: in RAMDir or FSDir?

Thanks in Advance
RSK


On 3/23/07, Michael McCandless <lu...@mikemccandless.com> wrote:
>
>
> "SK R" <rs...@gmail.com> wrote:
> >     If I set MergeFactor = 100 and MaxBufferedDocs = 250, then the first
> > 100 segments will be merged in RAMDir when 100 docs have arrived. After
> > the 350th doc is added to the writer, RAMDir will have 2 merged segment
> > files + 50 separate segment files not yet merged together, and these are
> > flushed to FSDir.
> >
> >     If wrong, please correct me.
> >
> >     My question is whether we should set MergeFactor & MaxBufferedDocs in
> > a proportional ratio (i.e. MaxBufferedDocs = n*MergeFactor where
> > n = 1, 2, ...) to reduce indexing time and get better performance, or
> > whether there is no need to worry about their relation.
>
> Actually, maxBufferedDocs is how many docs are held in RAM before
> flushing to a single segment.  So with 250, after adding the 250th doc
> the writer will write the first segment; after adding the 500th doc,
> it writes the second segment, etc.
>
> Then, mergeFactor says how many segments can be written before a merge
> takes place.  A mergeFactor of 10 means after writing 10 such
> segments from above, they will be merged into a single segment with
> 2500 docs.  After another 2500 docs you'll have 2 such segments.  Then
> once you've added your 25,000th doc, all of the 2500-doc segments will
> be merged into a single 25,000-doc segment, etc.
>
> To maximize indexing performance you really want maxBufferedDocs to be
> as large as you can handle (the bigger you make it, the more RAM is
> required by the writer).
>
> I believe (not certain) larger values of mergeFactor will also improve
> performance since it defers merging as long as possible.  However, the
> larger you make this, the more segments are allowed to exist in your
> index, and at some point you will hit file handle limits with your
> searchers.
>
> I don't think these two parameters need to be proportional to one
> another.  I don't think that will affect performance.
>
> Another performance boost is to turn off the compound file format, but
> this has a severe cost of requiring far more file handles during searching.
>
> Mike
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: MergeFactor and MaxBufferedDocs value should ...?

Posted by Michael McCandless <lu...@mikemccandless.com>.
"SK R" <rs...@gmail.com> wrote:
>     If I set MergeFactor = 100 and MaxBufferedDocs = 250, then the first
> 100 segments will be merged in RAMDir when 100 docs have arrived. After
> the 350th doc is added to the writer, RAMDir will have 2 merged segment
> files + 50 separate segment files not yet merged together, and these are
> flushed to FSDir.
> 
>     If wrong, please correct me.
> 
>     My question is whether we should set MergeFactor & MaxBufferedDocs in
> a proportional ratio (i.e. MaxBufferedDocs = n*MergeFactor where
> n = 1, 2, ...) to reduce indexing time and get better performance, or
> whether there is no need to worry about their relation.

Actually, maxBufferedDocs is how many docs are held in RAM before
flushing to a single segment.  So with 250, after adding the 250th doc
the writer will write the first segment; after adding the 500th doc,
it writes the second segment, etc.

Then, mergeFactor says how many segments can be written before a merge
takes place.  A mergeFactor of 10 means after writing 10 such
segments from above, they will be merged into a single segment with
2500 docs.  After another 2500 docs you'll have 2 such segments.  Then
once you've added your 25,000th doc, all of the 2500-doc segments will
be merged into a single 25,000-doc segment, etc.

To maximize indexing performance you really want maxBufferedDocs to be
as large as you can handle (the bigger you make it, the more RAM is
required by the writer).

I believe (not certain) larger values of mergeFactor will also improve
performance since it defers merging as long as possible.  However, the
larger you make this, the more segments are allowed to exist in your
index, and at some point you will hit file handle limits with your
searchers.

I don't think these two parameters need to be proportional to one
another.  I don't think that will affect performance.

Another performance boost is to turn off the compound file format, but
this has a severe cost of requiring far more file handles during searching.
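
In code, the knobs discussed above look like this (a sketch against the
2.x API; the path and the values are only the examples from this thread):

import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

public class WriterTuning {
    public static IndexWriter openWriter() throws IOException {
        IndexWriter writer = new IndexWriter(
            FSDirectory.getDirectory("/path/to/index", true),
            new StandardAnalyzer(), true);
        writer.setMaxBufferedDocs(250);   // flush a segment every 250 docs
        writer.setMergeFactor(10);        // merge each 10 same-level segments
        writer.setUseCompoundFile(false); // faster indexing, but far more
                                          // file handles while searching
        return writer;
    }
}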

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: MergeFactor and MaxBufferedDocs value should ...?

Posted by Grant Ingersoll <gs...@apache.org>.
I would also suggest that contrib/benchmark in the source has a nice
framework for experimenting with different values of mergeFactor
and maxBufferedDocs.  It is quite easy to set it up for a new
collection (i.e. yours) and run experiments that alter these two values.

Below is a sample "algorithm" file that I have been trying out.  To
make it work on yours, you need only implement a DocMaker that works
for your collection (you probably already have the stuff for making
Documents; you just need to implement it behind the DocMaker interface
and plug it in).


merge.factor=merge:10:100:1000:5000:10:10:10:10:100:1000:100:100
max.buffered=max.buffered:10:10:10:10:100:1000:10000:21580:21580:21580:1000:10000
compound=true

analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
directory=FSDirectory
#directory=RamDirectory

doc.stored=true
doc.tokenized=true
doc.term.vector=false
doc.add.log.step=1000

docs.dir=reuters-out
#docs.dir=reuters-111

#doc.maker=org.apache.lucene.benchmark.byTask.feeds.SimpleDocMaker
doc.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersDocMaker

#query.maker=org.apache.lucene.benchmark.byTask.feeds.SimpleQueryMaker
query.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersQueryMaker

# task at this depth or less would print when they start
task.max.depth.log=2

log.queries=true
# -------------------------------------------------------------------------------------

{ "Rounds"

     ResetSystemErase

     { "Populate-Opt"
         CreateIndex
         { "MAddDocs" AddDoc > : 22000
         Optimize
         CloseIndex
     }

     NewRound

} : 13

RepSumByName
RepSumByPrefRound MAddDocs
RepSumByPrefRound Populate-Opt
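
To try it against your own collection, the only line you'd change is
doc.maker (the class name here is hypothetical):

doc.maker=com.example.BookDocMaker

and then, if I recall the driver class correctly, run the algorithm file
through the benchmark's main class with lucene-core and the
contrib/benchmark jars on the classpath:

java -cp <lucene-core + contrib/benchmark jars> \
    org.apache.lucene.benchmark.byTask.Benchmark books.alg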


On Mar 23, 2007, at 2:51 AM, SK R wrote:

> Hi,
>    I've looked at the uses of MergeFactor and MaxBufferedDocs.
>
>    If I set MergeFactor = 100 and MaxBufferedDocs = 250, then the first
> 100 segments will be merged in RAMDir when 100 docs have arrived. After
> the 350th doc is added to the writer, RAMDir will have 2 merged segment
> files + 50 separate segment files not yet merged together, and these are
> flushed to FSDir.
>
>    If wrong, please correct me.
>
>    My question is whether we should set MergeFactor & MaxBufferedDocs in
> a proportional ratio (i.e. MaxBufferedDocs = n*MergeFactor where
> n = 1, 2, ...) to reduce indexing time and get better performance, or
> whether there is no need to worry about their relation.
>
>
> Thanks & Regards
> RSK

--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org