Posted to dev@lucene.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2007/03/23 16:27:32 UTC

[jira] Commented: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge

    [ https://issues.apache.org/jira/browse/LUCENE-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12483631 ] 

Michael McCandless commented on LUCENE-845:
-------------------------------------------

This bug is actually rather serious.

If you set maxBufferedDocs to a very large number (on the expectation
that it's not used since you will manually flush by RAM usage) then
the merge policy will always merge the index down to 1 segment as soon
as it hits mergeFactor segments.

This will be an O(N^2) slowdown.  EG if based on RAM you are
flushing every 100 docs, then at 1000 docs you will merge to 1
segment.  Then at 1900 docs, you merge to 1 segment again.  At 2800,
3700, 4600, ... (every 900 docs) you keep merging to 1 segment.  Your
indexing process will get very slow because every 900 documents the
entire index is effectively being optimized.

With LUCENE-843 I'm thinking we should deprecate maxBufferedDocs
entirely and switch to flushing by RAM usage instead (you can always
manually flush every N documents in your app if for some reason you
need that).  But obviously we need to resolve this bug first.
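
For reference, the "flush by RAM" pattern being discussed looks roughly
like this (just a sketch against the 2.1-era API; the RAM budget, index
path, doc count and field name below are made-up values, not from this
issue):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class FlushByRamExample {
  public static void main(String[] args) throws Exception {
    long ramBudget = 32 * 1024 * 1024;  // assumed budget: flush after ~32 MB of buffered docs
    IndexWriter writer = new IndexWriter("/tmp/index", new StandardAnalyzer(), true);

    // This is the problematic part: maxBufferedDocs is set very high on the
    // assumption that it will never trigger, which confuses the merge policy
    // as described above.
    writer.setMergeFactor(10);
    writer.setMaxBufferedDocs(1000000);

    for (int i = 0; i < 100000; i++) {
      Document doc = new Document();
      doc.add(new Field("body", "text of document " + i,
                        Field.Store.NO, Field.Index.TOKENIZED));
      writer.addDocument(doc);
      if (writer.ramSizeInBytes() > ramBudget) {
        writer.flush();  // flush whenever buffered RAM crosses the budget
      }
    }
    writer.close();
  }
}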


> If you "flush by RAM usage" then IndexWriter may over-merge
> -----------------------------------------------------------
>
>                 Key: LUCENE-845
>                 URL: https://issues.apache.org/jira/browse/LUCENE-845
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>
> I think a good way to maximize performance of Lucene's indexing for a
> given amount of RAM is to flush (writer.flush()) the added documents
> whenever the RAM usage (writer.ramSizeInBytes()) has crossed the max
> RAM you can afford.
> But, this can confuse the merge policy and cause over-merging, unless
> you set maxBufferedDocs properly.
> This is because the merge policy looks at the current maxBufferedDocs
> to figure out which segments are level 0 (first flushed) or level 1
> (merged from <mergeFactor> level 0 segments).
> I'm not sure how to fix this.  Maybe we can look at net size (bytes)
> of a segment and "infer" level from this?  Still we would have to be
> resilient to the application suddenly increasing the RAM allowed.
> The good news is that to work around this bug I think you just need to
> ensure that your maxBufferedDocs is less than mergeFactor *
> typical-number-of-docs-flushed.
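
To put numbers on that workaround (a sketch only; mergeFactor and the
per-flush doc count here are assumed values, not measurements from the
issue):

public class MaxBufferedDocsCheck {
  public static void main(String[] args) {
    int mergeFactor = 10;
    int typicalDocsPerFlush = 100;  // docs that typically fit in the RAM budget
    int maxBufferedDocs = 500;      // candidate setting

    // Workaround: keep maxBufferedDocs below mergeFactor * typical docs
    // flushed so the merge policy's inferred segment levels stay sane.
    if (maxBufferedDocs >= mergeFactor * typicalDocsPerFlush) {
      System.out.println("maxBufferedDocs too large: the merge policy may over-merge");
    } else {
      System.out.println("maxBufferedDocs looks safe for this flush rate");
    }
  }
}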

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


RE: [jira] Commented: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge

Posted by Steven Parkes <st...@esseff.org>.
> Very long documents are useful for testing for anomalies, but they're 
> not so useful as retrieved documents, nor typical of applications.

That's what I thought, too. I'm kinda curious to see how Gutenberg
compares to Wikipedia for things like merge policy, in particular,
by-docs vs. by-bytes. But while it's a curiosity, I don't see that it has
all that much practical value, so I'm not sure I can get it to the top of
the to-do list.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: [jira] Commented: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge

Posted by Doug Cutting <cu...@apache.org>.
Steven Parkes wrote:
> And what about Project Gutenberg?
> 
> Wikipedia is going to have relatively short text, Gutenberg very long.

Very long documents are useful for testing for anomalies, but they're 
not so useful as retrieved documents, nor typical of applications.  Very 
long hits are awkward for users.  Book search engines usually operate 
best by breaking texts into small units (chapters, pages, overlapping 
windows, etc.) and searching those rather than the entire work, perhaps 
merging multiple hits from the same work in displayed results.  (See, 
e.g., California Digital Library's XTF system, built by Kirk Hastings 
using Lucene: http://www.cdlib.org/inside/projects/xtf/)
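
As a rough illustration of that small-units approach (a sketch only; the
window and overlap sizes, field names and book id are arbitrary choices
here, and this is not meant to describe how XTF actually does it):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class WindowedBookIndexer {
  public static void main(String[] args) throws Exception {
    IndexWriter writer = new IndexWriter("/tmp/books", new StandardAnalyzer(), true);
    String bookText = "... full text of one very long work ...";
    indexInWindows(writer, "example-book-id", bookText, 5000, 500);
    writer.close();
  }

  // Index one long work as many small, overlapping documents so a hit points
  // at a readable unit rather than at the entire book; hits sharing the same
  // "work" value can be merged again at display time.
  static void indexInWindows(IndexWriter writer, String workId, String text,
                             int windowChars, int overlapChars) throws Exception {
    for (int start = 0; start < text.length(); start += windowChars - overlapChars) {
      int end = Math.min(text.length(), start + windowChars);
      Document doc = new Document();
      doc.add(new Field("work", workId, Field.Store.YES, Field.Index.UN_TOKENIZED));
      doc.add(new Field("body", text.substring(start, end),
                        Field.Store.NO, Field.Index.TOKENIZED));
      writer.addDocument(doc);
      if (end == text.length()) break;
    }
  }
}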

I think Wikipedia is a much more typical use of Lucene.

Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


RE: [jira] Commented: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge

Posted by Steven Parkes <st...@esseff.org>.
And what about Project Gutenberg?

Wikipedia is going to have relatively short text, Gutenberg very long.

-----Original Message-----
From: Steven Parkes [mailto:steven_parkes@esseff.org] 
Sent: Friday, March 23, 2007 2:37 PM
To: java-dev@lucene.apache.org
Subject: RE: [jira] Commented: (LUCENE-845) If you "flush by RAM usage"
then IndexWriter may over-merge

Well, since I want to look at the impact of merge policy, I'll look into
this.

Wikipedia is easy to download (bandwidth notwithstanding). The bz2'd dump
of the current English pages is 2.1G. That's certainly a lot of data. It
looks like the English is about 1.8M docs.  All languages together is
something like 21M now.

I was also thinking of the TREC data but that seems hard to come by?

-----Original Message-----
From: Grant Ingersoll [mailto:gsingers@apache.org] 
Sent: Friday, March 23, 2007 1:09 PM
To: java-dev@lucene.apache.org
Subject: Re: [jira] Commented: (LUCENE-845) If you "flush by RAM usage"
then IndexWriter may over-merge

Yeah, I haven't played with millions of documents yet.  We will need a  
bigger test collection, I think!  Although the benchmarker can add as  
many as you want from the same source, index compression will affect  
the results, possibly more than a bigger collection with all unique docs  
would.

Maybe it is time to look at adding Wikipedia as a test collection.  I  
think there are something like 18+ million docs in it.

On Mar 23, 2007, at 4:01 PM, Doug Cutting wrote:

> Michael McCandless wrote:
>> Also, one caveat: whenever #docs (21578 for Reuters) divided by
>> maxBuffered docs is less than mergeFactor, you will have no merges
>> take place during your runs.  This greatly skews the results.
>
> Also, my guess is that this index fits entirely in the buffer  
> cache. Things behave quite differently when segments are larger  
> than available memory and merging requires lots of disk i/o.
>
> Doug
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>

--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ



---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


RE: [jira] Commented: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge

Posted by Steven Parkes <st...@esseff.org>.
Well, since I want to look at the impact of merge policy, I'll look into
this.

Wikipedia is easy to download (bandwidth notwithstanding). The bz2'd dump
of the current English pages is 2.1G. That's certainly a lot of data. It
looks like the English is about 1.8M docs.  All languages together is
something like 21M now.

I was also thinking of the TREC data but that seems hard to come by?

-----Original Message-----
From: Grant Ingersoll [mailto:gsingers@apache.org] 
Sent: Friday, March 23, 2007 1:09 PM
To: java-dev@lucene.apache.org
Subject: Re: [jira] Commented: (LUCENE-845) If you "flush by RAM usage"
then IndexWriter may over-merge

Yeah, I haven't played with millions of documents yet.  We will need a  
bigger test collection, I think!  Although the benchmarker can add as  
many as you want from the same source, index compression will affect  
the results, possibly more than a bigger collection with all unique docs  
would.

Maybe it is time to look at adding Wikipedia as a test collection.  I  
think there are something like 18+ million docs in it.

On Mar 23, 2007, at 4:01 PM, Doug Cutting wrote:

> Michael McCandless wrote:
>> Also, one caveat: whenever #docs (21578 for Reuters) divided by
>> maxBuffered docs is less than mergeFactor, you will have no merges
>> take place during your runs.  This greatly skews the results.
>
> Also, my guess is that this index fits entirely in the buffer  
> cache. Things behave quite differently when segments are larger  
> than available memory and merging requires lots of disk i/o.
>
> Doug
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>

--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ



---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: [jira] Commented: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge

Posted by Grant Ingersoll <gs...@apache.org>.
Yeah, I haven't played with millions of documents yet.  We will need a  
bigger test collection, I think!  Although the benchmarker can add as  
many as you want from the same source, index compression will affect  
the results, possibly more than a bigger collection with all unique docs  
would.

Maybe it is time to look at adding Wikipedia as a test collection.  I  
think there are something like 18+ million docs in it.

On Mar 23, 2007, at 4:01 PM, Doug Cutting wrote:

> Michael McCandless wrote:
>> Also, one caveat: whenever #docs (21578 for Reuters) divided by
>> maxBuffered docs is less than mergeFactor, you will have no merges
>> take place during your runs.  This greatly skews the results.
>
> Also, my guess is that this index fits entirely in the buffer  
> cache. Things behave quite differently when segments are larger  
> than available memory and merging requires lots of disk i/o.
>
> Doug
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>

--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ



---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: [jira] Commented: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge

Posted by Doug Cutting <cu...@apache.org>.
Michael McCandless wrote:
> Also, one caveat: whenever #docs (21578 for Reuters) divided by
> maxBuffered docs is less than mergeFactor, you will have no merges
> take place during your runs.  This greatly skews the results.

Also, my guess is that this index fits entirely in the buffer cache. 
Things behave quite differently when segments are larger than available 
memory and merging requires lots of disk i/o.

Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: [jira] Commented: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge

Posted by Michael McCandless <lu...@mikemccandless.com>.
"Grant Ingersoll" <gs...@apache.org> wrote:

> Your timing is ironic.  I was just running some benchmarks for  
> ApacheCon (using contrib/benchmarker) and noticed what I think are  
> similar happenings, so maybe you can validate my assumptions.  I'm  
> not sure if it is because I'm hitting RAM issues or not.
> 
> Below is the algorithm file for use w/ benchmarker.  To run it, save  
> the file, cd into contrib/benchmarker (make sure you get the latest  
> commits) and run
> ant run-task -Dtask.mem=XXXXm -Dtask.alg=<path to file>
> 
> The basic idea is, there are ~21580 docs in the Reuters collection, so I wanted  
> to run some experiments around them with different merge factors and  
> max.buffered.  Granted, some of the factors are ridiculous, but I  
> wanted to look at these a bit b/c you see people on the user list  
> from time to time talking/asking about setting really high numbers  
> for mergeFactor and maxBufferedDocs.
> 
> The sweet spot on my machine seems to be mergeFactor == 100,  
> maxBD=1000.  I ran with -Dtask.mem=1024M on a machine with 2gb of  
> RAM.  If I am understanding the numbers correctly, and what you are  
> arguing, this sweet spot happens to coincide approximately with the  
> amount of memory I gave the process.  I probably could play a little  
> bit more with options to reach the inflection point.  So, to some  
> extent, I think your approach for RAM based modeling is worth pursuing.

Interesting results!  Notably, an even higher maxBufferedDocs (10000 =
299.1 rec/s and 21580 = 271.9 rec/s, @ mergeFactor=100) gave you worse
performance even though those runs were able to complete (meaning you had
enough RAM to buffer all those docs).  Perhaps this is because GC had
to work harder?  So it seems the benefits of giving more RAM taper off
at some point.

Also, one caveat: whenever #docs (21578 for Reuters) divided by
maxBuffered docs is less than mergeFactor, you will have no merges
take place during your runs.  This greatly skews the results.
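
To make that caveat concrete (a quick worked example; the doc count is the
Reuters number from this thread, and mergeFactor=10 is just the default,
assumed here):

public class MergeCountCheck {
  public static void main(String[] args) {
    int numDocs = 21578;
    int mergeFactor = 10;
    for (int maxBufferedDocs : new int[] {1000, 10000, 21580}) {
      // segments flushed during the run (ceiling division)
      int flushed = (numDocs + maxBufferedDocs - 1) / maxBufferedDocs;
      boolean mergesHappen = flushed >= mergeFactor;
      System.out.println("maxBufferedDocs=" + maxBufferedDocs + ": " + flushed
          + " flushed segments, merges during run: " + mergesHappen);
    }
  }
}

So at maxBufferedDocs=10000 or 21580 with the default mergeFactor, the whole
run produces fewer than mergeFactor segments and no merge is ever triggered.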

I'm also struggling with this on LUCENE-843 because it makes it far
harder to do an apples-to-apples comparison.  With the patch for
LUCENE-843, many more docs can be buffered into a given fixed amount of
RAM.  So it flushes less often and may hit no merges (when the baseline
Lucene trunk does hit merges), or the opposite: it may hit one massive
merge close to the end when the baseline Lucene trunk did a few small
merges.  We sort of need some metric that can normalize for how much
"merge servicing" took place during a run, or something.

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: [jira] Commented: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge

Posted by Grant Ingersoll <gs...@apache.org>.
Hi Mike,

Your timing is ironic.  I was just running some benchmarks for  
ApacheCon (using contrib/benchmarker) and noticed what I think are  
similar happenings, so maybe you can validate my assumptions.  I'm  
not sure if it is because I'm hitting RAM issues or not.

Below is the algorithm file for use w/ benchmarker.  To run it, save  
the file, cd into contrib/benchmarker (make sure you get the latest  
commits) and run
ant run-task -Dtask.mem=XXXXm -Dtask.alg=<path to file>

The basic idea is, there are ~21580 docs in the Reuters collection, so I wanted  
to run some experiments around them with different merge factors and  
max.buffered.  Granted, some of the factors are ridiculous, but I  
wanted to look at these a bit b/c you see people on the user list  
from time to time talking/asking about setting really high numbers  
for mergeFactor and maxBufferedDocs.

The sweet spot on my machine seems to be mergeFactor == 100,  
maxBD=1000.  I ran with -Dtask.mem=1024M on a machine with 2gb of  
RAM.  If I am understanding the numbers correctly, and what you are  
arguing, this sweet spot happens to coincide approximately with the  
amount of memory I gave the process.  I probably could play a little  
bit more with options to reach the inflection point.  So, to some  
extent, I think your approach for RAM based modeling is worth pursuing.

Mostly this is just food for thought.  I think what I am doing is  
correct, but am open to suggestions.

Here are my results:
  [java] ------------> Report Sum By (any) Name (6 about 66 out of 66)
     [java] Operation      round merge max.buffered   runCnt   recsPerRun        rec/s  elapsedSec    avgUsedMem    avgTotalMem
     [java] Rounds_13          0    10           10        1       286039        183.0    1,563.30   956,043,840  1,065,484,288
     [java] Populate-Opt -  -  - -   - -  -  -  - - -  -  13 -  -   22003 -  -   184.6 -  1,549.36 - 347,786,464 -  461,652,288
     [java] CreateIndex        -     -            -       13            1         43.9        0.30   103,676,920    380,309,824
     [java] MAddDocs_22000 -   - -   - -  -  -  - - -  -  13 -  -   22000 -  -   195.9 -  1,459.75 - 358,755,040 -  461,652,288
     [java] Optimize           -     -            -       13            1          0.1       89.29   365,944,832    461,652,288
     [java] CloseIndex -  -  - - -   - -  -  -  - - -  -  13 -  -  -  - 1 -  -   866.7 -  -   0.01 - 347,786,464 -  461,652,288


     [java] ------------> Report sum by Prefix (MAddDocs) and Round (13 about 13 out of 66)
     [java] Operation      round merge max.buffered   runCnt   recsPerRun        rec/s  elapsedSec    avgUsedMem    avgTotalMem
     [java] MAddDocs_22000     0    10           10        1        22000        142.3      154.59     6,969,024     12,271,616
     [java] MAddDocs_22000 -   1 -  50 -  -  -   10 -  -   1 -  -   22000 -  -   159.7 -  - 137.75 -   7,517,728 -   12,861,440
     [java] MAddDocs_22000     2   100           10        1        22000        156.7      140.38     9,460,648     13,668,352
     [java] MAddDocs_22000 -   3  1000 -  -  -   10 -  -   1 -  -   22000 -  -   145.4 -  - 151.33 -  29,072,880 -   36,892,672
     [java] MAddDocs_22000     4  2000           10        1        22000        112.0      196.47    38,067,048     51,974,144
     [java] MAddDocs_22000 -   5 -  10 -  -  -  100 -  -   1 -  -   22000 -  -   161.9 -  - 135.89 -  40,896,336 -   51,974,144
     [java] MAddDocs_22000     6    10         1000        1        22000        266.9       82.44    53,033,616     71,766,016
     [java] MAddDocs_22000 -   7 -  10 -  -   10000 -  -   1 -  -   22000 -  -   288.9 -  -  76.14 - 392,512,032 -  422,649,856
     [java] MAddDocs_22000     8    10        21580        1        22000        272.0       80.89   708,970,944  1,065,484,288
     [java] MAddDocs_22000 -   9 - 100 -  -   21580 -  -   1 -  -   22000 -  -   271.9 -  -  80.91 - 767,851,072  1,065,484,288
     [java] MAddDocs_22000    10  1000        21580        1        22000        275.4       79.89   767,510,464  1,065,484,288
#Sweet Spot for this test
     [java] MAddDocs_22000 -  11 - 100 -  -  - 1000 -  -   1 -  -   22000 -  -   316.5 -  -  69.52 - 924,356,864  1,065,484,288
     [java] MAddDocs_22000    12   100        10000        1        22000        299.1       73.56   917,596,992  1,065,484,288


     [java] ------------> Report sum by Prefix (Populate-Opt) and Round (13 about 13 out of 66)
     [java] Operation    round merge max.buffered   runCnt   recsPerRun        rec/s  elapsedSec    avgUsedMem    avgTotalMem
     [java] Populate-Opt     0    10           10        1        22003        136.0      161.75     7,331,992     12,271,616
     [java] Populate-Opt -   1 -  50 -  -  -   10 -  -   1 -  -   22003 -  -   151.8 -  - 144.99 -   8,065,640 -   12,861,440
     [java] Populate-Opt     2   100           10        1        22003        149.6      147.06     9,927,872     13,668,352
     [java] Populate-Opt -   3  1000 -  -  -   10 -  -   1 -  -   22003 -  -   138.9 -  - 158.38 -  32,094,624 -   36,892,672
     [java] Populate-Opt     4  2000           10        1        22003        105.8      207.91    41,058,208     51,974,144
     [java] Populate-Opt -   5 -  10 -  -  -  100 -  -   1 -  -   22003 -  -   156.0 -  - 141.03 -  41,375,032 -   51,974,144
     [java] Populate-Opt     6    10         1000        1        22003        249.5       88.20    53,494,472     71,766,016
     [java] Populate-Opt -   7 -  10 -  -   10000 -  -   1 -  -   22003 -  -   259.5 -  -  84.78 - 226,485,280 -  422,649,856
     [java] Populate-Opt     8    10        21580        1        22003        254.6       86.44   675,577,344  1,065,484,288
     [java] Populate-Opt -   9 - 100 -  -   21580 -  -   1 -  -   22003 -  -   253.5 -  -  86.78 - 791,214,016  1,065,484,288
     [java] Populate-Opt    10  1000        21580        1        22003        258.7       85.06   790,837,440  1,065,484,288
     [java] Populate-Opt -  11 - 100 -  -  - 1000 -  -   1 -  -   22003 -  -   289.9 -  -  75.89 - 887,718,272  1,065,484,288
     [java] Populate-Opt    12   100        10000        1        22003        271.3       81.09   956,043,840  1,065,484,288



#last value is more than all the docs in reuters
merge.factor=merge:10:100:1000:5000:10:10:10:10:100:1000
max.buffered=max.buffered:10:10:10:10:100:1000:10000:21580:21580:21580
compound=true

analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
directory=FSDirectory
#directory=RamDirectory

doc.stored=true
doc.tokenized=true
doc.term.vector=false
doc.add.log.step=1000

docs.dir=reuters-out
#docs.dir=reuters-111

#doc.maker=org.apache.lucene.benchmark.byTask.feeds.SimpleDocMaker
doc.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersDocMaker

#query.maker=org.apache.lucene.benchmark.byTask.feeds.SimpleQueryMaker
query.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersQueryMaker

# task at this depth or less would print when they start
task.max.depth.log=2

log.queries=true
# -------------------------------------------------------------------------------------

{ "Rounds"

     ResetSystemErase

     { "Populate-Opt"
         CreateIndex
         { "MAddDocs" AddDoc > : 22000
         Optimize
         CloseIndex
     }

     NewRound

} : 10

RepSumByName
RepSumByPrefRound MAddDocs
RepSumByPrefRound Populate-Opt


On Mar 23, 2007, at 11:27 AM, Michael McCandless (JIRA) wrote:

>
>     [ https://issues.apache.org/jira/browse/LUCENE-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12483631 ]
>
> Michael McCandless commented on LUCENE-845:
> -------------------------------------------
>
> This bug is actually rather serious.
>
> If you set maxBufferedDocs to a very large number (on the expectation
> that it's not used since you will manually flush by RAM usage) then
> the merge policy will always merge the index down to 1 segment as soon
> as it hits mergeFactor segments.
>
> This will be an O(N^2) slowdown.  EG if based on RAM you are
> flushing every 100 docs, then at 1000 docs you will merge to 1
> segment.  Then at 1900 docs, you merge to 1 segment again.  At 2800,
> 3700, 4600, ... (every 900 docs) you keep merging to 1 segment.  Your
> indexing process will get very slow because every 900 documents the
> entire index is effectively being optimized.
>
> With LUCENE-843 I'm thinking we should deprecate maxBufferedDocs
> entirely and switch to flushing by RAM usage instead (you can always
> manually flush every N documents in your app if for some reason you
> need that).  But obviously we need to resolve this bug first.
>
>
>> If you "flush by RAM usage" then IndexWriter may over-merge
>> -----------------------------------------------------------
>>
>>                 Key: LUCENE-845
>>                 URL: https://issues.apache.org/jira/browse/LUCENE-845
>>             Project: Lucene - Java
>>          Issue Type: Bug
>>          Components: Index
>>    Affects Versions: 2.1
>>            Reporter: Michael McCandless
>>         Assigned To: Michael McCandless
>>            Priority: Minor
>>
>> I think a good way to maximize performance of Lucene's indexing for a
>> given amount of RAM is to flush (writer.flush()) the added documents
>> whenever the RAM usage (writer.ramSizeInBytes()) has crossed the max
>> RAM you can afford.
>> But, this can confuse the merge policy and cause over-merging, unless
>> you set maxBufferedDocs properly.
>> This is because the merge policy looks at the current maxBufferedDocs
>> to figure out which segments are level 0 (first flushed) or level 1
>> (merged from <mergeFactor> level 0 segments).
>> I'm not sure how to fix this.  Maybe we can look at net size (bytes)
>> of a segment and "infer" level from this?  Still we would have to be
>> resilient to the application suddenly increasing the RAM allowed.
>> The good news is that to work around this bug I think you just need to
>> ensure that your maxBufferedDocs is less than mergeFactor *
>> typical-number-of-docs-flushed.
>
> -- 
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>

--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ



---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org