Posted to dev@lucene.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2007/03/22 21:16:32 UTC

[jira] Created: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge

If you "flush by RAM usage" then IndexWriter may over-merge
-----------------------------------------------------------

                 Key: LUCENE-845
                 URL: https://issues.apache.org/jira/browse/LUCENE-845
             Project: Lucene - Java
          Issue Type: Bug
          Components: Index
    Affects Versions: 2.1
            Reporter: Michael McCandless
         Assigned To: Michael McCandless
            Priority: Minor


I think a good way to maximize performance of Lucene's indexing for a
given amount of RAM is to flush (writer.flush()) the added documents
whenever the RAM usage (writer.ramSizeInBytes()) has crossed the max
RAM you can afford.

But, this can confuse the merge policy and cause over-merging, unless
you set maxBufferedDocs properly.

This is because the merge policy looks at the current maxBufferedDocs
to figure out which segments are level 0 (first flushed) or level 1
(merged from <mergeFactor> level 0 segments).

I'm not sure how to fix this.  Maybe we can look at net size (bytes)
of a segment and "infer" level from this?  Still we would have to be
resilient to the application suddenly increasing the RAM allowed.

The good news is that to work around this bug I think you just need to
ensure that your maxBufferedDocs is less than mergeFactor *
typical-number-of-docs-flushed.
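
As a purely illustrative sketch of the above (using the methods already
named here: writer.ramSizeInBytes(), writer.flush(), addDocument();
maxRamBytes, docs and typicalDocsPerFlush are placeholders, not part of
the IndexWriter API):

    // Flush whenever buffered RAM crosses the budget we can afford:
    long maxRamBytes = 32L * 1024 * 1024;   // e.g. a 32 MB budget (placeholder)
    for (Iterator it = docs.iterator(); it.hasNext();) {
        writer.addDocument((Document) it.next());
        if (writer.ramSizeInBytes() >= maxRamBytes) {
            writer.flush();
        }
    }
    // Workaround described above: keep maxBufferedDocs under
    // mergeFactor * (typical docs per flush), e.g.
    // writer.setMaxBufferedDocs(mergeFactor * typicalDocsPerFlush - 1);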


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: [jira] Commented: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge

Posted by Michael McCandless <lu...@mikemccandless.com>.
"Ning Li" <ni...@gmail.com> wrote:
> On 3/26/07, Michael McCandless (JIRA) <ji...@apache.org> wrote:
> > Ahhh, this is a very good point.  OK I won't deprecate "flushing by
> > doc count" and instead will allow either "flush by RAM usage" (default
> > to this?) or "flush by doc count".
> 
> Just want to clarify: It's either "flush and merge by byte size" or
> "flush and merge by doc count", right?

Good point: to keep the doc IDs identical, the merge policy must also be
identical.  But I think we should still default to "flush by RAM usage"
and "merge by segment size"?  And then developers who rely on doc IDs
following a specific controlled pattern (e.g. for ParallelReader) would
set the writer to flush by doc count and also select the "by doc count"
merge policy.
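
For concreteness, a rough sketch of the two setups, assuming the setters
discussed in this thread (setMaxBufferedDocs, plus the RAM-buffer setter
coming out of LUCENE-843/845, spelled setRAMBufferSizeMB here for
illustration) and noting that a separate "by doc count" merge policy
selector is still hypothetical:

    // Proposed default: flush by RAM usage, merge by segment byte size.
    writer.setRAMBufferSizeMB(32.0);

    // ParallelReader-style apps: flush strictly by doc count so doc IDs
    // follow a fixed pattern, and pair it with a "by doc count" merge policy.
    writer.setMaxBufferedDocs(1000);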

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: [jira] Commented: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge

Posted by Ning Li <ni...@gmail.com>.
On 3/26/07, Michael McCandless (JIRA) <ji...@apache.org> wrote:
> Ahhh, this is a very good point.  OK I won't deprecate "flushing by
> doc count" and instead will allow either "flush by RAM usage" (default
> to this?) or "flush by doc count".

Just want to clarify: It's either "flush and merge by byte size" or
"flush and merge by doc count", right?

Ning

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12520181 ] 

Michael McCandless commented on LUCENE-845:
-------------------------------------------


> This increases file descriptor usage in some cases, right? In the
> old scheme, if you set mergeFactor to 10 and maxBufferedDocs to
> 1000, you'd only get 10 segments with size <= 1000. But with this
> code, you can't bound that anymore. If I create single doc segments
> (perhaps by flushing based on latency), I can get 30 of them?

Right, the # of segments allowed in the index will be higher than with
the current merge policy if you consistently flush with [far] fewer
docs than maxBufferedDocs is set to.

But, this is actually the essence of the bug.  The case we're trying
to fix is where you set maxBufferedDocs to something really large (say
1,000,000) to avoid flushing by doc count, and use setRamBufferSizeMB
to allow something like 32 MB.  In this case, the current merge policy
would just keep merging any set of 10 segments with < 1,000,000 docs
each, such that eventually all your indexing time is being spent doing
highly sub-optimal merges.



> If you "flush by RAM usage" then IndexWriter may over-merge
> -----------------------------------------------------------
>
>                 Key: LUCENE-845
>                 URL: https://issues.apache.org/jira/browse/LUCENE-845
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-845.patch
>
>
> I think a good way to maximize performance of Lucene's indexing for a
> given amount of RAM is to flush (writer.flush()) the added documents
> whenever the RAM usage (writer.ramSizeInBytes()) has crossed the max
> RAM you can afford.
> But, this can confuse the merge policy and cause over-merging, unless
> you set maxBufferedDocs properly.
> This is because the merge policy looks at the current maxBufferedDocs
> to figure out which segments are level 0 (first flushed) or level 1
> (merged from <mergeFactor> level 0 segments).
> I'm not sure how to fix this.  Maybe we can look at net size (bytes)
> of a segment and "infer" level from this?  Still we would have to be
> resilient to the application suddenly increasing the RAM allowed.
> The good news is to workaround this bug I think you just need to
> ensure that your maxBufferedDocs is less than mergeFactor *
> typical-number-of-docs-flushed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12484150 ] 

Yonik Seeley commented on LUCENE-845:
-------------------------------------

> With LUCENE-843 I'm thinking we should deprecate maxBufferedDocs entirely

That might present a problem for users of ParallelReader.  Right now, it's possible to construct two indices with corresponding docids... switching to flush-by-RAM would make segment merging unpredictable and destroy the docid matching.
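
For reference, the kind of setup at risk looks roughly like this (the index
locations are placeholders); ParallelReader joins document N of one index
with document N of the other, so both indexes must be built with identical
flush/merge behavior:

    ParallelReader reader = new ParallelReader();
    reader.add(IndexReader.open("/path/to/index-with-stable-fields"));    // placeholder path
    reader.add(IndexReader.open("/path/to/index-with-changing-fields"));  // placeholder path
    // Fields are joined purely by docid, so unpredictable merging in either
    // index would silently scramble which fields belong to which document.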

> If you "flush by RAM usage" then IndexWriter may over-merge
> -----------------------------------------------------------
>
>                 Key: LUCENE-845
>                 URL: https://issues.apache.org/jira/browse/LUCENE-845
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>
> I think a good way to maximize performance of Lucene's indexing for a
> given amount of RAM is to flush (writer.flush()) the added documents
> whenever the RAM usage (writer.ramSizeInBytes()) has crossed the max
> RAM you can afford.
> But, this can confuse the merge policy and cause over-merging, unless
> you set maxBufferedDocs properly.
> This is because the merge policy looks at the current maxBufferedDocs
> to figure out which segments are level 0 (first flushed) or level 1
> (merged from <mergeFactor> level 0 segments).
> I'm not sure how to fix this.  Maybe we can look at net size (bytes)
> of a segment and "infer" level from this?  Still we would have to be
> resilient to the application suddenly increasing the RAM allowed.
> The good news is to workaround this bug I think you just need to
> ensure that your maxBufferedDocs is less than mergeFactor *
> typical-number-of-docs-flushed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


RE: [jira] Commented: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge

Posted by Steven Parkes <st...@esseff.org>.
> Very long documents are useful for testing for anomalies, but they're 
> not so useful as retrieved documents, nor typical of applications.

That's what I thought, too. I'm kinda curious to see how Gutenberg
compares to Wikipedia for things like merge policy, in particular,
by-docs vs. by-bytes. But while it's a curiosity, I don't see that it has
all that much practical value, so I'm not sure I can get it to the top of
the to-do list.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: [jira] Commented: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge

Posted by Doug Cutting <cu...@apache.org>.
Steven Parkes wrote:
> And what about Project Gutenberg?
> 
> Wikipedia is going to have relatively short text, Gutenberg very long.

Very long documents are useful for testing for anomalies, but they're 
not so useful as retrieved documents, nor typical of applications.  Very 
long hits are awkward for users.  Book search engines usually operate 
best by breaking texts into small units (chapters, pages, 
overlapping windows, etc.) and searching those rather than the entire 
work, perhaps merging multiple hits from the same work in displayed 
results.  (See, e.g., California Digital Library's XTF system, built by 
Kirk Hastings using Lucene. http://www.cdlib.org/inside/projects/xtf/)

I think Wikipedia is a much more typical use of Lucene.

Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


RE: [jira] Commented: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge

Posted by Steven Parkes <st...@esseff.org>.
And what about Project Gutenberg?

Wikipedia is going to have relatively short text, Gutenberg very long.

-----Original Message-----
From: Steven Parkes [mailto:steven_parkes@esseff.org] 
Sent: Friday, March 23, 2007 2:37 PM
To: java-dev@lucene.apache.org
Subject: RE: [jira] Commented: (LUCENE-845) If you "flush by RAM usage"
then IndexWriter may over-merge

Well, since I want to look at the impact of merge policy, I'll look into
this.

Wikipedia is easy to download (bandwidth notwithstanding). The bz2'd dump of
the current English pages is 2.1G. That's certainly a lot of data. It
looks like the English is about 1.8M docs.  All languages together are
something like 21M now.

I was also thinking of the TREC data but that seems hard to come by?

-----Original Message-----
From: Grant Ingersoll [mailto:gsingers@apache.org] 
Sent: Friday, March 23, 2007 1:09 PM
To: java-dev@lucene.apache.org
Subject: Re: [jira] Commented: (LUCENE-845) If you "flush by RAM usage"
then IndexWriter may over-merge

Yeah, I didn't play yet with millions of documents.  We will need a  
bigger test collection, I think!  Although the benchmarker can add as  
many as you want from the same source, index compression will affect  
the results possibly more than a bigger collection with all unique docs.

Maybe it is time to look at adding Wikipedia as a test collection.  I  
think there are something like 18+ million docs in it.

On Mar 23, 2007, at 4:01 PM, Doug Cutting wrote:

> Michael McCandless wrote:
>> Also, one caveat: whenever #docs (21578 for Reuters) divided by
>> maxBuffered docs is less than mergeFactor, you will have no merges
>> take place during your runs.  This greatly skews the results.
>
> Also, my guess is that this index fits entirely in the buffer  
> cache. Things behave quite differently when segments are larger  
> than available memory and merging requires lots of disk i/o.
>
> Doug
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>

--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ



---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org




RE: [jira] Commented: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge

Posted by Steven Parkes <st...@esseff.org>.
Well, since I want to look at the impact of merge policy, I'll look into
this.

Wikipedia is easy to download (bandwidth notwithstanding). The bz2'd dump of
the current English pages is 2.1G. That's certainly a lot of data. It
looks like the English is about 1.8M docs.  All languages together are
something like 21M now.

I was also thinking of the TREC data but that seems hard to come by?

-----Original Message-----
From: Grant Ingersoll [mailto:gsingers@apache.org] 
Sent: Friday, March 23, 2007 1:09 PM
To: java-dev@lucene.apache.org
Subject: Re: [jira] Commented: (LUCENE-845) If you "flush by RAM usage"
then IndexWriter may over-merge

Yeah, I didn't play yet with millions of documents.  We will need a  
bigger test collection, I think!  Although the benchmarker can add as  
many as you want from the same source, index compression will affect  
the results possibly more than a bigger collection with all unique docs.

Maybe it is time to look at adding Wikipedia as a test collection.  I  
think there are something like 18+ million docs in it.

On Mar 23, 2007, at 4:01 PM, Doug Cutting wrote:

> Michael McCandless wrote:
>> Also, one caveat: whenever #docs (21578 for Reuters) divided by
>> maxBuffered docs is less than mergeFactor, you will have no merges
>> take place during your runs.  This greatly skews the results.
>
> Also, my guess is that this index fits entirely in the buffer  
> cache. Things behave quite differently when segments are larger  
> than available memory and merging requires lots of disk i/o.
>
> Doug
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>

--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ



---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org




Re: [jira] Commented: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge

Posted by Grant Ingersoll <gs...@apache.org>.
Yeah, I didn't play yet with millions of documents.  We will need a  
bigger test collection, I think!  Although the benchmarker can add as  
many as you want from the same source, index compression will affect  
the results possibly more than a bigger collection with all unique docs.

Maybe it is time to look at adding Wikipedia as a test collection.  I  
think there are something like 18+ million docs in it.

On Mar 23, 2007, at 4:01 PM, Doug Cutting wrote:

> Michael McCandless wrote:
>> Also, one caveat: whenever #docs (21578 for Reuters) divided by
>> maxBuffered docs is less than mergeFactor, you will have no merges
>> take place during your runs.  This greatly skews the results.
>
> Also, my guess is that this index fits entirely in the buffer  
> cache. Things behave quite differently when segments are larger  
> than available memory and merging requires lots of disk i/o.
>
> Doug
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>

--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ



---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: [jira] Commented: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge

Posted by Doug Cutting <cu...@apache.org>.
Michael McCandless wrote:
> Also, one caveat: whenever #docs (21578 for Reuters) divided by
> maxBuffered docs is less than mergeFactor, you will have no merges
> take place during your runs.  This greatly skews the results.

Also, my guess is that this index fits entirely in the buffer cache. 
Things behave quite differently when segments are larger than available 
memory and merging requires lots of disk i/o.

Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: [jira] Commented: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge

Posted by Michael McCandless <lu...@mikemccandless.com>.
"Grant Ingersoll" <gs...@apache.org> wrote:

> Your timing is ironic.  I was just running some benchmarks for  
> ApacheCon (using contrib/benchmarker) and noticed what I think are  
> similar happenings, so maybe you can validate my assumptions.  I'm  
> not sure if it is because I'm hitting RAM issues or not.
> 
> Below is the algorithm file for use w/ benchmarker.  To run it, save  
> the file, cd into contrib/benchmarker (make sure you get the latest
> commits) and run
> ant run-task -Dtask.mem=XXXXm -Dtask.alg=<path to file>
> 
> The basic idea is, there are ~21580 docs in the Reuters collection, so I wanted
> to run some experiments around them with different merge factors and  
> max.buffered.  Granted, some of the factors are ridiculous, but I  
> wanted to look at these a bit b/c you see people on the user list  
> from time to time talking/asking about setting really high numbers  
> for mergeFactor and maxBufferedDocs.
> 
> The sweet spot on my machine seems to be mergeFactor == 100,  
> maxBD=1000.  I ran with -Dtask.mem=1024M on a machine with 2gb of  
> RAM.  If I am understanding the numbers correctly, and what you are  
> arguing, this sweet spot happens to coincide approximately with the  
> amount of memory I gave the process.  I probably could play a little  
> bit more with options to reach the inflection point.  So, to some  
> extent, I think your approach for RAM based modeling is worth pursuing.

Interesting results!  An even higher maxBufferedDocs (10000 =
299.1 rec/s and 21580 = 271.9 rec/s, @ mergeFactor=100) gave you worse
performance even though those runs were able to complete (meaning you had
enough RAM to buffer all those docs).  Perhaps this is because GC had
to work harder?  So it seems like at some point the benefits of giving
more RAM taper off.

Also, one caveat: whenever #docs (21578 for Reuters) divided by
maxBuffered docs is less than mergeFactor, you will have no merges
take place during your runs.  This greatly skews the results.
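
(As a concrete illustration using the Reuters numbers above: with
maxBufferedDocs=10000 the 21,578 docs flush into only 3 segments, which
never reaches a mergeFactor of 10, let alone 100, so those rounds perform
no merges at all and their rec/s mostly measures raw in-RAM indexing.)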

I'm also struggling with this on LUCENE-843 because it makes it far
harder to do an apples-to-apples comparison.  With the patch for
LUCENE-843, many more docs can be buffered into a given fixed MB of RAM.
So then it flushes less often and may hit no merges (when the baseline
Lucene trunk does hit merges), or the opposite: it may hit a massive
merge close to the end when the baseline Lucene trunk did a few
small merges.  We sort of need some metric that can normalize for how
much "merge servicing" took place during a run, or something.

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: [jira] Commented: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge

Posted by Grant Ingersoll <gs...@apache.org>.
Hi Mike,

Your timing is ironic.  I was just running some benchmarks for  
ApacheCon (using contrib/benchmarker) and noticed what I think are  
similar happenings, so maybe you can validate my assumptions.  I'm  
not sure if it is because I'm hitting RAM issues or not.

Below is the algorithm file for use w/ benchmarker.  To run it, save  
the file, cd into contrib/benchmarker (make sure you get the latest
commits) and run
ant run-task -Dtask.mem=XXXXm -Dtask.alg=<path to file>

The basic idea is, there are ~21580 docs in the Reuters collection, so I wanted  
to run some experiments around them with different merge factors and  
max.buffered.  Granted, some of the factors are ridiculous, but I  
wanted to look at these a bit b/c you see people on the user list  
from time to time talking/asking about setting really high numbers  
for mergeFactor and maxBufferedDocs.

The sweet spot on my machine seems to be mergeFactor == 100,  
maxBD=1000.  I ran with -Dtask.mem=1024M on a machine with 2gb of  
RAM.  If I am understanding the numbers correctly, and what you are  
arguing, this sweet spot happens to coincide approximately with the  
amount of memory I gave the process.  I probably could play a little  
bit more with options to reach the inflection point.  So, to some  
extent, I think your approach for RAM based modeling is worth pursuing.

Mostly this is just food for thought.  I think what I am doing is  
correct, but am open to suggestions.

Here are my results:
  [java] ------------> Report Sum By (any) Name (6 about 66 out of 66)
     [java] Operation        round merge max.buffered runCnt recsPerRun   rec/s elapsedSec    avgUsedMem    avgTotalMem
     [java] Rounds_13            0    10           10      1     286039   183.0   1,563.30   956,043,840  1,065,484,288
     [java] Populate-Opt         -     -            -     13      22003   184.6   1,549.36   347,786,464    461,652,288
     [java] CreateIndex          -     -            -     13          1    43.9       0.30   103,676,920    380,309,824
     [java] MAddDocs_22000       -     -            -     13      22000   195.9   1,459.75   358,755,040    461,652,288
     [java] Optimize             -     -            -     13          1     0.1      89.29   365,944,832    461,652,288
     [java] CloseIndex           -     -            -     13          1   866.7       0.01   347,786,464    461,652,288


     [java] ------------> Report sum by Prefix (MAddDocs) and Round (13 about 13 out of 66)
     [java] Operation        round merge max.buffered runCnt recsPerRun   rec/s elapsedSec    avgUsedMem    avgTotalMem
     [java] MAddDocs_22000       0    10           10      1      22000   142.3     154.59     6,969,024     12,271,616
     [java] MAddDocs_22000       1    50           10      1      22000   159.7     137.75     7,517,728     12,861,440
     [java] MAddDocs_22000       2   100           10      1      22000   156.7     140.38     9,460,648     13,668,352
     [java] MAddDocs_22000       3  1000           10      1      22000   145.4     151.33    29,072,880     36,892,672
     [java] MAddDocs_22000       4  2000           10      1      22000   112.0     196.47    38,067,048     51,974,144
     [java] MAddDocs_22000       5    10          100      1      22000   161.9     135.89    40,896,336     51,974,144
     [java] MAddDocs_22000       6    10         1000      1      22000   266.9      82.44    53,033,616     71,766,016
     [java] MAddDocs_22000       7    10        10000      1      22000   288.9      76.14   392,512,032    422,649,856
     [java] MAddDocs_22000       8    10        21580      1      22000   272.0      80.89   708,970,944  1,065,484,288
     [java] MAddDocs_22000       9   100        21580      1      22000   271.9      80.91   767,851,072  1,065,484,288
     [java] MAddDocs_22000      10  1000        21580      1      22000   275.4      79.89   767,510,464  1,065,484,288
#Sweet Spot for this test
     [java] MAddDocs_22000      11   100         1000      1      22000   316.5      69.52   924,356,864  1,065,484,288
     [java] MAddDocs_22000      12   100        10000      1      22000   299.1      73.56   917,596,992  1,065,484,288


     [java] ------------> Report sum by Prefix (Populate-Opt) and Round (13 about 13 out of 66)
     [java] Operation        round merge max.buffered runCnt recsPerRun   rec/s elapsedSec    avgUsedMem    avgTotalMem
     [java] Populate-Opt         0    10           10      1      22003   136.0     161.75     7,331,992     12,271,616
     [java] Populate-Opt         1    50           10      1      22003   151.8     144.99     8,065,640     12,861,440
     [java] Populate-Opt         2   100           10      1      22003   149.6     147.06     9,927,872     13,668,352
     [java] Populate-Opt         3  1000           10      1      22003   138.9     158.38    32,094,624     36,892,672
     [java] Populate-Opt         4  2000           10      1      22003   105.8     207.91    41,058,208     51,974,144
     [java] Populate-Opt         5    10          100      1      22003   156.0     141.03    41,375,032     51,974,144
     [java] Populate-Opt         6    10         1000      1      22003   249.5      88.20    53,494,472     71,766,016
     [java] Populate-Opt         7    10        10000      1      22003   259.5      84.78   226,485,280    422,649,856
     [java] Populate-Opt         8    10        21580      1      22003   254.6      86.44   675,577,344  1,065,484,288
     [java] Populate-Opt         9   100        21580      1      22003   253.5      86.78   791,214,016  1,065,484,288
     [java] Populate-Opt        10  1000        21580      1      22003   258.7      85.06   790,837,440  1,065,484,288
     [java] Populate-Opt        11   100         1000      1      22003   289.9      75.89   887,718,272  1,065,484,288
     [java] Populate-Opt        12   100        10000      1      22003   271.3      81.09   956,043,840  1,065,484,288



#last value is more than all the docs in reuters
merge.factor=merge:10:100:1000:5000:10:10:10:10:100:1000
max.buffered=max.buffered:10:10:10:10:100:1000:10000:21580:21580:21580
compound=true

analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
directory=FSDirectory
#directory=RamDirectory

doc.stored=true
doc.tokenized=true
doc.term.vector=false
doc.add.log.step=1000

docs.dir=reuters-out
#docs.dir=reuters-111

#doc.maker=org.apache.lucene.benchmark.byTask.feeds.SimpleDocMaker
doc.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersDocMaker

#query.maker=org.apache.lucene.benchmark.byTask.feeds.SimpleQueryMaker
query.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersQueryMaker

# task at this depth or less would print when they start
task.max.depth.log=2

log.queries=true
#  
------------------------------------------------------------------------ 
-------------

{ "Rounds"

     ResetSystemErase

     { "Populate-Opt"
         CreateIndex
         { "MAddDocs" AddDoc > : 22000
         Optimize
         CloseIndex
     }

     NewRound

} : 10

RepSumByName
RepSumByPrefRound MAddDocs
RepSumByPrefRound Populate-Opt


On Mar 23, 2007, at 11:27 AM, Michael McCandless (JIRA) wrote:

>
>     [ https://issues.apache.org/jira/browse/LUCENE-845? 
> page=com.atlassian.jira.plugin.system.issuetabpanels:comment- 
> tabpanel#action_12483631 ]
>
> Michael McCandless commented on LUCENE-845:
> -------------------------------------------
>
> This bug is actually rather serious.
>
> If you set maxBufferedDocs to a very large number (on the expectation
> that it's not used since you will manually flush by RAM usage) then
> the merge policy will always merge the index down to 1 segment as soon
> as it hits mergeFactor segments.
>
> This will be an O(N^2) slowdown.  EG if based on RAM you are
> flushing every 100 docs, then at 1000 docs you will merge to 1
> segment.  Then at 1900 docs, you merge to 1 segment again.  At 2800,
> 3700, 4600, ... (every 900 docs) you keep merging to 1 segment.  Your
> indexing process will get very slow because every 900 documents the
> entire index is effectively being optimized.
>
> With LUCENE-843 I'm thinking we should deprecate maxBufferedDocs
> entirely and switch to flushing by RAM usage instead (you can always
> manually flush every N documents in your app if for some reason you
> need that).  But obviously we need to resolve this bug first.
>
>
>> If you "flush by RAM usage" then IndexWriter may over-merge
>> -----------------------------------------------------------
>>
>>                 Key: LUCENE-845
>>                 URL: https://issues.apache.org/jira/browse/LUCENE-845
>>             Project: Lucene - Java
>>          Issue Type: Bug
>>          Components: Index
>>    Affects Versions: 2.1
>>            Reporter: Michael McCandless
>>         Assigned To: Michael McCandless
>>            Priority: Minor
>>
>> I think a good way to maximize performance of Lucene's indexing for a
>> given amount of RAM is to flush (writer.flush()) the added documents
>> whenever the RAM usage (writer.ramSizeInBytes()) has crossed the max
>> RAM you can afford.
>> But, this can confuse the merge policy and cause over-merging, unless
>> you set maxBufferedDocs properly.
>> This is because the merge policy looks at the current maxBufferedDocs
>> to figure out which segments are level 0 (first flushed) or level 1
>> (merged from <mergeFactor> level 0 segments).
>> I'm not sure how to fix this.  Maybe we can look at net size (bytes)
>> of a segment and "infer" level from this?  Still we would have to be
>> resilient to the application suddenly increasing the RAM allowed.
>> The good news is to workaround this bug I think you just need to
>> ensure that your maxBufferedDocs is less than mergeFactor *
>> typical-number-of-docs-flushed.
>
> -- 
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>

--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ



---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12483631 ] 

Michael McCandless commented on LUCENE-845:
-------------------------------------------

This bug is actually rather serious.

If you set maxBufferedDocs to a very large number (on the expectation
that it's not used since you will manually flush by RAM usage) then
the merge policy will always merge the index down to 1 segment as soon
as it hits mergeFactor segments.

This will be an O(N^2) slowdown.  EG if based on RAM you are
flushing every 100 docs, then at 1000 docs you will merge to 1
segment.  Then at 1900 docs, you merge to 1 segment again.  At 2800,
3700, 4600, ... (every 900 docs) you keep merging to 1 segment.  Your
indexing process will get very slow because every 900 documents the
entire index is effectively being optimized.
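
To put rough (purely illustrative) numbers on that: with 100-doc flushes
and mergeFactor=10, a merge of the entire index happens at roughly 1000,
1900, 2800, ... docs, i.e. about N/900 merges by the time the index holds
N docs, each rewriting on average ~N/2 docs.  That's on the order of
N^2/1800 document copies spent on merging alone, versus the O(N log N)
copies a properly levelled merge policy performs.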

With LUCENE-843 I'm thinking we should deprecate maxBufferedDocs
entirely and switch to flushing by RAM usage instead (you can always
manually flush every N documents in your app if for some reason you
need that).  But obviously we need to resolve this bug first.


> If you "flush by RAM usage" then IndexWriter may over-merge
> -----------------------------------------------------------
>
>                 Key: LUCENE-845
>                 URL: https://issues.apache.org/jira/browse/LUCENE-845
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>
> I think a good way to maximize performance of Lucene's indexing for a
> given amount of RAM is to flush (writer.flush()) the added documents
> whenever the RAM usage (writer.ramSizeInBytes()) has crossed the max
> RAM you can afford.
> But, this can confuse the merge policy and cause over-merging, unless
> you set maxBufferedDocs properly.
> This is because the merge policy looks at the current maxBufferedDocs
> to figure out which segments are level 0 (first flushed) or level 1
> (merged from <mergeFactor> level 0 segments).
> I'm not sure how to fix this.  Maybe we can look at net size (bytes)
> of a segment and "infer" level from this?  Still we would have to be
> resilient to the application suddenly increasing the RAM allowed.
> The good news is to workaround this bug I think you just need to
> ensure that your maxBufferedDocs is less than mergeFactor *
> typical-number-of-docs-flushed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12520271 ] 

Yonik Seeley commented on LUCENE-845:
-------------------------------------

Is there a change in file descriptor use if you don't use setRamBufferSizeMB?


> If you "flush by RAM usage" then IndexWriter may over-merge
> -----------------------------------------------------------
>
>                 Key: LUCENE-845
>                 URL: https://issues.apache.org/jira/browse/LUCENE-845
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-845.patch
>
>
> I think a good way to maximize performance of Lucene's indexing for a
> given amount of RAM is to flush (writer.flush()) the added documents
> whenever the RAM usage (writer.ramSizeInBytes()) has crossed the max
> RAM you can afford.
> But, this can confuse the merge policy and cause over-merging, unless
> you set maxBufferedDocs properly.
> This is because the merge policy looks at the current maxBufferedDocs
> to figure out which segments are level 0 (first flushed) or level 1
> (merged from <mergeFactor> level 0 segments).
> I'm not sure how to fix this.  Maybe we can look at net size (bytes)
> of a segment and "infer" level from this?  Still we would have to be
> resilient to the application suddenly increasing the RAM allowed.
> The good news is to workaround this bug I think you just need to
> ensure that your maxBufferedDocs is less than mergeFactor *
> typical-number-of-docs-flushed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12520378 ] 

Michael McCandless commented on LUCENE-845:
-------------------------------------------

> I think this would be great. It's always been a pet peeve of mine
> that even in low pressure/activity environments, there is often a
> delay from write to read.

I'll open a new issue.

> Sounds like this would help take most of the work/risk off the
> developer.

Precisely!  Out of the box we can have very low latency, with IndexWriter
flushing single-doc segments and never paying the O(N^2) cost of merging
those segments down so that the index is "at all moments" ready for an
IndexReader to open, while IndexReader can load such an index (or re-open
it by loading only the "new" segments) and very quickly reduce the #
of segments so that searching is still fast.



> If you "flush by RAM usage" then IndexWriter may over-merge
> -----------------------------------------------------------
>
>                 Key: LUCENE-845
>                 URL: https://issues.apache.org/jira/browse/LUCENE-845
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-845.patch
>
>
> I think a good way to maximize performance of Lucene's indexing for a
> given amount of RAM is to flush (writer.flush()) the added documents
> whenever the RAM usage (writer.ramSizeInBytes()) has crossed the max
> RAM you can afford.
> But, this can confuse the merge policy and cause over-merging, unless
> you set maxBufferedDocs properly.
> This is because the merge policy looks at the current maxBufferedDocs
> to figure out which segments are level 0 (first flushed) or level 1
> (merged from <mergeFactor> level 0 segments).
> I'm not sure how to fix this.  Maybe we can look at net size (bytes)
> of a segment and "infer" level from this?  Still we would have to be
> resilient to the application suddenly increasing the RAM allowed.
> The good news is to workaround this bug I think you just need to
> ensure that your maxBufferedDocs is less than mergeFactor *
> typical-number-of-docs-flushed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge

Posted by "Steven Parkes (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492814 ] 

Steven Parkes commented on LUCENE-845:
--------------------------------------

Following up on this, the idea is basically that segments ought to be created/merged either by-segment-size or by-doc-count, but not by a mixture? That wouldn't be surprising ...

It does impact the APIs, though. It's easy enough to imagine, with factored merge policies, both by-doc-count and by-segment-size policies. But the initial segment creation is going to be handled by IndexWriter, so you have to manually make sure you don't set that algorithm and the merge policy in conflict. Not great, but I don't have any great ideas. Could put in an API handshake, but I'm not sure if it's worth the mess?

Also, it sounds like, so far, there's no good way of managing parallel-reader setups w/ by-segment-size algorithms, since the algorithm for creating/merging segments has to be globally consistent, not just per index, right?

If that is right, what does that say about making by-segment-size the default? It's gonna break (as in bad results) people relying on the current behavior who don't change their code. Is there a community consensus on this? It's not really an API change that would cause a compile/class-load failure, but in some ways, it's worse ...

> If you "flush by RAM usage" then IndexWriter may over-merge
> -----------------------------------------------------------
>
>                 Key: LUCENE-845
>                 URL: https://issues.apache.org/jira/browse/LUCENE-845
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>
> I think a good way to maximize performance of Lucene's indexing for a
> given amount of RAM is to flush (writer.flush()) the added documents
> whenever the RAM usage (writer.ramSizeInBytes()) has crossed the max
> RAM you can afford.
> But, this can confuse the merge policy and cause over-merging, unless
> you set maxBufferedDocs properly.
> This is because the merge policy looks at the current maxBufferedDocs
> to figure out which segments are level 0 (first flushed) or level 1
> (merged from <mergeFactor> level 0 segments).
> I'm not sure how to fix this.  Maybe we can look at net size (bytes)
> of a segment and "infer" level from this?  Still we would have to be
> resilient to the application suddenly increasing the RAM allowed.
> The good news is to workaround this bug I think you just need to
> ensure that your maxBufferedDocs is less than mergeFactor *
> typical-number-of-docs-flushed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12520655 ] 

Yonik Seeley commented on LUCENE-845:
-------------------------------------

> But if writer flushes frequently and reader re-opens less frequently
> then it's better to merge on open.

Seems like an odd case though, because if readers aren't opened frequently, then it's a waste to flush small segments so often (and much slower for the writer than not doing so).

> If you "flush by RAM usage" then IndexWriter may over-merge
> -----------------------------------------------------------
>
>                 Key: LUCENE-845
>                 URL: https://issues.apache.org/jira/browse/LUCENE-845
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-845.patch
>
>
> I think a good way to maximize performance of Lucene's indexing for a
> given amount of RAM is to flush (writer.flush()) the added documents
> whenever the RAM usage (writer.ramSizeInBytes()) has crossed the max
> RAM you can afford.
> But, this can confuse the merge policy and cause over-merging, unless
> you set maxBufferedDocs properly.
> This is because the merge policy looks at the current maxBufferedDocs
> to figure out which segments are level 0 (first flushed) or level 1
> (merged from <mergeFactor> level 0 segments).
> I'm not sure how to fix this.  Maybe we can look at net size (bytes)
> of a segment and "infer" level from this?  Still we would have to be
> resilient to the application suddenly increasing the RAM allowed.
> The good news is to workaround this bug I think you just need to
> ensure that your maxBufferedDocs is less than mergeFactor *
> typical-number-of-docs-flushed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12526435 ] 

Yonik Seeley commented on LUCENE-845:
-------------------------------------

Thanks for adding minMergeMB, the default seems fine.
Should minMergeDocs default to maxBufferedDocs (that should yield the old behavior)?
Although 1000 isn't bad... much better to slow indexing a little in the odd app than to break it by running it out of descriptors.

> If you "flush by RAM usage" then IndexWriter may over-merge
> -----------------------------------------------------------
>
>                 Key: LUCENE-845
>                 URL: https://issues.apache.org/jira/browse/LUCENE-845
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-845.patch
>
>
> I think a good way to maximize performance of Lucene's indexing for a
> given amount of RAM is to flush (writer.flush()) the added documents
> whenever the RAM usage (writer.ramSizeInBytes()) has crossed the max
> RAM you can afford.
> But, this can confuse the merge policy and cause over-merging, unless
> you set maxBufferedDocs properly.
> This is because the merge policy looks at the current maxBufferedDocs
> to figure out which segments are level 0 (first flushed) or level 1
> (merged from <mergeFactor> level 0 segments).
> I'm not sure how to fix this.  Maybe we can look at net size (bytes)
> of a segment and "infer" level from this?  Still we would have to be
> resilient to the application suddenly increasing the RAM allowed.
> The good news is to workaround this bug I think you just need to
> ensure that your maxBufferedDocs is less than mergeFactor *
> typical-number-of-docs-flushed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-845:
--------------------------------------

    Fix Version/s: 2.3

> If you "flush by RAM usage" then IndexWriter may over-merge
> -----------------------------------------------------------
>
>                 Key: LUCENE-845
>                 URL: https://issues.apache.org/jira/browse/LUCENE-845
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-845.patch
>
>
> I think a good way to maximize performance of Lucene's indexing for a
> given amount of RAM is to flush (writer.flush()) the added documents
> whenever the RAM usage (writer.ramSizeInBytes()) has crossed the max
> RAM you can afford.
> But, this can confuse the merge policy and cause over-merging, unless
> you set maxBufferedDocs properly.
> This is because the merge policy looks at the current maxBufferedDocs
> to figure out which segments are level 0 (first flushed) or level 1
> (merged from <mergeFactor> level 0 segments).
> I'm not sure how to fix this.  Maybe we can look at net size (bytes)
> of a segment and "infer" level from this?  Still we would have to be
> resilient to the application suddenly increasing the RAM allowed.
> The good news is to workaround this bug I think you just need to
> ensure that your maxBufferedDocs is less than mergeFactor *
> typical-number-of-docs-flushed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Issue Comment Edited: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12526435 ] 

yseeley@gmail.com edited comment on LUCENE-845 at 9/11/07 4:57 AM:
--------------------------------------------------------------

Thanks for adding minMergeMB, the default seems fine.
Should minMergeDocs default to maxBufferedDocs (that should yield the old behavior)?
Although 1000 isn't bad... much better to slow indexing a little in the odd app than to break it by running it out of descriptors.

      was (Author: yseeley@gmail.com):
    Thanks for adding minMergeMB, the default seems fine.
Shound minMergeDocs default to maxBufferedDocs (that should yield the old behavior)?
Although 1000 isn't bad... much to slow indexing a little in the odd app than to break it by running it out of descriptors.
  
> If you "flush by RAM usage" then IndexWriter may over-merge
> -----------------------------------------------------------
>
>                 Key: LUCENE-845
>                 URL: https://issues.apache.org/jira/browse/LUCENE-845
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-845.patch
>
>
> I think a good way to maximize performance of Lucene's indexing for a
> given amount of RAM is to flush (writer.flush()) the added documents
> whenever the RAM usage (writer.ramSizeInBytes()) has crossed the max
> RAM you can afford.
> But, this can confuse the merge policy and cause over-merging, unless
> you set maxBufferedDocs properly.
> This is because the merge policy looks at the current maxBufferedDocs
> to figure out which segments are level 0 (first flushed) or level 1
> (merged from <mergeFactor> level 0 segments).
> I'm not sure how to fix this.  Maybe we can look at net size (bytes)
> of a segment and "infer" level from this?  Still we would have to be
> resilient to the application suddenly increasing the RAM allowed.
> The good news is to workaround this bug I think you just need to
> ensure that your maxBufferedDocs is less than mergeFactor *
> typical-number-of-docs-flushed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge

Posted by "Steven Parkes (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12520268 ] 

Steven Parkes commented on LUCENE-845:
--------------------------------------

I understand the merge problem but I'm still concerned about the increased number of file descriptors. Is this a concern?

It seems like there are ways of approaching this that might be able to fix both problems?

For example, right now (pre-fix), if you have maxBufferedDocs set to 1000, mergeFactor set to 10, and add (for the sake of an obvious example) 10 single-doc segments, it's going to do a merge to one segment of size 1010, which is not great.

One solution to this would be in cases like this to merge the small segments to one but not include the big segments. So you get [1000 10] where the last segment keeps growing until it reaches 1000. This does more copies than the current case, but always on small segments, with the advantage of a lower bound on the number of file descriptors?

Of course, if no one's worried about this "moderate" (not exactly large, not exactly small) change in file descriptor usage, then it's not a big deal. It doesn't impact my work but I'm not sure about the greater community.

> If you "flush by RAM usage" then IndexWriter may over-merge
> -----------------------------------------------------------
>
>                 Key: LUCENE-845
>                 URL: https://issues.apache.org/jira/browse/LUCENE-845
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-845.patch
>
>
> I think a good way to maximize performance of Lucene's indexing for a
> given amount of RAM is to flush (writer.flush()) the added documents
> whenever the RAM usage (writer.ramSizeInBytes()) has crossed the max
> RAM you can afford.
> But, this can confuse the merge policy and cause over-merging, unless
> you set maxBufferedDocs properly.
> This is because the merge policy looks at the current maxBufferedDocs
> to figure out which segments are level 0 (first flushed) or level 1
> (merged from <mergeFactor> level 0 segments).
> I'm not sure how to fix this.  Maybe we can look at net size (bytes)
> of a segment and "infer" level from this?  Still we would have to be
> resilient to the application suddenly increasing the RAM allowed.
> The good news is to workaround this bug I think you just need to
> ensure that your maxBufferedDocs is less than mergeFactor *
> typical-number-of-docs-flushed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12485722 ] 

Michael McCandless commented on LUCENE-845:
-------------------------------------------

Just recapping some following discussion from java-dev ...

The current merge policy can be thought of logically as two different
steps:

  1. How to determine the "level" of each segment in the index.

  2. How & when to pick which level N segments into a level N+1
     segment.

The current policy determines a segment's level by looking at the doc
count in the segment as well as the current maxBufferedDocs, which is
very problematic when you "flush by RAM usage" instead.  This Jira
issue, then, is proposing to instead look at overall byte size of a
segment for determining its level, while keeping step 2. above.

However, I would propose we also fix LUCENE-854 (which addresses step
2 above and not step 1) at the same time, as a single merge policy,
and maybe at some point in the future make this new merge policy the
default merge policy.


> If you "flush by RAM usage" then IndexWriter may over-merge
> -----------------------------------------------------------
>
>                 Key: LUCENE-845
>                 URL: https://issues.apache.org/jira/browse/LUCENE-845
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>
> I think a good way to maximize performance of Lucene's indexing for a
> given amount of RAM is to flush (writer.flush()) the added documents
> whenever the RAM usage (writer.ramSizeInBytes()) has crossed the max
> RAM you can afford.
> But, this can confuse the merge policy and cause over-merging, unless
> you set maxBufferedDocs properly.
> This is because the merge policy looks at the current maxBufferedDocs
> to figure out which segments are level 0 (first flushed) or level 1
> (merged from <mergeFactor> level 0 segments).
> I'm not sure how to fix this.  Maybe we can look at net size (bytes)
> of a segment and "infer" level from this?  Still we would have to be
> resilient to the application suddenly increasing the RAM allowed.
> The good news is to workaround this bug I think you just need to
> ensure that your maxBufferedDocs is less than mergeFactor *
> typical-number-of-docs-flushed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12493065 ] 

Michael McCandless commented on LUCENE-845:
-------------------------------------------

> Following up on this, it's basically the idea that segments ought to be created/merged either by-segment-size or by-doc-count, but not by a mixture? That wouldn't be surprising ...

Right, but we need the refactored merge policy framework in place
first.  I'll mark this issue dependent on LUCENE-847.

> It does impact the APIs, though. It's easy enough to imagine, with factored merge policies, both by-doc-count and by-segment-size policies. But the initial segment creation is going to be handled by IndexWriter, so you have to manually make sure you don't set that algorithm and the merge policy in conflict. Not great, but I don't have any great ideas. Could put in an API handshake, but I'm not sure if it's worth the mess?

Good question.  I think it's OK (at least for our first go at this --
progress not perfection!) to expect the developer to choose a merge
policy and then to use IndexWriter in a way that's "consistent" with
that merge policy?  I think it's going to get too complex if we try to
formally couple "when to flush/commit" with the merge policy?

But, I think the default merge policy needs to be resilient to people
doing things like changing maxBufferedDocs/mergeFactor partway through
an index, calling flush() whenever they want, etc.  The merge policy
today is not resilient to these "normal" usages of IndexWriter.  So I
think we need to do something here even without the pressure from
LUCENE-843.

> Also, it sounds like, so far, there's no good way of managing parallel-reader setups w/by-segment-size algorithms, since the algorithm for creating/merging segments has to be globally consistent, not just per index, right?

Right.  We clearly need to keep the current "by doc" merge policy
easily available for this use case.

> If that is right, what does that say about making by-segment-size the default? It's gonna break (as in bad results) people relying on that behavior that don't change their code. Is there a community consensus on this? It's not really an API change that would cause a compile/class-load failure, but in some ways, it's worse ...

I think there are actually two questions here:

  1) What exactly makes for a good default merge policy?

     I think the merge policy we have today has some limitations:

       - It's not resilient to "normal" usage of the public APIs in
         IndexWriter.  If you call flush() yourself, if you change
         maxBufferedDocs (and maybe mergeFactor?) partway through an
         index, etc, you can cause disastrous amounts of over-merging
         (that's this issue).

         I think the default policy should be entirely resilient to
         any usage of the public IndexWriter APIs.

       - Default merge policy should strive to minimize net cost
         (amortized over time) of merging, but the current one
         doesn't:

         - When docs differ in size (frequently the case) it will be
           too costly in CPU/IO consumption because small segments are
           merged with large ones.

         - It does too much work in advance (too much "pay it
           forward").  I don't think a merge policy should
           "inadvertently optimize" (I opened LUCENE-854 to describe
           this).

       - It blocks LUCENE-843 (flushing by RAM usage).

         I think Lucene "out of the box" should give you good indexing
         performance.  You should not have to do extra tuning to get
         substantially better performance.  The best way to get that
         is to "flush by RAM" (with LUCENE-843).  But current merge
         policy prevents this (due to this issue).

  2) Can we change the default merge policy?

     I sure hope so, given the issues above.

     I think the majority of Lucene users do the simple "create a
     writer, add/delete docs, close writer, while reader(s) use the
     same index" type of usage and so would benefit by the gained
     performance of LUCENE-843 and LUCENE-854.

     I think (but may be wrong!) it's a minority who use
     ParallelReader and therefore have a reliance on the specific
     merge policy we use today?

Ideally we first commit the "decouple merge policy from IndexWriter"
(LUCENE-847), then we would make a new merge policy that resolves this
issue and LUCENE-854, and make it the default policy.  I think this
policy would look at size (in bytes) of each segment (perhaps
proportionally reducing # bytes according to pending deletes against
that segment), and would merge any adjacent segments (not just
rightmost ones) that are "the most similar" in size.  I think it would
merge N (configurable) at a time and at no time would inadvertently
optimize.

This would mean users of ParallelReader on upgrading to this would
need to change their merge policy to the legacy "merge by doc count"
policy.
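
To make the "most similar in size" selection concrete, here is a rough
helper-method sketch (purely illustrative, not from any patch) of choosing
which adjacent run of mergeFactor segments to merge, given segment sizes in
index order:

    // Pick the run of `mergeFactor` adjacent segments whose sizes are most
    // similar, measured here by the smallest max/min ratio within the run.
    // sizes[i] is the size (bytes, or docs) of segment i, in index order.
    static int pickMostSimilarRun(long[] sizes, int mergeFactor) {
      int bestStart = -1;
      double bestRatio = Double.MAX_VALUE;
      for (int start = 0; start + mergeFactor <= sizes.length; start++) {
        long min = Long.MAX_VALUE;
        long max = 0;
        for (int i = start; i < start + mergeFactor; i++) {
          min = Math.min(min, sizes[i]);
          max = Math.max(max, sizes[i]);
        }
        double ratio = (double) max / (double) Math.max(1, min);
        if (ratio < bestRatio) {
          bestRatio = ratio;
          bestStart = start;
        }
      }
      return bestStart;  // -1 if there are fewer than mergeFactor segments
    }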

> If you "flush by RAM usage" then IndexWriter may over-merge
> -----------------------------------------------------------
>
>                 Key: LUCENE-845
>                 URL: https://issues.apache.org/jira/browse/LUCENE-845
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>
> I think a good way to maximize performance of Lucene's indexing for a
> given amount of RAM is to flush (writer.flush()) the added documents
> whenever the RAM usage (writer.ramSizeInBytes()) has crossed the max
> RAM you can afford.
> But, this can confuse the merge policy and cause over-merging, unless
> you set maxBufferedDocs properly.
> This is because the merge policy looks at the current maxBufferedDocs
> to figure out which segments are level 0 (first flushed) or level 1
> (merged from <mergeFactor> level 0 segments).
> I'm not sure how to fix this.  Maybe we can look at net size (bytes)
> of a segment and "infer" level from this?  Still we would have to be
> resilient to the application suddenly increasing the RAM allowed.
> The good news is to workaround this bug I think you just need to
> ensure that your maxBufferedDocs is less than mergeFactor *
> typical-number-of-docs-flushed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12520344 ] 

Michael McCandless commented on LUCENE-845:
-------------------------------------------

> > Here's an idea: maybe we can accept the O(N^2) merge cost, when
> > the segments are "small"?
>
> That's basically the underlying idea I was trying to get at.

Ahh, good!  We agree :)

> If you "flush by RAM usage" then IndexWriter may over-merge
> -----------------------------------------------------------
>
>                 Key: LUCENE-845
>                 URL: https://issues.apache.org/jira/browse/LUCENE-845
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-845.patch
>
>
> I think a good way to maximize performance of Lucene's indexing for a
> given amount of RAM is to flush (writer.flush()) the added documents
> whenever the RAM usage (writer.ramSizeInBytes()) has crossed the max
> RAM you can afford.
> But, this can confuse the merge policy and cause over-merging, unless
> you set maxBufferedDocs properly.
> This is because the merge policy looks at the current maxBufferedDocs
> to figure out which segments are level 0 (first flushed) or level 1
> (merged from <mergeFactor> level 0 segments).
> I'm not sure how to fix this.  Maybe we can look at net size (bytes)
> of a segment and "infer" level from this?  Still we would have to be
> resilient to the application suddenly increasing the RAM allowed.
> The good news is to workaround this bug I think you just need to
> ensure that your maxBufferedDocs is less than mergeFactor *
> typical-number-of-docs-flushed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12520667 ] 

Michael McCandless commented on LUCENE-845:
-------------------------------------------

Agreed.  OK, I think this is a dead end: it adds complexity and won't
help in "typical" uses of Lucene.

So ... my plan of action is to assess the "actual" O(N^2) cost for
IndexWriter to keep the tail short, add a parameter to LogMergePolicy
so that it "floors" the level and always merges segments less than
this floor together, despite the O(N^2) cost.  And then pick a
reasonable default for this floor.
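
Roughly, the "floor" would clamp the computed level, something like the
following sketch (names and shape are illustrative, not the actual
LogMergePolicy change):

    // Any segment smaller than minMergeSize is treated as if it were at the
    // floor level, so all tiny segments fall into one level and get merged
    // together (accepting the O(N^2) cost below the floor).
    double levelOf(long segmentSize, long minMergeSize, int mergeFactor) {
      double norm = Math.log(mergeFactor);
      double level = Math.log(Math.max(1, segmentSize)) / norm;
      double floor = Math.log(Math.max(1, minMergeSize)) / norm;
      return Math.max(level, floor);
    }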


> If you "flush by RAM usage" then IndexWriter may over-merge
> -----------------------------------------------------------
>
>                 Key: LUCENE-845
>                 URL: https://issues.apache.org/jira/browse/LUCENE-845
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-845.patch
>
>
> I think a good way to maximize performance of Lucene's indexing for a
> given amount of RAM is to flush (writer.flush()) the added documents
> whenever the RAM usage (writer.ramSizeInBytes()) has crossed the max
> RAM you can afford.
> But, this can confuse the merge policy and cause over-merging, unless
> you set maxBufferedDocs properly.
> This is because the merge policy looks at the current maxBufferedDocs
> to figure out which segments are level 0 (first flushed) or level 1
> (merged from <mergeFactor> level 0 segments).
> I'm not sure how to fix this.  Maybe we can look at net size (bytes)
> of a segment and "infer" level from this?  Still we would have to be
> resilient to the application suddenly increasing the RAM allowed.
> The good news is to workaround this bug I think you just need to
> ensure that your maxBufferedDocs is less than mergeFactor *
> typical-number-of-docs-flushed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12520649 ] 

Michael McCandless commented on LUCENE-845:
-------------------------------------------

> Merging small segments in the reader seems like a cool idea on it's
> own.  But if it's an acceptable hit to merge in the reader, why is
> it not in the writer?

Good point.  I think it comes down to how often we expect readers to
refresh vs writers flushing.

If indeed it's 1 to 1 (as the truest "low latency" app, or a "single
writer + reader with no separation", would in fact be), then the writer
should merge them: although it pays an O(N^2) cost to keep the tail
"short", merging on open would cost even more.

But if writer flushes frequently and reader re-opens less frequently
then it's better to merge on open.

Of course, if the O(N^2) cost for IndexWriter to keep a short tail is
in practice not too costly then we should just leave this in
IndexWriter.  I still need to run that test for LUCENE-845.

> Also, would this tail merging on an open be able to reduce the peak
> number of file descriptors?  It seems like to do so, the tail would
> have to be merged *before* other index files were opened, further
> complicating matters.

Right, I think to keep peak descriptor usage capped we must merge the
tail first, then open the remaining segments, which definitely
complicates things...


> If you "flush by RAM usage" then IndexWriter may over-merge
> -----------------------------------------------------------
>
>                 Key: LUCENE-845
>                 URL: https://issues.apache.org/jira/browse/LUCENE-845
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-845.patch
>
>
> I think a good way to maximize performance of Lucene's indexing for a
> given amount of RAM is to flush (writer.flush()) the added documents
> whenever the RAM usage (writer.ramSizeInBytes()) has crossed the max
> RAM you can afford.
> But, this can confuse the merge policy and cause over-merging, unless
> you set maxBufferedDocs properly.
> This is because the merge policy looks at the current maxBufferedDocs
> to figure out which segments are level 0 (first flushed) or level 1
> (merged from <mergeFactor> level 0 segments).
> I'm not sure how to fix this.  Maybe we can look at net size (bytes)
> of a segment and "infer" level from this?  Still we would have to be
> resilient to the application suddenly increasing the RAM allowed.
> The good news is to workaround this bug I think you just need to
> ensure that your maxBufferedDocs is less than mergeFactor *
> typical-number-of-docs-flushed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12520334 ] 

Yonik Seeley commented on LUCENE-845:
-------------------------------------

You may avoid the cost of a bunch of small merges, but then you pay the price in searching performance.  I'm not sure that's the right tradeoff because if someone wanted to optimize for indexing performance, they would do more in batches.

How does this work when flushing by MB?  If you set setRamBufferSizeMB(32), are you guaranteed that you never have more than 10 segments less than 32MB (ignoring LEVEL_LOG_SPAN for now) if mergeFactor is 10?

Almost seems like we need a minSegmentSize parameter too - using setRamBufferSizeMB confuses two different but related issues.

> If you "flush by RAM usage" then IndexWriter may over-merge
> -----------------------------------------------------------
>
>                 Key: LUCENE-845
>                 URL: https://issues.apache.org/jira/browse/LUCENE-845
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-845.patch
>
>
> I think a good way to maximize performance of Lucene's indexing for a
> given amount of RAM is to flush (writer.flush()) the added documents
> whenever the RAM usage (writer.ramSizeInBytes()) has crossed the max
> RAM you can afford.
> But, this can confuse the merge policy and cause over-merging, unless
> you set maxBufferedDocs properly.
> This is because the merge policy looks at the current maxBufferedDocs
> to figure out which segments are level 0 (first flushed) or level 1
> (merged from <mergeFactor> level 0 segments).
> I'm not sure how to fix this.  Maybe we can look at net size (bytes)
> of a segment and "infer" level from this?  Still we would have to be
> resilient to the application suddenly increasing the RAM allowed.
> The good news is to workaround this bug I think you just need to
> ensure that your maxBufferedDocs is less than mergeFactor *
> typical-number-of-docs-flushed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12520343 ] 

Michael McCandless commented on LUCENE-845:
-------------------------------------------


> You may avoid the cost of a bunch of small merges, but then you pay
> the price in searching performance. I'm not sure that's the right
> tradeoff because if someone wanted to optimize for indexing
> performance, they would do more in batches.

Agreed.

It's like we would want to run "partial optimize" (ie, merge the tail
of "small" segments) on demand, only when a reader is about to
refresh.

Or here's another random idea: maybe IndexReaders should load the tail
of "small segments" into a RAMDirectory, for each one.  Ie, an
IndexReader is given a RAM buffer "budget" and it spends it on the
numerous small segments in the index...?

> How does this work when flushing by MB? If you set
> setRamBufferSizeMB(32), are you guaranteed that you never have more
> than 10 segments less than 32MB (ignoring LEVEL_LOG_SPAN for now) if
> mergeFactor is 10?

No, we have the same challenge of avoiding O(N^2) merge cost.  When
merging by "byte size" of the segments, I don't look at the current
RAM buffer size of the writer.

I feel like there should be a strong separation of "flush params" from
"merge params".

> Almost seems like we need a minSegmentSize parameter too - using
> setRamBufferSizeMB confuses two different but related issues.

Exactly!  I'm thinking that I add "minSegmentSize" to the
LogMergePolicy, which is separate from "maxBufferedDocs" and
"ramBufferSizeMB".  And, we default it to values that seem like an
"acceptable" tradeoff of the cost of O(N^2) merging (based on tests I
will run) vs the cost of slowdown to readers...

I'll run some perf tests.  O(N^2) should be acceptable for certain
segment sizes....


> If you "flush by RAM usage" then IndexWriter may over-merge
> -----------------------------------------------------------
>
>                 Key: LUCENE-845
>                 URL: https://issues.apache.org/jira/browse/LUCENE-845
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-845.patch
>
>
> I think a good way to maximize performance of Lucene's indexing for a
> given amount of RAM is to flush (writer.flush()) the added documents
> whenever the RAM usage (writer.ramSizeInBytes()) has crossed the max
> RAM you can afford.
> But, this can confuse the merge policy and cause over-merging, unless
> you set maxBufferedDocs properly.
> This is because the merge policy looks at the current maxBufferedDocs
> to figure out which segments are level 0 (first flushed) or level 1
> (merged from <mergeFactor> level 0 segments).
> I'm not sure how to fix this.  Maybe we can look at net size (bytes)
> of a segment and "infer" level from this?  Still we would have to be
> resilient to the application suddenly increasing the RAM allowed.
> The good news is to workaround this bug I think you just need to
> ensure that your maxBufferedDocs is less than mergeFactor *
> typical-number-of-docs-flushed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12520328 ] 

Michael McCandless commented on LUCENE-845:
-------------------------------------------


> Is there a change in file descriptor use if you don't use setRamBufferSizeMB?

Yes.  EG, if you set maxBufferedDocs to 1000 but then flush after
every added doc, and you add 1000 docs, with the current merge policy,
every 10 flushes you will merge all segments together.  Ie, first
segment has 10 docs, then 20, 30, 40, 50, ..., 1000.  This is where
O(N^2) cost on merging comes from.  But, you will never have more than
10 segments in your index.

Whereas the new merge policy will make levels (segments of size 100,
10, 1) and merge only segments from the same level together.  So merge
cost will be much less (not O(N^2)), but you will have a higher maximum
number of segments in the index (up to 1 + (mergeFactor-1) *
log_mergeFactor(numDocs)), or 28 segments in this example (I think).

Basically the new merge policy tries to make levels "all the way
down" rather than forcefully stopping when the levels get smaller than
maxBufferedDocs, to avoid the O(N^2) merge cost.
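
Plugging the example's numbers into that bound as a quick check:

  1 + (mergeFactor - 1) * log_mergeFactor(numDocs)
    = 1 + (10 - 1) * log_10(1000)
    = 1 + 9 * 3
    = 28 segments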

> One solution to this would be in cases like this to merge the small
> segments to one but not include the big segments. So you get [1000
> 10] where the last segment keeps growing until it reaches 1000. This
> does more copies than the current case, but always on small
> segments, with the advantage of a lower bound on the number of file
> descriptors?

I'm not sure that helps?  Because that "small segment" will have to
grow bit by bit up to 1000 (causing the O(N^2) cost).

Note that the goal here is to be able to switch to flushing by RAM
buffer size instead of docCount (and also merge by byte-size of
segments not doc count), by default, in IndexWriter.  But, even once
we do that, if you always flush tiny segments the new merge policy
will still build levels "all the way down".

Here's an idea: maybe we can accept the O(N^2) merge cost, when the
segments are "small"?  Ie, maybe doing 100 sub-optimal merges (in the
example above) does not amount to that much actual cost in practice.
(After all nobody has complained about this :).

I will run some tests.  Clearly at some point the O(N^2) cost will
dominate your indexing time, but maybe we can set a "rough" docCount
below which all segments are counted as a single level, without taking
too much of an indexing performance hit.


> If you "flush by RAM usage" then IndexWriter may over-merge
> -----------------------------------------------------------
>
>                 Key: LUCENE-845
>                 URL: https://issues.apache.org/jira/browse/LUCENE-845
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-845.patch
>
>
> I think a good way to maximize performance of Lucene's indexing for a
> given amount of RAM is to flush (writer.flush()) the added documents
> whenever the RAM usage (writer.ramSizeInBytes()) has crossed the max
> RAM you can afford.
> But, this can confuse the merge policy and cause over-merging, unless
> you set maxBufferedDocs properly.
> This is because the merge policy looks at the current maxBufferedDocs
> to figure out which segments are level 0 (first flushed) or level 1
> (merged from <mergeFactor> level 0 segments).
> I'm not sure how to fix this.  Maybe we can look at net size (bytes)
> of a segment and "infer" level from this?  Still we would have to be
> resilient to the application suddenly increasing the RAM allowed.
> The good news is to workaround this bug I think you just need to
> ensure that your maxBufferedDocs is less than mergeFactor *
> typical-number-of-docs-flushed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge

Posted by "Steven Parkes (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12520130 ] 

Steven Parkes commented on LUCENE-845:
--------------------------------------

This increases file descriptor usage in some cases, right? In the old scheme, if you set mergeFactor to 10 and maxBufferedDocs to 1000, you'd only get 10 segments with size <= 1000. But with this code, you can't bound that anymore. If I create single doc segments (perhaps by flushing based on latency), I can get 30 of them?

Of course, if what we're trying to do is manage the number of file descriptors, we should just do that, rather than using maxBufferedDocs as a proxy (with all its nasty over-merging behavior).

> If you "flush by RAM usage" then IndexWriter may over-merge
> -----------------------------------------------------------
>
>                 Key: LUCENE-845
>                 URL: https://issues.apache.org/jira/browse/LUCENE-845
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-845.patch
>
>
> I think a good way to maximize performance of Lucene's indexing for a
> given amount of RAM is to flush (writer.flush()) the added documents
> whenever the RAM usage (writer.ramSizeInBytes()) has crossed the max
> RAM you can afford.
> But, this can confuse the merge policy and cause over-merging, unless
> you set maxBufferedDocs properly.
> This is because the merge policy looks at the current maxBufferedDocs
> to figure out which segments are level 0 (first flushed) or level 1
> (merged from <mergeFactor> level 0 segments).
> I'm not sure how to fix this.  Maybe we can look at net size (bytes)
> of a segment and "infer" level from this?  Still we would have to be
> resilient to the application suddenly increasing the RAM allowed.
> The good news is to workaround this bug I think you just need to
> ensure that your maxBufferedDocs is less than mergeFactor *
> typical-number-of-docs-flushed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Resolved: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-845.
---------------------------------------

    Resolution: Fixed

> If you "flush by RAM usage" then IndexWriter may over-merge
> -----------------------------------------------------------
>
>                 Key: LUCENE-845
>                 URL: https://issues.apache.org/jira/browse/LUCENE-845
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-845.patch
>
>
> I think a good way to maximize performance of Lucene's indexing for a
> given amount of RAM is to flush (writer.flush()) the added documents
> whenever the RAM usage (writer.ramSizeInBytes()) has crossed the max
> RAM you can afford.
> But, this can confuse the merge policy and cause over-merging, unless
> you set maxBufferedDocs properly.
> This is because the merge policy looks at the current maxBufferedDocs
> to figure out which segments are level 0 (first flushed) or level 1
> (merged from <mergeFactor> level 0 segments).
> I'm not sure how to fix this.  Maybe we can look at net size (bytes)
> of a segment and "infer" level from this?  Still we would have to be
> resilient to the application suddenly increasing the RAM allowed.
> The good news is to workaround this bug I think you just need to
> ensure that your maxBufferedDocs is less than mergeFactor *
> typical-number-of-docs-flushed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12526403 ] 

Michael McCandless commented on LUCENE-845:
-------------------------------------------

In the latest patch on LUCENE-847 I've added methods to
LogDocMergePolicy (setMinMergeDocs) and LogByteSizeMergePolicy
(setMinMergeMB) to set a floor on the segment levels such that all
segments below this min size are aggressively merged as if they were in
one level.  This effectively "truncates" what would otherwise be a
long tail of segment sizes, when you are flushing many tiny segments
into your index.

In order to pick reasonable defaults for the min segment size, I ran
some benchmarks to measure the indexing cost of truncating the tail.

I processed Wiki content into ~4 KB plain text documents and then
indexed the first 10,000 docs using this alg:

  analyzer=org.apache.lucene.analysis.SimpleAnalyzer
  doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
  directory=FSDirectory
  docs.file=/lucene/wiki4K.txt
  max.buffered = 500

  ResetSystemErase
  CreateIndex
  {AddDoc >: 10000
  CloseIndex

  RepSumByName

I'm using the SerialMergeScheduler.

I modified contrib/benchmark to always flush a new segment after each
added document: this simulates the "worst case" of tiny segments, ie,
lowest latency indexing where every added doc must then be visible to
searchers.

Each time is best of 2 runs.  This was run on a Linux (2.6.22.1) Core 2
Duo 2.4 GHz machine with 4 GB RAM and a RAID 5 IO system, using Java 1.5
-server.

    maxBufferedDocs   seconds    slowdown
    10                40         1.0
    100               50         1.3
    200               59         1.5
    300               64         1.6
    400               72         1.8
    500               80         2.0
    750               97         2.4
   1000              114         2.9
   1500              138         3.5
   2000              169         4.2
   3000              205         5.1
   4000              264         6.6
   5000              320         8.0
   7500              404        10.1
  10000              645        16.1

Here's my thinking:

  * If you are flushing zillions of such tiny segments I think it's OK
    to accept a net/net sizable slowdown of your overall indexing
    speed.  I'll choose a 4X slowdown "tolerance" to choose default
    values.  This corresponds roughly to the "2000" line above.
    However, because I tested on a fairly fast CPU & IO system I'll
    multiply the numbers by 0.5.

  * Given this, I propose we default the minMergeMB
    (LogByteSizeMergePolicy) to 1.6 MB (avg size of real segments at
    the 2000 point above was 3.2 MB) and default minMergeDocs
    (LogDocMergePolicy) to 1000.

  * Note that when you are flushing large segments (larger than these
    min size settings) then there is no slowdown at all because the
    flushed segments are already above the minimum size.

These are just defaults, so a given application can always change
their "min merge size" as needed.


> If you "flush by RAM usage" then IndexWriter may over-merge
> -----------------------------------------------------------
>
>                 Key: LUCENE-845
>                 URL: https://issues.apache.org/jira/browse/LUCENE-845
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-845.patch
>
>
> I think a good way to maximize performance of Lucene's indexing for a
> given amount of RAM is to flush (writer.flush()) the added documents
> whenever the RAM usage (writer.ramSizeInBytes()) has crossed the max
> RAM you can afford.
> But, this can confuse the merge policy and cause over-merging, unless
> you set maxBufferedDocs properly.
> This is because the merge policy looks at the current maxBufferedDocs
> to figure out which segments are level 0 (first flushed) or level 1
> (merged from <mergeFactor> level 0 segments).
> I'm not sure how to fix this.  Maybe we can look at net size (bytes)
> of a segment and "infer" level from this?  Still we would have to be
> resilient to the application suddenly increasing the RAM allowed.
> The good news is to workaround this bug I think you just need to
> ensure that your maxBufferedDocs is less than mergeFactor *
> typical-number-of-docs-flushed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12520351 ] 

Michael McCandless commented on LUCENE-845:
-------------------------------------------

> Or here's another random idea: maybe IndexReaders should load the
> tail of "small segments" into a RAMDirectory, for each one. Ie, an
> IndexReader is given a RAM buffer "budget" and it spends it on any
> numerous small segments in the index....?

Following up on this ... I think IndexReader could load "small tail
segments" into RAMDirectory and then do a merge on them to make
search even faster.  It should typically be extremely fast if we set the
defaults right, and RAM usage should be quite low since merging
small segments usually gives great compression in net bytes used.

This would allow us to avoid (or, minimize) the O(N^2) cost on merging
to ensure that an index is "at all instants" ready for a reader to
directly load it.  This basically gives us our "merge tail segments on
demand when a reader refreshes".

We can do a combination of these two approaches, whereby the
IndexWriter is free to make use a "long tail" of segments so it
doesn't have O(N^2) slowdown on merge cost, yet a reader pays very
small (one-time) cost for such segments.

I think the combination of these two changes should give a net/net
sizable improvement on "low latency" apps.... because IndexWriter is
free to make minuscule segments (document by document even) and
IndexReader (especially combined with LUCENE-743) can quickly
re-open and do a "mini-optimize" on the tail segments and have
great performance.


> If you "flush by RAM usage" then IndexWriter may over-merge
> -----------------------------------------------------------
>
>                 Key: LUCENE-845
>                 URL: https://issues.apache.org/jira/browse/LUCENE-845
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-845.patch
>
>
> I think a good way to maximize performance of Lucene's indexing for a
> given amount of RAM is to flush (writer.flush()) the added documents
> whenever the RAM usage (writer.ramSizeInBytes()) has crossed the max
> RAM you can afford.
> But, this can confuse the merge policy and cause over-merging, unless
> you set maxBufferedDocs properly.
> This is because the merge policy looks at the current maxBufferedDocs
> to figure out which segments are level 0 (first flushed) or level 1
> (merged from <mergeFactor> level 0 segments).
> I'm not sure how to fix this.  Maybe we can look at net size (bytes)
> of a segment and "infer" level from this?  Still we would have to be
> resilient to the application suddenly increasing the RAM allowed.
> The good news is to workaround this bug I think you just need to
> ensure that your maxBufferedDocs is less than mergeFactor *
> typical-number-of-docs-flushed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12526460 ] 

Michael McCandless commented on LUCENE-845:
-------------------------------------------

> Shound minMergeDocs default to maxBufferedDocs (that should yield
> the old behavior)?

Good idea -- I think we could do this dynamically so that whenever
IndexWriter is flushing by doc count and the merge policy is
LogDocMergePolicy we "write through" any changes to maxBufferedDocs
--> LogDocMergePolicy.setMinMergeDocs?  I'll take that approach to
keep backwards compatibility.

> Although 1000 isn't bad... much better to slow indexing a little in
> the odd app than to break it by running it out of descriptors.

Agreed, that's the right direction to "err" here.
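
One possible shape of that write-through inside IndexWriter, sketched with
illustrative names (the actual patch may differ):

    public void setMaxBufferedDocs(int maxBufferedDocs) {
      this.maxBufferedDocs = maxBufferedDocs;
      MergePolicy mp = getMergePolicy();
      if (mp instanceof LogDocMergePolicy) {
        // preserve the old behavior: segments below maxBufferedDocs share one level
        ((LogDocMergePolicy) mp).setMinMergeDocs(maxBufferedDocs);
      }
    }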


> If you "flush by RAM usage" then IndexWriter may over-merge
> -----------------------------------------------------------
>
>                 Key: LUCENE-845
>                 URL: https://issues.apache.org/jira/browse/LUCENE-845
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-845.patch
>
>
> I think a good way to maximize performance of Lucene's indexing for a
> given amount of RAM is to flush (writer.flush()) the added documents
> whenever the RAM usage (writer.ramSizeInBytes()) has crossed the max
> RAM you can afford.
> But, this can confuse the merge policy and cause over-merging, unless
> you set maxBufferedDocs properly.
> This is because the merge policy looks at the current maxBufferedDocs
> to figure out which segments are level 0 (first flushed) or level 1
> (merged from <mergeFactor> level 0 segments).
> I'm not sure how to fix this.  Maybe we can look at net size (bytes)
> of a segment and "infer" level from this?  Still we would have to be
> resilient to the application suddenly increasing the RAM allowed.
> The good news is to workaround this bug I think you just need to
> ensure that your maxBufferedDocs is less than mergeFactor *
> typical-number-of-docs-flushed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-845:
--------------------------------------

    Attachment: LUCENE-845.patch


First cut patch.  You have to first apply the most recent patch from
LUCENE-847:

  https://issues.apache.org/jira/secure/attachment/12363880/LUCENE-847.patch.txt

and then apply this patch over it.

This patch has two merge policies:

  LogDocMergePolicy

    This is "backwards compatible" to current merge policy, yet,
    resolve this "over-merge issue" by not using the current setting
    of "maxBufferedDocs" when computing levels.  I think it should
    replace the current LogDocMergePolicy from LUCENE-847.

  LogByteSizeMergePolicy

    Chooses merges according to net size in bytes of all files for a
    segment.  I think we should make this one the default merge
    policy, and also change IndexWriter to flush by RAM by default.

They both subclass from abstract base LogMergePolicy and differ only
in the "size" method which defines how you measure a segment's size (#
docs in that segment or net size in bytes of that segment).

The gist of the approach is the same as the current merge policy: you
generally try to merge segments that are "roughly" the same size
(where size can be doc count or byte size), mergeFactor at a time.

The big difference is instead of starting from maxBufferedDocs and
"going up" to determine level, I start from the max segment size (of
all segments in the index) and "go down" to determine level.  This
resolves the bug because levels are "self-defined" by the segments,
rather than by the current value of maxBufferedDocs on IndexWriter.

I then pick merges exactly the same as the current merge policy: if
any level has >= mergeFactor segments, we merge them.
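
A rough sketch of that "start from the max and go down" level computation
(illustrative only; "size" here stands for whatever the policy measures,
doc count for LogDocMergePolicy or byte size for LogByteSizeMergePolicy):

    // Level 0 holds the largest segments; each band below is a factor of
    // mergeFactor smaller.  Levels are defined by the segments themselves,
    // not by the current maxBufferedDocs setting on the writer.
    int inferLevel(long segmentSize, long largestSegmentSize, int mergeFactor) {
      double norm = Math.log(mergeFactor);
      double top = Math.log(Math.max(1, largestSegmentSize)) / norm;
      double mine = Math.log(Math.max(1, segmentSize)) / norm;
      return (int) Math.floor(top - mine);
    }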

All tests pass, except:

  * One assert in testAddIndexesNoOptimize which was relying on the
    specific invariants of the current merge policy (it's the same
    assert that LUCENE-847 had changed; this assert is testing
    particular corner cases of the current merge policy).  Changing
    the assertEquals to "4" instead of "3" fixes it.

  * TestLogDocMergePolicy (added in LUCENE-847) doesn't compile
    against the new version above because it's using methods that
    don't exist in the new one.


> If you "flush by RAM usage" then IndexWriter may over-merge
> -----------------------------------------------------------
>
>                 Key: LUCENE-845
>                 URL: https://issues.apache.org/jira/browse/LUCENE-845
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-845.patch
>
>
> I think a good way to maximize performance of Lucene's indexing for a
> given amount of RAM is to flush (writer.flush()) the added documents
> whenever the RAM usage (writer.ramSizeInBytes()) has crossed the max
> RAM you can afford.
> But, this can confuse the merge policy and cause over-merging, unless
> you set maxBufferedDocs properly.
> This is because the merge policy looks at the current maxBufferedDocs
> to figure out which segments are level 0 (first flushed) or level 1
> (merged from <mergeFactor> level 0 segments).
> I'm not sure how to fix this.  Maybe we can look at net size (bytes)
> of a segment and "infer" level from this?  Still we would have to be
> resilient to the application suddenly increasing the RAM allowed.
> The good news is to workaround this bug I think you just need to
> ensure that your maxBufferedDocs is less than mergeFactor *
> typical-number-of-docs-flushed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge

Posted by "Steven Parkes (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12520360 ] 

Steven Parkes commented on LUCENE-845:
--------------------------------------

> I think the combination of these two changes should give a net/net
> sizable improvement on "low latency" apps....

I think this would be great. It's always been a pet peeve of mine that even in low pressure/activity environments, there is often a delay from write to read.

Sounds like this would help take most of the work/risk off the developer.

> If you "flush by RAM usage" then IndexWriter may over-merge
> -----------------------------------------------------------
>
>                 Key: LUCENE-845
>                 URL: https://issues.apache.org/jira/browse/LUCENE-845
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-845.patch
>
>
> I think a good way to maximize performance of Lucene's indexing for a
> given amount of RAM is to flush (writer.flush()) the added documents
> whenever the RAM usage (writer.ramSizeInBytes()) has crossed the max
> RAM you can afford.
> But, this can confuse the merge policy and cause over-merging, unless
> you set maxBufferedDocs properly.
> This is because the merge policy looks at the current maxBufferedDocs
> to figure out which segments are level 0 (first flushed) or level 1
> (merged from <mergeFactor> level 0 segments).
> I'm not sure how to fix this.  Maybe we can look at net size (bytes)
> of a segment and "infer" level from this?  Still we would have to be
> resilient to the application suddenly increasing the RAM allowed.
> The good news is to workaround this bug I think you just need to
> ensure that your maxBufferedDocs is less than mergeFactor *
> typical-number-of-docs-flushed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge

Posted by "Steven Parkes (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12520336 ] 

Steven Parkes commented on LUCENE-845:
--------------------------------------

> Here's an idea: maybe we can accept the O(N^2) merge cost, when the
> segments are "small"?

That's basically the underlying idea I was trying to get at.

> If you "flush by RAM usage" then IndexWriter may over-merge
> -----------------------------------------------------------
>
>                 Key: LUCENE-845
>                 URL: https://issues.apache.org/jira/browse/LUCENE-845
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-845.patch
>
>
> I think a good way to maximize performance of Lucene's indexing for a
> given amount of RAM is to flush (writer.flush()) the added documents
> whenever the RAM usage (writer.ramSizeInBytes()) has crossed the max
> RAM you can afford.
> But, this can confuse the merge policy and cause over-merging, unless
> you set maxBufferedDocs properly.
> This is because the merge policy looks at the current maxBufferedDocs
> to figure out which segments are level 0 (first flushed) or level 1
> (merged from <mergeFactor> level 0 segments).
> I'm not sure how to fix this.  Maybe we can look at net size (bytes)
> of a segment and "infer" level from this?  Still we would have to be
> resilient to the application suddenly increasing the RAM allowed.
> The good news is to workaround this bug I think you just need to
> ensure that your maxBufferedDocs is less than mergeFactor *
> typical-number-of-docs-flushed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12520611 ] 

Yonik Seeley commented on LUCENE-845:
-------------------------------------

Merging small segments in the reader seems like a cool idea on its own.
But if it's an acceptable hit to merge in the reader, why is it not in the writer?

Think about a writer flushing 10 small segments and a new reader opened each time:
The reader would do ~10*10/2 merges if it just merged the small segments.
If the writer were to do the merging instead, it would need to merge ~10 segments.
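
Spelling out that estimate: if the reader re-opens after each of the 10
flushes and each open merges every small segment seen so far, that is
roughly 1 + 2 + ... + 10 = 55, i.e. about 10*10/2 small-segment merge
operations, versus a single merge of ~10 segments if the writer did the
work once.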

Thinking about it another way... if there were no separation between reader and writer, and small segments were merged on an open, why not just write out the result so it wouldn't have to be done again?  Now move "merge on an open" to "merge on the close" and that's what IndexWriter currently does.  Why is it OK for a reader to pay the price but not the writer?

Also, would this tail merging on an open be able to reduce the peak number of file descriptors?
It seems like to do so, the tail would have to be merged *before* other index files were opened, further complicating matters.


> If you "flush by RAM usage" then IndexWriter may over-merge
> -----------------------------------------------------------
>
>                 Key: LUCENE-845
>                 URL: https://issues.apache.org/jira/browse/LUCENE-845
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-845.patch
>
>
> I think a good way to maximize performance of Lucene's indexing for a
> given amount of RAM is to flush (writer.flush()) the added documents
> whenever the RAM usage (writer.ramSizeInBytes()) has crossed the max
> RAM you can afford.
> But, this can confuse the merge policy and cause over-merging, unless
> you set maxBufferedDocs properly.
> This is because the merge policy looks at the current maxBufferedDocs
> to figure out which segments are level 0 (first flushed) or level 1
> (merged from <mergeFactor> level 0 segments).
> I'm not sure how to fix this.  Maybe we can look at net size (bytes)
> of a segment and "infer" level from this?  Still we would have to be
> resilient to the application suddenly increasing the RAM allowed.
> The good news is to workaround this bug I think you just need to
> ensure that your maxBufferedDocs is less than mergeFactor *
> typical-number-of-docs-flushed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12484175 ] 

Michael McCandless commented on LUCENE-845:
-------------------------------------------

>> With LUCENE-843 I'm thinking we should deprecate maxBufferedDocs entirely
>
> That might present a problem for users of ParallelReader. Right now,
> it's possible to construct two indices with corresponding
> docids.... switching to flush-by-RAM would make segment merging
> unpredictable and destroy the docid matching.

Ahhh, this is a very good point.  OK I won't deprecate "flushing by
doc count" and instead will allow either "flush by RAM usage" (default
to this?) or "flush by doc count".


> If you "flush by RAM usage" then IndexWriter may over-merge
> -----------------------------------------------------------
>
>                 Key: LUCENE-845
>                 URL: https://issues.apache.org/jira/browse/LUCENE-845
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>
> I think a good way to maximize performance of Lucene's indexing for a
> given amount of RAM is to flush (writer.flush()) the added documents
> whenever the RAM usage (writer.ramSizeInBytes()) has crossed the max
> RAM you can afford.
> But, this can confuse the merge policy and cause over-merging, unless
> you set maxBufferedDocs properly.
> This is because the merge policy looks at the current maxBufferedDocs
> to figure out which segments are level 0 (first flushed) or level 1
> (merged from <mergeFactor> level 0 segments).
> I'm not sure how to fix this.  Maybe we can look at net size (bytes)
> of a segment and "infer" level from this?  Still we would have to be
> resilient to the application suddenly increasing the RAM allowed.
> The good news is to workaround this bug I think you just need to
> ensure that your maxBufferedDocs is less than mergeFactor *
> typical-number-of-docs-flushed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org