Posted to solr-user@lucene.apache.org by "Burton-West, Tom" <tb...@umich.edu> on 2010/12/16 00:52:57 UTC

Memory use during merges (OOM)

Hello all,

Are there any general guidelines for determining the main factors in memory use during merges?

We recently changed our indexing configuration to speed up indexing, but in the process of doing a very large merge we are running out of memory.
Below is a list of the changes and part of the IndexWriter log.  The changes increased indexing throughput by almost an order of magnitude
(from about 600 documents per hour to about 6,000 documents per hour; our documents are about 800 KB each).

We are trying to determine which of the changes to tweak to avoid the OOM while still keeping the benefit of the increased indexing throughput.

Is it likely that the change to ramBufferSizeMB is the culprit, or could it be the mergeFactor change from 10 to 20?

 Is there any obvious relationship between ramBufferSizeMB and the memory consumed by Solr?
 Are there rules of thumb for the memory needed in terms of the number or size of segments?

Our largest segments prior to the failed merge attempt were between 5GB and 30GB.  The memory allocated to the Solr/tomcat JVM is 10GB.

Tom Burton-West
-----------------------------------------------------------------

Changes to indexing configuration:
mergeScheduler
        before: serialMergeScheduler
        after:  concurrentMergeScheduler
mergeFactor
        before: 10
        after:  20
ramBufferSizeMB
        before: 32
        after:  320
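
For reference, these map to the following Lucene-level calls (a sketch against the pre-3.1 IndexWriter API; "dir" and "analyzer" stand in for our actual directory and analyzer):

    IndexWriter writer = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
    writer.setMergeScheduler(new ConcurrentMergeScheduler()); // was SerialMergeScheduler
    writer.setMergeFactor(20);                                // was 10
    writer.setRAMBufferSizeMB(320);                           // was 32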

excerpt from indexWriter.log

Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; http-8091-Processor70]: LMP: findMerges: 40 segments
Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; http-8091-Processor70]: LMP:   level 7.23609 to 7.98609: 20 segments
Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; http-8091-Processor70]: LMP:     0 to 20: add this merge
Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; http-8091-Processor70]: LMP:   level 5.44878 to 6.19878: 20 segments
Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; http-8091-Processor70]: LMP:     20 to 40: add this merge

...
Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; http-8091-Processor70]: applyDeletes
Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; http-8091-Processor70]: DW: apply 1320 buffered deleted terms and 0 deleted docIDs and 0 deleted queries on 40 segments.
Dec 14, 2010 5:48:17 PM IW 0 [Tue Dec 14 17:48:17 EST 2010; http-8091-Processor70]: hit exception flushing deletes
Dec 14, 2010 5:48:17 PM IW 0 [Tue Dec 14 17:48:17 EST 2010; http-8091-Processor70]: hit OutOfMemoryError inside updateDocument
tom


Re: Memory use during merges (OOM)

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Thu, Dec 16, 2010 at 5:51 AM, Michael McCandless
<lu...@mikemccandless.com> wrote:
> If you are doing false deletions (calling .updateDocument when in fact
> the Term you are replacing cannot exist) it'd be best if possible to
> change the app to not call .updateDocument if you know the Term
> doesn't exist.

FWIW, if you're going to add a batch of documents you know aren't
already in the index, you can use the "overwrite=false" parameter for
that Solr update request.
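
For example (just a sketch -- the port, update URL, and document fields here are made up), posting the batch with overwrite=false skips the per-document delete lookup:

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    // overwrite=false tells Solr not to delete any existing doc with the same uniqueKey
    URL url = new URL("http://localhost:8091/solr/update?overwrite=false");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setDoOutput(true); // makes this a POST
    conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
    OutputStream out = conn.getOutputStream();
    out.write("<add><doc><field name=\"id\">new-doc-1</field></doc></add>".getBytes("UTF-8"));
    out.close();
    System.out.println("HTTP " + conn.getResponseCode());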

-Yonik
http://www.lucidimagination.com

Re: Memory use during merges (OOM)

Posted by Michael McCandless <lu...@mikemccandless.com>.
Actually, the terms index is something different.

If you don't use CFS, go and look at the size of *.tii in your index
directory -- those are the terms index.  The terms index holds a
subset of the terms (by default every 128th term) in RAM (plus some
metadata) in order to make seeking to a specific term faster.

Unfortunately they are held in a RAM-intensive way, but the upcoming
4.0 release greatly reduces that.
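
For example (a sketch; the path and divisor are made up, and this assumes your Lucene version has the termInfosIndexDivisor overload of IndexReader.open), you can subsample the terms index when opening a reader to cut that RAM:

    import java.io.File;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    Directory dir = FSDirectory.open(new File("/path/to/index"));
    // Read-only reader; divisor=8 loads every 8th indexed term, so only
    // every 8 * 128 = 1024th term overall is held in RAM.
    IndexReader reader = IndexReader.open(dir, null, true, 8);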

Mike

On Thu, Dec 16, 2010 at 2:27 PM, Robert Petersen <ro...@buy.com> wrote:
> Thanks Mike!  When you say 'term index of the segment readers', are you referring to the term vectors?

RE: Memory use during merges (OOM)

Posted by Robert Petersen <ro...@buy.com>.
Thanks Mike!  When you say 'term index of the segment readers', are you referring to the term vectors?

In our case our index of 8 million docs holds pretty 'skinny' docs containing searchable product titles and keywords, with the rest of each doc only holding IDs for faceting on.  Docs typically contain few unique terms each, with a lot of overlap of the terms across categories of docs (all similar products), so I'm thinking our unique term count is low relative to the size of our index.  The way we continually spin out deletes and adds should keep the terms loaded all the time.

Every couple of weeks a propagation happens which kills the slave farm with OOMs.  We are bumping the heap up a couple of gigs every time this happens and hoping it goes away at this point.  That is why I jumped into this discussion; sorry for butting in like that.  You guys are discussing very interesting settings I had not considered before.

Rob



Re: Memory use during merges (OOM)

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Thu, Dec 16, 2010 at 2:09 PM, Burton-West, Tom <tb...@umich.edu> wrote:
> Thanks Mike,
>
>>>But, if you are doing deletions (or updateDocument, which is just a
>>>delete + add under-the-hood), then this will force the terms index of
>>>the segment readers to be loaded, thus consuming more RAM.
>
> Out of 700,000 docs, by the time we get to doc 600,000, there is a good chance a few documents have been updated, which would cause a delete + add.

OK, then you do need .updateDocument there rather than .addDocument.

>>>One workaround for large terms index is to set the terms index divisor
>>>that IndexWriter should use whenever it loads a terms index (this is
>>>IndexWriter.setReaderTermsIndexDivisor).
>
> I always get confused about the two different divisors and their names in the solrconfig.xml file
>
> We are setting  termInfosIndexDivisor, which I think translates to the Lucene IndexWriter.setReaderTermsIndexDivisor
>
> <indexReaderFactory name="IndexReaderFactory" class="org.apache.solr.core.StandardIndexReaderFactory">
>    <int name="termInfosIndexDivisor">8</int>
>  </indexReaderFactory >
>
> The other one is termIndexInterval which is set on the writer and determines what gets written to the tii file.  I don't remember how to set this in Solr.
>
> Are we setting the right one to reduce RAM usage during merging?

It's even more confusing!

There are three settings.  The first tells IW how sparse the terms
index is, i.e. how often a term is indexed (default is every 128th
term).  The second tells IndexReader whether to sub-sample these on
load (default is 1, meaning load all indexed terms; if you set it to 2
then only every 2*128 = 256th term is loaded).  The third is the same
sub-sampling setting on IW, used whenever it internally must open a
reader (eg to apply deletes).

The last two are really the same setting, just that one is passed when
you open IndexReader yourself, and the other is passed whenever IW
needs to open a reader.

But, I'm not sure how these settings are named in solrconfig.xml.
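
A sketch of the three, using the Lucene names mentioned in this thread (pre-3.1 API; the values are only illustrations, and "writer"/"dir" are assumed to exist):

    // 1. Writer side: how sparse the on-disk terms index is (default 128).
    writer.setTermIndexInterval(128);
    // 2. Reader side: sub-sample the indexed terms at load time (divisor=2
    //    keeps every 2nd indexed term, i.e. every 256th term overall).
    IndexReader reader = IndexReader.open(dir, null, true, 2);
    // 3. The same divisor, but for the readers IndexWriter opens internally
    //    (eg to apply deletes).
    writer.setReaderTermsIndexDivisor(2);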

>> So I think the gist is... the RAM usage will be in proportion to the
>> net size of the merge (mergeFactor + how big each merged segment is),
>> how many merges you allow concurrently, and whether you do false or
>> true deletions
>
> Does an optimize do something differently?

No, optimize is the same deal.  But, because it's a big merge
(especially the last one), it has the highest RAM usage of all merges.

Mike

RE: Memory use during merges (OOM)

Posted by "Burton-West, Tom" <tb...@umich.edu>.
Thanks Robert, 

We will try the termIndexInterval as a workaround.  I have also opened a JIRA issue: https://issues.apache.org/jira/browse/SOLR-2290.
Hope I found the right sections of the Lucene code.  I'm just now looking at the Solr IndexReaderFactory, SolrIndexWriter, and SolrIndexConfig, trying to better understand how the solrconfig.xml settings get applied to the readers and writers.

Tom

Re: Memory use during merges (OOM)

Posted by Robert Muir <rc...@gmail.com>.
On Thu, Dec 16, 2010 at 4:03 PM, Burton-West, Tom <tb...@umich.edu> wrote:
>>>Your setting isn't being applied to the reader IW uses during
>>>merging... it's only for readers Solr opens from directories
>>>explicitly.
>>>I think you should open a jira issue!
>
> Do I understand correctly that this setting in theory could be applied to the reader IW uses during merging but is not currently being applied?

Yes.  I'm not really sure (especially given the "name=") whether it's
possible, or was planned, to have multiple IR factories in Solr, e.g. a
separate one for spellchecking.
So I'm not sure if we should (hackishly) steal this parameter from the
IR factory (it is common to all IR factories, not just
StandardIRFactory) and apply it to IW...

But we could at least expose the divisor param separately to the IW
config so you have some way of setting it.

>
> <indexReaderFactory name="IndexReaderFactory" class="org.apache.solr.core.StandardIndexReaderFactory">
>    <int name="termInfosIndexDivisor">8</int>
>  </indexReaderFactory >
>
> I understand the tradeoffs for doing this during searching, but not the tradeoffs for doing this during merging.  Is the use during merging similar to the use during searching?
>
>  i.e. Some process has to look up data for a particular term as opposed to having to iterate through all the terms?
>  (Haven't yet dug into the merging/indexing code).

It needs it for applying deletes...

As a workaround (if you are reindexing), maybe instead of using the
terms index divisor = 8 you could set the terms index interval = 1024
(8 * 128)?

This will solve your merging problem, and it has the same perf
characteristics as divisor = 8, except you can't "go back down" like
you can with the divisor without reindexing with a smaller interval...

If you've already tested that performance with the divisor of 8 is
acceptable (or in your case maybe necessary!), it sort of makes sense
to 'bake it in' by setting your divisor back to 1 and your interval =
1024 instead...
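
In Lucene terms, the bake-in would look something like this (a sketch; it assumes you reindex afterwards, and that "writer" already exists):

    // Write a sparser terms index: every 1024th term instead of every 128th.
    writer.setTermIndexInterval(1024);
    // ...and stop sub-sampling at read time, since the interval is baked in.
    writer.setReaderTermsIndexDivisor(1);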

Re: Memory use during merges (OOM)

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Thu, Dec 16, 2010 at 4:03 PM, Burton-West, Tom <tb...@umich.edu> wrote:
>>>Your setting isn't being applied to the reader IW uses during
>>>merging... it's only for readers Solr opens from directories
>>>explicitly.
>>>I think you should open a jira issue!
>
> Do I understand correctly that this setting in theory could be applied to the reader IW uses during merging but is not currently being applied?
>
> <indexReaderFactory name="IndexReaderFactory" class="org.apache.solr.core.StandardIndexReaderFactory">
>    <int name="termInfosIndexDivisor">8</int>
>  </indexReaderFactory >

Yes.

> I understand the tradeoffs for doing this during searching, but not the tradeoffs for doing this during merging.  Is the use during merging similar to the use during searching?
>
>  i.e. Some process has to look up data for a particular term as opposed to having to iterate through all the terms?
>  (Haven't yet dug into the merging/indexing code).

It's not used during merging, only for applying deletes.  But, yes, we
do a lookup of the Term (or Terms inside Query, if you
delete-by-Query) from the terms index.
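
ie each buffered delete like this one (a sketch; the field and value are made up) gets looked up in every segment's terms index when deletes are applied:

    // Buffered in RAM, then resolved against each segment's terms index.
    writer.deleteDocuments(new Term("id", "doc-12345"));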

Mike

RE: Memory use during merges (OOM)

Posted by "Burton-West, Tom" <tb...@umich.edu>.
>>Your setting isn't being applied to the reader IW uses during
>>merging... it's only for readers Solr opens from directories
>>explicitly.
>>I think you should open a jira issue!

Do I understand correctly that this setting in theory could be applied to the reader IW uses during merging but is not currently being applied?   

<indexReaderFactory name="IndexReaderFactory" class="org.apache.solr.core.StandardIndexReaderFactory">
    <int name="termInfosIndexDivisor">8</int>
  </indexReaderFactory >

I understand the tradeoffs for doing this during searching, but not the tradeoffs for doing this during merging.  Is the use during merging similar to the use during searching?

 i.e. Some process has to look up data for a particular term as opposed to having to iterate through all the terms?  
 (Haven't yet dug into the merging/indexing code).   

Tom



Re: Memory use during merges (OOM)

Posted by Robert Muir <rc...@gmail.com>.
On Thu, Dec 16, 2010 at 2:09 PM, Burton-West, Tom <tb...@umich.edu> wrote:
>
> I always get confused about the two different divisors and their names in the solrconfig.xml file

This one (for the writer) isn't configurable by Solr.  Want to open an issue?

>
> We are setting  termInfosIndexDivisor, which I think translates to the Lucene IndexWriter.setReaderTermsIndexDivisor
>
> <indexReaderFactory name="IndexReaderFactory" class="org.apache.solr.core.StandardIndexReaderFactory">
>    <int name="termInfosIndexDivisor">8</int>
>  </indexReaderFactory >
>
> The other one is termIndexInterval which is set on the writer and determines what gets written to the tii file.  I don't remember how to set this in Solr.
>
> Are we setting the right one to reduce RAM usage during merging?
>

When you write the terms, Lucene creates a terms dictionary and a
terms index.  The termIndexInterval (default 128) controls how many
terms go into the index: for example, every 128th term.

The divisor just samples this at runtime... e.g. with your divisor of
8 it's only reading every 8th term from the index (or, another way to
see it, only every 8*128th term is read into RAM).

Your setting isn't being applied to the reader IW uses during
merging... it's only for readers Solr opens from directories
explicitly.
I think you should open a jira issue!

RE: Memory use during merges (OOM)

Posted by "Burton-West, Tom" <tb...@umich.edu>.
Thanks Mike,

>>But, if you are doing deletions (or updateDocument, which is just a
>>delete + add under-the-hood), then this will force the terms index of
>>the segment readers to be loaded, thus consuming more RAM.

Out of 700,000 docs, by the time we get to doc 600,000, there is a good chance a few documents have been updated, which would cause a delete + add.


>>One workaround for large terms index is to set the terms index divisor
>>that IndexWriter should use whenever it loads a terms index (this is
>>IndexWriter.setReaderTermsIndexDivisor).

I always get confused about the two different divisors and their names in the solrconfig.xml file.

We are setting termInfosIndexDivisor, which I think translates to the Lucene IndexWriter.setReaderTermsIndexDivisor:

<indexReaderFactory name="IndexReaderFactory" class="org.apache.solr.core.StandardIndexReaderFactory">
    <int name="termInfosIndexDivisor">8</int>
  </indexReaderFactory >

The other one is termIndexInterval, which is set on the writer and determines what gets written to the .tii file.  I don't remember how to set this in Solr.

Are we setting the right one to reduce RAM usage during merging?


> So I think the gist is... the RAM usage will be in proportion to the
> net size of the merge (mergeFactor + how big each merged segment is),
> how many merges you allow concurrently, and whether you do false or
> true deletions

Does an optimize do something differently?  

Tom


Re: Memory use during merges (OOM)

Posted by Michael McCandless <lu...@mikemccandless.com>.
It's not that it's "bad"; it's just that Lucene must do extra work to
check whether these deletes are real or not, and that extra work
requires loading the terms index, which will consume additional RAM.

For most apps, though, the terms index is relatively small and so this
isn't really an issue.  But if your terms index is large, this can
explain the added RAM usage.

One workaround for a large terms index is to set the terms index
divisor that IndexWriter should use whenever it loads a terms index
(this is IndexWriter.setReaderTermsIndexDivisor).

Mike

On Thu, Dec 16, 2010 at 12:17 PM, Robert Petersen <ro...@buy.com> wrote:
> Hello we occasionally bump into the OOM issue during merging after propagation too, and from the discussion below I guess we are doing thousands of 'false deletions' by unique id to make sure certain documents are *not* in the index.  Could anyone explain why that is bad?  I didn't really understand the conclusion below.

RE: Memory use during merges (OOM)

Posted by Robert Petersen <ro...@buy.com>.
Hello, we occasionally bump into the OOM issue during merging after propagation too, and from the discussion below I guess we are doing thousands of 'false deletions' by unique id to make sure certain documents are *not* in the index.  Could anyone explain why that is bad?  I didn't really understand the conclusion below.


Re: Memory use during merges (OOM)

Posted by Michael McCandless <lu...@mikemccandless.com>.
RAM usage for merging is tricky.

First off, merging must hold open a SegmentReader for each segment
being merged.  However, it's not necessarily a full segment reader;
for example, merging doesn't need the terms index nor norms.  But it
will load deleted docs.

But, if you are doing deletions (or updateDocument, which is just a
delete + add under-the-hood), then this will force the terms index of
the segment readers to be loaded, thus consuming more RAM.
Furthermore, if the deletions you do (by Term/Query) in fact result in
deleted documents (ie they were not "false" deletions), then the
merging allocates an int[maxDoc()] for each SegmentReader that has
deletions.

Finally, if you have multiple merges running at once (see
CSM.setMaxMergeCount) that means RAM for each currently running merge
is tied up.

So I think the gist is... the RAM usage will be in proportion to the
net size of the merge (mergeFactor + how big each merged segment is),
how many merges you allow concurrently, and whether you do false or
true deletions.
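
As a back-of-the-envelope for just those deletion maps (the numbers here are hypothetical):

    int mergeFactor = 20;             // segments merged at once
    int maxDocPerSegment = 700000;    // assumed per-segment doc count
    // one int[maxDoc()] per SegmentReader that has deletions:
    long bytes = (long) mergeFactor * maxDocPerSegment * 4;  // ~56 MB, before anything else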

If you are doing false deletions (calling .updateDocument when in fact
the Term you are replacing cannot exist) it'd be best if possible to
change the app to not call .updateDocument if you know the Term
doesn't exist.

Mike

On Wed, Dec 15, 2010 at 6:52 PM, Burton-West, Tom <tb...@umich.edu> wrote:
> Hello all,
>
> Are there any general guidelines for determining the main factors in memory use during merges?

Re: Memory use during merges (OOM)

Posted by Upayavira <uv...@odoko.co.uk>.
How long does it take to reach this OOM situation?  Is it possible for
you to try a merge with each setting in turn and evaluate what impact
each has on indexing speed and memory consumption?  It might also be
interesting to watch garbage collection with jstat while it is
running, as that could be your speed bottleneck.

Upayavira

On Wed, 15 Dec 2010 18:52 -0500, "Burton-West, Tom" <tb...@umich.edu>
wrote:
> Hello all,
> 
> Are there any general guidelines for determining the main factors in
> memory use during merges?