Posted to solr-user@lucene.apache.org by Shamik Bandopadhyay <sh...@gmail.com> on 2017/09/18 14:24:43 UTC

Solr nodes crashing (OOM) after 6.6 upgrade

Hi,

   I recently upgraded to Solr 6.6 from 5.5. After running for a couple of
days, the entire Solr cluster suddenly came down with an OOM exception. Once
the servers are restarted, the memory footprint stays stable for a while
before a sudden spike in memory occurs. The heap surges quickly and hits the
max, causing the JVM to shut down due to OOM. It starts with one server but
eventually trickles down to the rest of the nodes, bringing the entire
cluster down within a span of 10-15 mins.

The cluster consists of 6 nodes with two shards having 2 replicas each.
There are two collections with total index size close to 24 gb. Each server
has 8 CPUs with 30gb memory. Solr is running on an embedded jetty on jdk
1.8. The JVM parameters are identical to 5.5:

SOLR_JAVA_MEM="-Xms1000m -Xmx290000m"

GC_LOG_OPTS="-verbose:gc -XX:+PrintHeapAtGC -XX:+PrintGCDetails \
  -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps
-XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime"

GC_TUNE="-XX:NewRatio=3 \
-XX:SurvivorRatio=4 \
-XX:TargetSurvivorRatio=90 \
-XX:MaxTenuringThreshold=8 \
-XX:+UseConcMarkSweepGC \
-XX:+UseParNewGC \
-XX:ConcGCThreads=4 -XX:ParallelGCThreads=4 \
-XX:+CMSScavengeBeforeRemark \
-XX:PretenureSizeThreshold=64m \
-XX:+UseCMSInitiatingOccupancyOnly \
-XX:CMSInitiatingOccupancyFraction=50 \
-XX:CMSMaxAbortablePrecleanTime=6000 \
-XX:+CMSParallelRemarkEnabled \
-XX:+ParallelRefProcEnabled"

I've tried G1GC based on Shawn's wiki, but it didn't make any difference.
Though G1GC seemed to do well initially, it showed similar behaviour during
the spike, which prompted me to revert to CMS.

I'm doing a hard commit every 5 mins.

SOLR_OPTS="$SOLR_OPTS -Xss256k"
SOLR_OPTS="$SOLR_OPTS -Dsolr.autoCommit.maxTime=300000"
SOLR_OPTS="$SOLR_OPTS -Dsolr.clustering.enabled=true"
SOLR_OPTS="$SOLR_OPTS -Dpkiauth.ttl=120000"

Other Solr configurations:

<autoSoftCommit>
<maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
</autoSoftCommit>

Cache settings:

<maxBooleanClauses>4096</maxBooleanClauses>
<slowQueryThresholdMillis>1000</slowQueryThresholdMillis>
<filterCache class="solr.FastLRUCache" size="20000" initialSize="4096"
autowarmCount="512"/>
<queryResultCache class="solr.LRUCache" size="2000" initialSize="500"
autowarmCount="100"/>
<documentCache class="solr.LRUCache" size="60000" initialSize="5000"
autowarmCount="0"/>
<cache name="perSegFilter" class="solr.search.LRUCache" size="10"
initialSize="0" autowarmCount="10" regenerator="solr.NoOpRegenerator" />
<fieldValueCache class="solr.FastLRUCache" size="20000"
autowarmCount="4096" showItems="1024" />
<cache enable="${solr.ltr.enabled:false}" name="QUERY_DOC_FV"
class="solr.search.LRUCache" size="4096" initialSize="2048"
autowarmCount="4096" regenerator="solr.search.NoOpRegenerator" />
<enableLazyFieldLoading>true</enableLazyFieldLoading>
<queryResultWindowSize>200</queryResultWindowSize>
<queryResultMaxDocsCached>400</queryResultMaxDocsCached>

I'm not sure what has changed so drastically in 6.6 compared to 5.5. I never
had a single OOM in 5.5, which had been running for a couple of years.
Moreover, the memory footprint was much smaller with 15gb set as Xmx. All my
facet fields have docValues enabled, which should handle the memory part
efficiently.
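
(For example, the facet fields are declared roughly like this in the schema;
the field name here is just illustrative:)

<field name="facet_category" type="string" indexed="true" stored="false"
docValues="true" multiValued="true"/>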

I'm struggling to figure out the root cause. Does 6.6 command more memory
than what is currently available on our servers (30gb)? What might be the
probable cause for this sort of scenario? What are the best practices to
troubleshoot such issues?

Any pointers will be appreciated.

Thanks,
Shamik

Re: Solr nodes crashing (OOM) after 6.6 upgrade

Posted by Damien Kamerman <da...@gmail.com>.
A suggester rebuild will mmap the entire index, so you'll need free memory
proportional to your index size.
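
(If the rebuilds are happening automatically, it's worth checking whether the
suggester is configured with buildOnStartup or buildOnCommit. A rough sketch of
the relevant solrconfig.xml section -- the suggester name, dictionary and field
are just placeholders:)

<searchComponent name="suggest" class="solr.SuggestComponent">
<lst name="suggester">
<str name="name">mySuggester</str>
<str name="lookupImpl">AnalyzingInfixLookupFactory</str>
<str name="dictionaryImpl">DocumentDictionaryFactory</str>
<str name="field">title</str>
<str name="buildOnStartup">false</str>
<str name="buildOnCommit">false</str>
</lst>
</searchComponent>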

On 19 September 2017 at 13:47, shamik <sh...@gmail.com> wrote:

> I agree, should have made it clear in my initial post. The reason I thought
> it's little trivial since the newly introduced collection has only few
> hundred documents and is not being used in search yet. Neither it's being
> indexed at a regular interval. The cache parameters are kept to a minimum
> as
> well. But there might be overheads of a simply creating a collection which
> I'm not aware of.
>
> I did bring down the heap size to 8gb, changed to G1 and reduced the cache
> params. The memory so far has been holding up but will wait for a while
> before passing on a judgment.
>
> <filterCache class="solr.FastLRUCache" size="256" initialSize="256"
> autowarmCount="0"/>
> <queryResultCache class="solr.LRUCache" size="256" initialSize="256"
> autowarmCount="0"/>
> <documentCache class="solr.LRUCache" size="256" initialSize="256"
> autowarmCount="0"/>
> <cache name="perSegFilter" class="solr.search.LRUCache" size="10"
> initialSize="0" autowarmCount="10" regenerator="solr.NoOpRegenerator" />
> <fieldValueCache class="solr.FastLRUCache" size="256" autowarmCount="256"
> showItems="0" />
>
> The change seemed to have increased the number of slow queries (1000 ms),
> but I'm willing to address the OOM over performance at this point. One
> thing
> I realized is that I provided the wrong index size here. It's 49gb instead
> of 25, which I mistakenly picked from one shard. I hope the heap size will
> continue to sustain for the index size.
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>

Re: Solr nodes crashing (OOM) after 6.6 upgrade

Posted by Michael Kuhlmann <ku...@solr.info>.
Hi Shamik,

funny enough, we had a similar issue with our old legacy application
that still used plain Lucene code in a JBoss container.

Same here: there were no specific queries or updates causing this; the
performance just broke down completely without any unusual usage. GC was
climbing to 99% or so. Sometimes it came back after a while, but most often
we had to completely restart JBoss.

I never figured out what the root cause was, but my suspicion still is
that Lucene was innocent. I rather suspect Rackspace's hypervisor was to
blame.

So maybe you can give it a try and have a look at the Amazon cloud settings?

Best,
Michael

On 22.09.2017 at 12:00, shamik wrote:
> All the tuning and scaling down of memory seemed to be stable for a couple of
> days but then came down due to a huge spike in CPU usage, contributed by G1
> Old Generation GC. I'm really puzzled why the instances are suddenly
> behaving like this. It's not that a sudden surge of load contributed to
> this, query and indexing load seemed to be comparable with the previous time
> frame. Just wondering if the hardware itself is not adequate enough for 6.6.
> The instances are all running on 8 CPU / 30gb m3.2xlarge EC2 instances.
> 
> Does anyone ever face issues similar to this?
> 
> 
> 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
> 


Re: Solr nodes crashing (OOM) after 6.6 upgrade

Posted by shamik <sh...@gmail.com>.
Susheel, my inference was based on the QTime value from the Solr log, not the
application log. Before the CPU spike, the query times didn't give any
indication that queries were slowing down. Once the GC suddenly triggers high
CPU usage, query execution slows down or chokes, but that can easily be
attributed to the lack of available processing power.

I'm curious to know what the recommended hardware is for 6.6 with a 50gb
index and 15 million+ documents.



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html

Re: Solr nodes crashing (OOM) after 6.6 upgrade

Posted by Susheel Kumar <su...@gmail.com>.
It may happen that you never find the query time logged for the queries
which caused the OOM, because your app never got a chance to log how long
they took...

So if you have proper exception handling in your client code, you may see the
exception being logged but not the query time for such queries.

Thnx

On Fri, Sep 22, 2017 at 6:32 AM, shamik <sh...@gmail.com> wrote:

> I usually log queries that took more than 1sec. Based on the logs, I
> haven't
> seen anything alarming or surge in terms of slow queries, especially around
> the time when the CPU spike happened.
>
> I don't necessarily have the data for deep paging, but the usage of sort
> parameter (date in our case) has been typically low. We also restrict 10
> results per page for pagination. Are there are recommendations around this?
>
> Again, I don't want to sound like a broken record, but I still don't get
> the
> part why these issues crop in 6.6 as compared to 5.5
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>

Re: Solr nodes crashing (OOM) after 6.6 upgrade

Posted by shamik <sh...@gmail.com>.
I usually log queries that take more than 1 sec. Based on the logs, I haven't
seen anything alarming or any surge in slow queries, especially around the
time when the CPU spike happened.

I don't necessarily have the data for deep paging, but usage of the sort
parameter (date in our case) has typically been low. We also restrict
pagination to 10 results per page. Are there any recommendations around this?

Again, I don't want to sound like a broken record, but I still don't get why
these issues crop up in 6.6 as compared to 5.5.



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html

Re: Solr nodes crashing (OOM) after 6.6 upgrade

Posted by Emir Arnautović <em...@sematext.com>.
It does not have to be query load - it can be one heavy query that causes high memory consumption (heavy faceting, deep paging, …) and after that GC jumps in. Maybe you could start with the logs and see if there are queries that have a large QTime.
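
(Something like this against solr.log is usually enough to spot them, assuming
the default request log format where each query line ends with QTime=<millis>:)

grep "QTime=" solr.log | awk -F'QTime=' '$2+0 > 5000'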

Emir

> On 22 Sep 2017, at 12:00, shamik <sh...@gmail.com> wrote:
> 
> All the tuning and scaling down of memory seemed to be stable for a couple of
> days but then came down due to a huge spike in CPU usage, contributed by G1
> Old Generation GC. I'm really puzzled why the instances are suddenly
> behaving like this. It's not that a sudden surge of load contributed to
> this, query and indexing load seemed to be comparable with the previous time
> frame. Just wondering if the hardware itself is not adequate enough for 6.6.
> The instances are all running on 8 CPU / 30gb m3.2xlarge EC2 instances.
> 
> Does anyone ever face issues similar to this?
> 
> 
> 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Solr nodes crashing (OOM) after 6.6 upgrade

Posted by shamik <sh...@gmail.com>.
After all the tuning and scaling down of memory, things seemed stable for a
couple of days, but then the nodes came down again due to a huge spike in CPU
usage, driven by G1 Old Generation GC. I'm really puzzled why the instances
are suddenly behaving like this. It's not that a sudden surge of load
contributed to this; query and indexing load seemed comparable with the
previous time frame. Just wondering if the hardware itself is not adequate for
6.6. The instances are all running on 8 CPU / 30gb m3.2xlarge EC2 instances.

Does anyone ever face issues similar to this?



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html

Re: Solr nodes crashing (OOM) after 6.6 upgrade

Posted by Susheel Kumar <su...@gmail.com>.
+1. Asking for way more than you need may result in OOM. rows and
facet.limit should be set carefully.

On Tue, Sep 19, 2017 at 1:23 PM, Toke Eskildsen <to...@kb.dk> wrote:

> shamik <sh...@gmail.com> wrote:
> > I've facet.limit=-1 configured for few search types, but facet.mincount
> is
> > always set as 1. Didn't know that's detrimental to doc values.
>
> It is if you have a lot (1000+) of unique values in your facet field,
> especially when you have more than 1 shard. Only ask for the number you
> need. Same goes for rows BTW.
>
> - Toke Eskildsen
>

Re: Solr nodes crashing (OOM) after 6.6 upgrade

Posted by Toke Eskildsen <to...@kb.dk>.
shamik <sh...@gmail.com> wrote:
> I've facet.limit=-1 configured for few search types, but facet.mincount is
> always set as 1. Didn't know that's detrimental to doc values.

It is if you have a lot (1000+) of unique values in your facet field, especially when you have more than 1 shard. Only ask for the number you need. Same goes for rows BTW.
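
As a concrete illustration (the field name is made up): instead of
facet.limit=-1, cap it at what the UI can actually display, e.g.

q=...&rows=10&facet=true&facet.field=category&facet.limit=100&facet.mincount=1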

- Toke Eskildsen

Re: Solr nodes crashing (OOM) after 6.6 upgrade

Posted by shamik <sh...@gmail.com>.
Emir, after digging deeper into the logs (using New Relic/Solr admin) during
the outage, it looks like a combination of query load and the indexing process
triggered it. Based on the earlier pattern, memory would tend to increase at a
steady pace, but then surge all of a sudden, triggering OOM. After I scaled
down the heap size as per Walter's suggestion, the memory seems to have been
holding up. But there's a possibility the lower heap size is making the GC
work harder and use more CPU. The cache sizes have been scaled down; I'm
hoping they're no longer adding overhead after every commit.

I have facet.limit=-1 configured for a few search types, but facet.mincount is
always set to 1. I didn't know that's detrimental with docValues.



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html

Re: Solr nodes crashing (OOM) after 6.6 upgrade

Posted by Emir Arnautović <em...@sematext.com>.
Hi Shamik,
Can you tell us a bit more about how Solr is used before it OOMs. Do you observe some heavy indexing, or does it happen during higher query load? Does memory increase slowly or jump suddenly? Do you have any monitoring tool to see if you can correlate some metric with the memory increase?
You mentioned that you have docValues on the fields used for faceting, but that will not save you if you facet on high-cardinality fields with facet.limit=-1&facet.mincount=0 or something similar.

In the worst case, you can take a heap dump and see what's in it.
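
(For example, with the stock JDK tools -- pid and dump path are placeholders:)

jmap -dump:live,format=b,file=/tmp/solr-heap.hprof <solr-pid>

and then open the dump in something like Eclipse MAT to see which objects
dominate the heap.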

Regards,
Emir

> On 19 Sep 2017, at 10:11, shamik <sh...@gmail.com> wrote:
> 
> Thanks, the change seemed to have addressed the memory issue (so far), but on
> the contrary, the GC chocked the CPUs stalling everything. The CPU
> utilization across the cluster clocked close to 400%, literally stalling
> everything.On a first look, the G1-Old generation looks to be the culprit
> that took up 80% of the CPU. Not sure what triggered really triggered it as
> the GC seemed to have stable till then. The other thing I noticed was the
> mlt queries (I'm using mlt query parser for cloud support) took a huge
> amount of time to respond (10 sec+) during the CPU spike compared to the
> rest. Again, that might just due to the CPU.
> 
> The index might not be a large one to merit a couple of shards, but it has
> never been an issue for past couple of years on 5.5. We never had a single
> outage related to memory or CPU. The query/indexing load has increased over
> time, but it has been linear. I'm little baffled why would 6.6 behave so
> differently. Perhaps the hardware is not adequate enough? I'm running on 8
> core / 30gb machine with SSD.
> 
> 
> 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Solr nodes crashing (OOM) after 6.6 upgrade

Posted by shamik <sh...@gmail.com>.
Thanks, the change seems to have addressed the memory issue (so far), but on
the other hand, the GC choked the CPUs. The CPU utilization across the cluster
clocked close to 400%, literally stalling everything. On first look, the G1
Old generation looks to be the culprit that took up 80% of the CPU. Not sure
what really triggered it, as the GC seemed to have been stable till then. The
other thing I noticed was that the mlt queries (I'm using the mlt query parser
for cloud support) took a huge amount of time to respond (10 sec+) during the
CPU spike compared to the rest. Again, that might just be due to the CPU.

The index might not be large enough to merit a couple of shards, but that has
never been an issue for the past couple of years on 5.5. We never had a single
outage related to memory or CPU. The query/indexing load has increased over
time, but the increase has been linear. I'm a little baffled as to why 6.6
would behave so differently. Perhaps the hardware is not adequate? I'm running
on an 8 core / 30gb machine with SSD.



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html

Re: Solr nodes crashing (OOM) after 6.6 upgrade

Posted by Walter Underwood <wu...@wunderwood.org>.
With frequent commits, autowarming isn’t very useful. Even with a daily bulk update, I use explicit warming queries.

For our textbooks collection, I configure the twenty top queries and the twenty most common words in the index. Neither list changes much. If we used facets, I’d warm those, too.
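
(Roughly, that kind of warming lives in a QuerySenderListener in
solrconfig.xml; the queries below are just placeholders:)

<listener event="firstSearcher" class="solr.QuerySenderListener">
<arr name="queries">
<lst><str name="q">introduction to biology</str><str name="rows">10</str></lst>
<lst><str name="q">organic chemistry</str><str name="rows">10</str></lst>
</arr>
</listener>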

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Sep 19, 2017, at 12:18 AM, Toke Eskildsen <to...@kb.dk> wrote:
> 
> On Mon, 2017-09-18 at 20:47 -0700, shamik wrote:
>> I did bring down the heap size to 8gb, changed to G1 and reduced the
>> cache params. The memory so far has been holding up but will wait for
>> a while before passing on a judgment. 
> 
> Sounds reasonable.
> 
>> <filterCache class="solr.FastLRUCache" size="256" initialSize="256"
>> autowarmCount="0"/>
> [...]
> 
>> The change seemed to have increased the number of slow queries (1000
>> ms), but I'm willing to address the OOM over performance at this
>> point.
> 
> You over-compensated by switching from an enormous cache with excessive
> warming to a small cache with no warming. Try setting autowarmCount to
> 20 or something like that. Also make an explicit warming query that
> facets on all your facet-fields, to initialize the underlying
> structures.
> 
>> One thing I realized is that I provided the wrong index size here.
>> It's 49gb instead of 25, which I mistakenly picked from one shard.
> 
> Quite independent from all of this, your index is not a large one; it
> might work better for you to store it as a single shard (with
> replicas), to avoid the overhead of the distributes processing needed
> for multi-shard. The overhead is especially visible when doing a lot of
> String faceting.
> 
>> I hope the heap size will continue to sustain for the index size. 
> 
> You can check the memory usage in the admin GUI.
> 
> - Toke Eskildsen, Royal Danish Library
> 


Re: Solr nodes crashing (OOM) after 6.6 upgrade

Posted by Toke Eskildsen <to...@kb.dk>.
On Mon, 2017-09-18 at 20:47 -0700, shamik wrote:
> I did bring down the heap size to 8gb, changed to G1 and reduced the
> cache params. The memory so far has been holding up but will wait for
> a while before passing on a judgment. 

Sounds reasonable.

> <filterCache class="solr.FastLRUCache" size="256" initialSize="256"
> autowarmCount="0"/>
[...]

> The change seemed to have increased the number of slow queries (1000
> ms), but I'm willing to address the OOM over performance at this
> point.

You over-compensated by switching from an enormous cache with excessive
warming to a small cache with no warming. Try setting autowarmCount to
20 or something like that. Also make an explicit warming query that
facets on all your facet-fields, to initialize the underlying
structures.
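
(Sketched out, with placeholder field names, that would look something like:)

<filterCache class="solr.FastLRUCache" size="256" initialSize="256"
autowarmCount="20"/>

<listener event="newSearcher" class="solr.QuerySenderListener">
<arr name="queries">
<lst>
<str name="q">*:*</str><str name="rows">0</str>
<str name="facet">true</str>
<str name="facet.field">category</str>
<str name="facet.field">product</str>
</lst>
</arr>
</listener>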

>  One thing I realized is that I provided the wrong index size here.
> It's 49gb instead of 25, which I mistakenly picked from one shard.

Quite independent from all of this, your index is not a large one; it
might work better for you to store it as a single shard (with
replicas), to avoid the overhead of the distributed processing needed
for multi-shard. The overhead is especially visible when doing a lot of
String faceting.

>  I hope the heap size will continue to sustain for the index size. 

You can check the memory usage in the admin GUI.

- Toke Eskildsen, Royal Danish Library


Re: Solr nodes crashing (OOM) after 6.6 upgrade

Posted by shamik <sh...@gmail.com>.
I agree, I should have made it clear in my initial post. The reason I thought
it was fairly trivial is that the newly introduced collection has only a few
hundred documents and is not being used in search yet. Nor is it being indexed
at a regular interval. The cache parameters are kept to a minimum as well. But
there might be overheads of simply creating a collection that I'm not aware
of.

I did bring down the heap size to 8gb, changed to G1 and reduced the cache
params. The memory so far has been holding up, but I will wait for a while
before passing judgment.

<filterCache class="solr.FastLRUCache" size="256" initialSize="256"
autowarmCount="0"/>
<queryResultCache class="solr.LRUCache" size="256" initialSize="256"
autowarmCount="0"/>
<documentCache class="solr.LRUCache" size="256" initialSize="256"
autowarmCount="0"/>
<cache name="perSegFilter" class="solr.search.LRUCache" size="10"
initialSize="0" autowarmCount="10" regenerator="solr.NoOpRegenerator" />
<fieldValueCache class="solr.FastLRUCache" size="256" autowarmCount="256"
showItems="0" />

The change seems to have increased the number of slow queries (1000 ms), but
I'm willing to trade some performance to address the OOM at this point. One
thing I realized is that I provided the wrong index size here. It's 49gb
instead of 25, which I mistakenly picked from one shard. I hope the heap size
will continue to hold up for that index size.



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html

Re: Solr nodes crashing (OOM) after 6.6 upgrade

Posted by Erick Erickson <er...@gmail.com>.
Shamik:

bq: The part I'm trying to understand is whether the memory footprint
is higher for 6.6...

bq:  it has two collections, one being introduced with 6.6 upgrade

If I'm reading this right, you added another collection to the system
as part of the upgrade. Of course it will take more memory. Especially
if your new collection is configured to, say, inefficiently use
caches, or you group or sort or facet on fields that are not
docValues. Or.....

That information would have saved people quite a bit of time if you'd
posted it first.

Best,
Erick

On Mon, Sep 18, 2017 at 9:03 AM, shamik <sh...@gmail.com> wrote:
> Walter, thanks again. Here's some information on the index and search
> feature.
>
> The index size is close to 25gb, with 20 million documents. it has two
> collections, one being introduced with 6.6 upgrade. The primary collection
> carries the bulk of the index, newly formed one being aimed at getting
> populated going forward. Besides keyword search, the search has a bunch of
> facets, which are configured to use docvalues. The notable search features
> being used are highlighter, query elevation, mlt and suggester. The other
> change from 5.5 was to replace Porter Stemmer with Lemmatizer in the
> analysis channel.
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html

Re: Solr nodes crashing (OOM) after 6.6 upgrade

Posted by shamik <sh...@gmail.com>.
Walter, thanks again. Here's some information on the index and search
feature.

The index size is close to 25gb, with 20 million documents. It has two
collections, one being introduced with the 6.6 upgrade. The primary collection
carries the bulk of the index, with the newly formed one aimed at being
populated going forward. Besides keyword search, the search uses a bunch of
facets, which are configured to use docValues. The notable search features
being used are the highlighter, query elevation, mlt and the suggester. The
other change from 5.5 was to replace the Porter stemmer with a lemmatizer in
the analysis chain.



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html

Re: Solr nodes crashing (OOM) after 6.6 upgrade

Posted by shamik <sh...@gmail.com>.
Thanks for your suggestion, I'm going to tune it and bring it down. It just
happened to carry over from the 5.5 settings. Based on Walter's suggestion,
I'm going to reduce the heap size and see if it addresses the problem.



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html

Re: Solr nodes crashing (OOM) after 6.6 upgrade

Posted by Joe Obernberger <jo...@gmail.com>.
Very nice article - thank you!  Is there a similar article available 
when the index is on HDFS?  Sorry to hijack!  I'm very interested in how 
we can improve cache/general performance when running with HDFS.

-Joe


On 9/18/2017 11:35 AM, Erick Erickson wrote:
> <filterCache class="solr.FastLRUCache" size="20000" initialSize="4096"
> autowarmCount="512"/>
>
> This is suspicious too. Each entry is up to about
> maxDoc/8 bytes + (string size of fq clause) long
> and you can have up to 20,000 of them. An autowarm count of 512 is
> almost never  a good thing.
>
> Walter's comments about your memory are spot on of course, see:
> http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>
> Best,
> Erick
>
> On Mon, Sep 18, 2017 at 7:59 AM, Walter Underwood <wu...@wunderwood.org> wrote:
>> 29G on a 30G machine is still a bad config. That leaves no space for the OS, file buffers, or any other processes.
>>
>> Try with 8G.
>>
>> Also, give us some information about the number of docs, size of the indexes, and the kinds of search features you are using.
>>
>> wunder
>> Walter Underwood
>> wunder@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>>
>>
>>> On Sep 18, 2017, at 7:55 AM, shamik <sh...@gmail.com> wrote:
>>>
>>> Apologies, 290gb was a typo on my end, it should read 29gb instead. I started
>>> with my 5.5 configurations of limiting the RAM to 15gb. But it started going
>>> down once it reached the 15gb ceiling. I tried bumping it up to 29gb since
>>> memory seemed to stabilize at 22gb after running for few hours, of course,
>>> it didn't help eventually. I did try the G1 collector. Though garbage
>>> collection was happening more efficiently compared to CMS, it brought the
>>> nodes down after a while.
>>>
>>> The part I'm trying to understand is whether the memory footprint is higher
>>> for 6.6 and whether I need an instance with higher ram (>30gb in my case). I
>>> haven't added any post 5.5 feature to rule out the possibility of a memory
>>> leak.
>>>
>>>
>>>
>>> --
>>> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Solr nodes crashing (OOM) after 6.6 upgrade

Posted by Erick Erickson <er...@gmail.com>.
<filterCache class="solr.FastLRUCache" size="20000" initialSize="4096"
autowarmCount="512"/>

This is suspicious too. Each entry is up to about
maxDoc/8 bytes + (string size of the fq clause) long,
and you can have up to 20,000 of them. An autowarm count of 512 is
almost never a good thing.
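
(Rough arithmetic with the numbers mentioned elsewhere in this thread: 15-20
million documents across two shards is roughly 7-10 million maxDoc per core,
so each filterCache entry can be up to about 1 MB; a full cache of 20,000
entries would be on the order of 20 GB of heap in the worst case, before
counting the autowarm work done on every commit.)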

Walter's comments about your memory are spot on of course, see:
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

Best,
Erick

On Mon, Sep 18, 2017 at 7:59 AM, Walter Underwood <wu...@wunderwood.org> wrote:
> 29G on a 30G machine is still a bad config. That leaves no space for the OS, file buffers, or any other processes.
>
> Try with 8G.
>
> Also, give us some information about the number of docs, size of the indexes, and the kinds of search features you are using.
>
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
>> On Sep 18, 2017, at 7:55 AM, shamik <sh...@gmail.com> wrote:
>>
>> Apologies, 290gb was a typo on my end, it should read 29gb instead. I started
>> with my 5.5 configurations of limiting the RAM to 15gb. But it started going
>> down once it reached the 15gb ceiling. I tried bumping it up to 29gb since
>> memory seemed to stabilize at 22gb after running for few hours, of course,
>> it didn't help eventually. I did try the G1 collector. Though garbage
>> collection was happening more efficiently compared to CMS, it brought the
>> nodes down after a while.
>>
>> The part I'm trying to understand is whether the memory footprint is higher
>> for 6.6 and whether I need an instance with higher ram (>30gb in my case). I
>> haven't added any post 5.5 feature to rule out the possibility of a memory
>> leak.
>>
>>
>>
>> --
>> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>

Re: Solr nodes crashing (OOM) after 6.6 upgrade

Posted by Walter Underwood <wu...@wunderwood.org>.
29G on a 30G machine is still a bad config. That leaves no space for the OS, file buffers, or any other processes.

Try with 8G.

Also, give us some information about the number of docs, size of the indexes, and the kinds of search features you are using.

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Sep 18, 2017, at 7:55 AM, shamik <sh...@gmail.com> wrote:
> 
> Apologies, 290gb was a typo on my end, it should read 29gb instead. I started
> with my 5.5 configurations of limiting the RAM to 15gb. But it started going
> down once it reached the 15gb ceiling. I tried bumping it up to 29gb since
> memory seemed to stabilize at 22gb after running for few hours, of course,
> it didn't help eventually. I did try the G1 collector. Though garbage
> collection was happening more efficiently compared to CMS, it brought the
> nodes down after a while.
> 
> The part I'm trying to understand is whether the memory footprint is higher
> for 6.6 and whether I need an instance with higher ram (>30gb in my case). I
> haven't added any post 5.5 feature to rule out the possibility of a memory
> leak.
> 
> 
> 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Solr nodes crashing (OOM) after 6.6 upgrade

Posted by shamik <sh...@gmail.com>.
Apologies, 290gb was a typo on my end; it should read 29gb instead. I started
with my 5.5 configuration of limiting the heap to 15gb, but nodes started
going down once the heap reached the 15gb ceiling. I tried bumping it up to
29gb since memory seemed to stabilize at 22gb after running for a few hours,
but of course it didn't help in the end. I did try the G1 collector. Though
garbage collection was happening more efficiently compared to CMS, it still
brought the nodes down after a while.

The part I'm trying to understand is whether the memory footprint is higher
in 6.6 and whether I need an instance with more RAM (>30gb in my case). I
haven't added any post-5.5 features, which should rule out the possibility of
a memory leak from new functionality.



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html

Re: Solr nodes crashing (OOM) after 6.6 upgrade

Posted by Walter Underwood <wu...@wunderwood.org>.
You are running with a 290 Gb heap (!!!!) on a 30 Gb machine. That is the worst Java config I have ever seen.

Use this:

SOLR_JAVA_MEM="-Xms8g -Xmx8g”

That starts with an 8 Gb heap and stays there.

Also, you might think about simplifying the GC configuration. Or if you are on a recent release of Java 8, using the G1 collector. We’re getting great performance with this config:

SOLR_HEAP=8g
# Use G1 GC  -- wunder 2017-01-23
# Settings from https://wiki.apache.org/solr/ShawnHeisey
GC_TUNE=" \
-XX:+UseG1GC \
-XX:+ParallelRefProcEnabled \
-XX:G1HeapRegionSize=8m \
-XX:MaxGCPauseMillis=200 \
-XX:+UseLargePages \
-XX:+AggressiveOpts \
"

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Sep 18, 2017, at 7:24 AM, Shamik Bandopadhyay <sh...@gmail.com> wrote:
> 
> Hi,
> 
>   I recently upgraded to Solr 6.6 from 5.5. After running for a couple of
> days, the entire Solr cluster suddenly came down with OOM exception. Once
> the servers are being restarted, the memory footprint stays stable for a
> while before the sudden spike in memory occurs. The heap surges up quickly
> and hits the max causing the JVM to shut down due to OOM. It starts with
> one server but eventually trickles downs to the rest of the nodes, bringing
> the entire cluster down within a span of 10-15 mins.
> 
> The cluster consists of 6 nodes with two shards having 2 replicas each.
> There are two collections with total index size close to 24 gb. Each server
> has 8 CPUs with 30gb memory. Solr is running on an embedded jetty on jdk
> 1.8. The JVM parameters are identical to 5.5:
> 
> SOLR_JAVA_MEM="-Xms1000m -Xmx290000m"
> 
> GC_LOG_OPTS="-verbose:gc -XX:+PrintHeapAtGC -XX:+PrintGCDetails \
>  -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps
> -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime"
> 
> GC_TUNE="-XX:NewRatio=3 \
> -XX:SurvivorRatio=4 \
> -XX:TargetSurvivorRatio=90 \
> -XX:MaxTenuringThreshold=8 \
> -XX:+UseConcMarkSweepGC \
> -XX:+UseParNewGC \
> -XX:ConcGCThreads=4 -XX:ParallelGCThreads=4 \
> -XX:+CMSScavengeBeforeRemark \
> -XX:PretenureSizeThreshold=64m \
> -XX:+UseCMSInitiatingOccupancyOnly \
> -XX:CMSInitiatingOccupancyFraction=50 \
> -XX:CMSMaxAbortablePrecleanTime=6000 \
> -XX:+CMSParallelRemarkEnabled \
> -XX:+ParallelRefProcEnabled"
> 
> I've tried G1GC based on Shawn's WIKI, but didn't make any difference.
> Though G1GC seemed to do well with GC initially, it showed similar
> behaviour during the spike. It prompted me to revert back to CMS.
> 
> I'm doing a hard commit every 5 mins.
> 
> SOLR_OPTS="$SOLR_OPTS -Xss256k"
> SOLR_OPTS="$SOLR_OPTS -Dsolr.autoCommit.maxTime=300000"
> SOLR_OPTS="$SOLR_OPTS -Dsolr.clustering.enabled=true"
> SOLR_OPTS="$SOLR_OPTS -Dpkiauth.ttl=120000"
> 
> Othe Solr configurations:
> 
> <autoSoftCommit>
> <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
> </autoSoftCommit>
> 
> Cache settings:
> 
> <maxBooleanClauses>4096</maxBooleanClauses>
> <slowQueryThresholdMillis>1000</slowQueryThresholdMillis>
> <filterCache class="solr.FastLRUCache" size="20000" initialSize="4096"
> autowarmCount="512"/>
> <queryResultCache class="solr.LRUCache" size="2000" initialSize="500"
> autowarmCount="100"/>
> <documentCache class="solr.LRUCache" size="60000" initialSize="5000"
> autowarmCount="0"/>
> <cache name="perSegFilter" class="solr.search.LRUCache" size="10"
> initialSize="0" autowarmCount="10" regenerator="solr.NoOpRegenerator" />
> <fieldValueCache class="solr.FastLRUCache" size="20000"
> autowarmCount="4096" showItems="1024" />
> <cache enable="${solr.ltr.enabled:false}" name="QUERY_DOC_FV"
> class="solr.search.LRUCache" size="4096" initialSize="2048"
> autowarmCount="4096" regenerator="solr.search.NoOpRegenerator" />
> <enableLazyFieldLoading>true</enableLazyFieldLoading>
> <queryResultWindowSize>200</queryResultWindowSize>
> <queryResultMaxDocsCached>400</queryResultMaxDocsCached>
> 
> I'm not sure what has changed so drastically in 6.6 compared to 5.5. I
> never had a single OOM in 5.5 which has been running for a couple of years.
> Moreover, the memory footprint was much less with 15gb set as Xmx. All my
> facet parameters have docvalues enabled, it should handle the memory part
> efficiently.
> 
> I'm struggling to figure out the root cause. Does 6.6 command more memory
> than what is currently available on our servers (30gb)? What might be the
> probable cause for this sort of scenario? What are the best practices to
> troubleshoot such issues?
> 
> Any pointers will be appreciated.
> 
> Thanks,
> Shamik