Posted to solr-user@lucene.apache.org by Robert Petersen <ro...@buy.com> on 2010/12/01 00:04:14 UTC

entire farm fails at the same time with OOM issues

Greetings, we are running one master and four slaves of our multicore
Solr setup.  We just served searches for our catalog of 8 million
products with this farm during Black Friday and Cyber Monday, our
busiest days of the year, and the servers did not break a sweat!  Index
size is about 28GB.

However, twice recently, during a time of low load, we have had a fire
drill where I have seen Tomcat/Solr fail and become unresponsive after
some OOM heap errors.  Solr wouldn't even serve up its admin pages.
I've had to go in and manually kill the Tomcat process and then
restart it.  These Solr slaves are load balanced, and the load balancers
continually probe them, so if they stop serving searches they
are automatically removed from the load balancer.  When all four fail at
the same time we have an issue!

My question is this.  Why in the world would all of my slaves, after
running fine for some days, suddenly all at the exact same minute
experience OOM heap errors and go dead?  The load balancer kicks them
all out at the same time each time.  Each slave only talks to the master
and not to each other, but the master shows no errors in the logs at all.
Something must be triggering this though.  The only other odd thing I
saw in the logs was that, after the first OOM errors were recorded, the
slaves started occasionally not being able to reach the master.

This behavior makes me a little nervous...    =:-o  eek!

Environment:  Lucid Imagination distro of Solr 1.4 on Tomcat

Platform: RHEL with Sun JRE 1.6.0_18 on dual quad-core Xeon machines with
64GB memory etc etc

Re: entire farm fails at the same time with OOM issues

Posted by Ken Krugler <kk...@transpac.com>.

On Nov 30, 2010, at 5:16pm, Robert Petersen wrote:

> What would I do with the heap dump though?  Run one of those Java heap
> analyzers looking for memory leaks or something?  I have no experience
> with those.  I saw there was a bug fix in Solr 1.4.1 for a 100-byte
> memory leak occurring on each commit, but it would take thousands of
> commits to make that add up to anything, right?

Typically when I run out of memory in Solr, it's during an index  
update, when the new index searcher is getting warmed up.

Looking at the heap often shows ways to reduce memory requirements,  
e.g. you'll see a really big chunk used for a sorted field.

See http://wiki.apache.org/solr/SolrCaching and
http://wiki.apache.org/solr/SolrPerformanceFactors for more details.
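
For a quick look at what is filling the heap before committing to a full
dump analysis, a class histogram of the live JVM can help.  The sketch below
assumes a full JDK is installed (jmap ships with the JDK, not the JRE) and
that you know the Tomcat process id; the function name is made up for
illustration.  Note that -histo:live forces a full GC before counting.

```shell
# Print the top of a live-object class histogram for a running JVM.
# Usage: heap_histogram <tomcat-pid>
# heap_histogram is a hypothetical helper name, not part of any tool.
heap_histogram() {
  jmap -histo:live "$1" | head -n 25
}
```

If large arrays dominate the top of the list, that would be consistent with
the field cache growing for a sorted field, but the histogram alone will not
say which field is responsible.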

-- Ken



--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g

RE: entire farm fails at the same time with OOM issues

Posted by Robert Petersen <ro...@buy.com>.

What would I do with the heap dump though?  Run one of those Java heap
analyzers looking for memory leaks or something?  I have no experience
with those.  I saw there was a bug fix in Solr 1.4.1 for a 100-byte memory
leak occurring on each commit, but it would take thousands of commits to
make that add up to anything, right?


Re: entire farm fails at the same time with OOM issues

Posted by Ken Krugler <kk...@transpac.com>.

Hi Robert,

I'd recommend launching Tomcat with -XX:+HeapDumpOnOutOfMemoryError  
and -XX:HeapDumpPath=<path to where you want the file to go>, so then  
you have something to look at versus a Gedankenexperiment :)
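
As a rough sketch of what that could look like, assuming Tomcat picks up
JAVA_OPTS from the environment (for example via bin/setenv.sh) and using a
made-up dump directory:

```shell
# Sketch: add heap-dump-on-OOM flags to Tomcat's JVM options.
# DUMP_DIR is a hypothetical location; it must already exist, be writable
# by the Tomcat user, and have room for a file about the size of the heap.
DUMP_DIR="${DUMP_DIR:-/var/tmp/solr-heapdumps}"

# Append to whatever JAVA_OPTS is already set rather than replacing it.
JAVA_OPTS="${JAVA_OPTS:-} -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=${DUMP_DIR}"
export JAVA_OPTS

echo "$JAVA_OPTS"
```

When HeapDumpPath names a directory, the JVM writes a java_pid<pid>.hprof
file into it, which can then be opened with a heap analyzer.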

-- Ken


--------------------------------------------
<http://ken-blog.krugler.org>
+1 530-265-2225






--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g

RE: entire farm fails at the same time with OOM issues

Posted by Robert Petersen <ro...@buy.com>.

Good idea.  Our farm is behind Akamai so that should be ok to do.


Re: entire farm fails at the same time with OOM issues

Posted by Peter Karich <pe...@yahoo.de>.

Also try to reduce maxWarmingSearchers to 1(?) or 2.
And decrease cache usage (especially autowarming) if at all possible.
But again: only if it doesn't affect performance ...
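
To see what a core is currently configured with, something like the
following could be run against its solrconfig.xml.  This is only a sketch:
the function name is made up, and the path to the file depends on your
multicore layout.

```shell
# Print the warming-related settings found in a solrconfig.xml, with
# line numbers. show_warming_settings is a hypothetical helper name.
# Usage: show_warming_settings /path/to/core/conf/solrconfig.xml
show_warming_settings() {
  grep -n -E 'maxWarmingSearchers|autowarmCount' "$1"
}
```

The maxWarmingSearchers element and the autowarmCount attributes on the
cache elements are the values to lower.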

Regards,
Peter.


RE: entire farm fails at the same time with OOM issues

Posted by Chris Hostetter <ho...@fucit.org>.

I'm not sure if you resolved this issue, but...

: It has typically been when query traffic was lowest!  We are at 12 GB 

...that doesn't mean it couldn't have been query-load related.  It's
possible that some unusual query (i.e. trying to sort on many fields at
the same time?) could have forced the memory usage to spike (because of
the field cache).  Depending on how your load balancer is set up, the OOM on
one box could have caused it to fail over to the next box, which also
OOMed, etc...

The really annoying part is how hard this sort of thing is to detect,
because your servlet container's request log usually won't log a request
until after it's finished and all the data has been written back to the
client -- it may never have been logged because of the OOM.

If your load balancer keeps a request log, you could try checking it.  This
could be something as simple as a bot doing a slow crawl of some very
badly constructed URLs.
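
One way to start looking, sketched under the assumption that the load
balancer log is a plain text file with the request URL on each line (the
function name and the log format are made up for illustration):

```shell
# List logged request lines whose sort parameter names three or more
# fields, i.e. contains at least two commas (raw or URL-encoded as %2C).
# find_multi_sort_requests is a hypothetical helper name.
# Usage: find_multi_sort_requests /path/to/loadbalancer.log
find_multi_sort_requests() {
  grep -i -E 'sort=[^& ]*(,|%2C)[^& ]*(,|%2C)' "$1"
}
```

Any hits timestamped shortly before the first OOM would be worth a closer
look.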


-Hoss

RE: entire farm fails at the same time with OOM issues

Posted by Robert Petersen <ro...@buy.com>.

It has typically been when query traffic was lowest!  We are at 12 GB heap, so I will try to bump it to 14 GB.  We have 64GB main memory installed now.  Here are our settings; do these look OK?

export JAVA_OPTS="-Xmx12228m -Xms12228m -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode"
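
For the 14 GB bump, a small sketch of the arithmetic, keeping the same GC
flags as above.  The heap_mb helper is made up for illustration.  (For
reference, 12 GB works out to 12288 MB, so the 12228m above is slightly
under 12 GB.)

```shell
# Convert a heap size in GB to the megabyte value used by -Xmx/-Xms.
# heap_mb is a hypothetical helper name.
heap_mb() {
  echo $(( $1 * 1024 ))
}

HEAP_MB=$(heap_mb 14)   # 14 * 1024 = 14336
JAVA_OPTS="-Xmx${HEAP_MB}m -Xms${HEAP_MB}m -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode"
export JAVA_OPTS

echo "$JAVA_OPTS"
```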


Re: entire farm fails at the same time with OOM issues

Posted by Yonik Seeley <yo...@lucidimagination.com>.

On Tue, Nov 30, 2010 at 6:04 PM, Robert Petersen <ro...@buy.com> wrote:
> My question is this.  Why in the world would all of my slaves, after
> running fine for some days, suddenly all at the exact same minute
> experience OOM heap errors and go dead?

If there is no change in query traffic when this happens, then it's
due to what the index looks like.

My guess is a large index merge happened, which means that when the
searchers re-open on the new index, it requires more memory than
normal (much less can be shared with the previous index).

I'd try bumping the heap a little bit, and then optimizing once a day
during off-peak hours.
If you still get OOM errors, bump the heap a little more.
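
A minimal sketch of the daily optimize, assuming it is sent to the master
so the slaves pick up the optimized index through replication.  The URL,
port, core name, and script path are all placeholders, and optimize_core is
a made-up function name.

```shell
# Hypothetical master URL and core name; adjust for the real setup.
SOLR_URL="${SOLR_URL:-http://localhost:8080/solr/core0}"

# Send an <optimize/> message to the core's XML update handler.
optimize_core() {
  curl --silent --show-error --fail \
       -H 'Content-Type: text/xml' \
       --data-binary '<optimize/>' \
       "${SOLR_URL}/update"
}

# Example crontab entry (03:00 every day), assuming this file is saved
# as /usr/local/bin/solr-optimize.sh and ends with a call to
# optimize_core:
#   0 3 * * * /usr/local/bin/solr-optimize.sh
```

With a multicore setup the call would need to be repeated once per core.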

-Yonik
http://www.lucidimagination.com