Posted to solr-user@lucene.apache.org by Simon Wistow <si...@thegestalt.org> on 2011/01/24 19:32:38 UTC

Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene

We have two slaves replicating off one master every 2 minutes.

Both using the CMS + ParNew Garbage collector. Specifically

-server -XX:+UseConcMarkSweepGC -XX:+UseParNewGC 
-XX:+CMSIncrementalMode -XX:+CMSIncrementalPacing

but periodically they both get into a GC storm and just keel over.

Looking through the GC logs the amount of memory reclaimed in each GC 
run gets less and less until we get a concurrent mode failure and then 
Solr effectively dies.

Is it possible there's a memory leak? I note that later versions of 
Lucene have fixed a few leaks. Our current versions are relatively old:

	Solr Implementation Version: 1.4.1 955763M - mark - 2010-06-17 18:06:42

	Lucene Implementation Version: 2.9.3 951790 - 2010-06-06 01:30:55

so I'm wondering if upgrading to a later version of Lucene might help (of 
course it might not, but I'm trying to investigate all options at this 
point). If so, what's the best way to go about this? Can I just grab the 
Lucene jars and drop them somewhere (or unpack and then repack the Solr 
war file)? Or should I use a nightly Solr 1.4 build?

Or am I barking up completely the wrong tree? I'm trawling through heap 
logs and GC logs at the moment trying to see what other tuning I can 
do, but any other hints, tips, tricks or cluebats are gratefully received. 
Even if it's just "Yeah, we had that problem and we added more slaves 
and periodically restarted them"

thanks,

Simon

Re: Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene

Posted by Simon Wistow <si...@thegestalt.org>.
On Wed, Apr 06, 2011 at 12:05:57AM +0200, Jan Høydahl said:
> Just curious, was there any resolution to this?

Not really.

We tuned the GC pretty aggressively - we use these options:

-server 
-Xmx20G -Xms20G -Xss10M
-XX:+UseConcMarkSweepGC 
-XX:+UseParNewGC 
-XX:+CMSIncrementalMode 
-XX:+CMSIncrementalPacing
-XX:SoftRefLRUPolicyMSPerMB=10

and we've played a little with CompressOops and AggressiveOpts.
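
For reference, those experiments amount to flags along these lines (whether 
they help depends on the JVM version, so treat this as a sketch rather than 
our exact command line):

-XX:+UseCompressedOops 
-XX:+AggressiveOpts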

We also backported the MMapDirectory factory to 1.4.1 and that helped a 
lot.
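
For anyone trying the same thing, wiring that in is a solrconfig.xml change 
roughly like the following - the exact class name depends on how the backport 
registers its factory, so this is illustrative rather than a drop-in config:

<directoryFactory name="DirectoryFactory" class="solr.MMapDirectoryFactory"/>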

We do still get spikes of long queries (5s-20s) a few times an hour 
which don't appear to be caused by any kind of "Query of Death". 
Occasionally (once every few days) one of the slaves will experience a 
period of sustained slowness but recover by itself in less than a 
minute.

According to our GC logs we haven't had a full GC for a long time. 

Currently the state of play is that we commit on our master every 5000ms 
and the slaves replicate every 2 minutes. Our response times for 
searches on the slaves are about 180-270ms, but if we turn off 
replication then we get 60-90ms. So something is clearly "up" with that.
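
For context, the slave-side polling lives in the ReplicationHandler config and 
looks roughly like this (the master URL is a placeholder):

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master.example.com:8983/solr/replication</str>
    <str name="pollInterval">00:02:00</str>
  </lst>
</requestHandler>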

Having talked to the good people at Lucid, we're going to try playing 
around with commit intervals, upping our mergeFactor from 10 to 25 and 
maybe using the BalancedSegmentMergePolicy. 
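
In solrconfig.xml terms that experiment would look something like the 
following - the BalancedSegmentMergePolicy ships in Lucene's contrib area, so 
the exact class name and element form should be double-checked against the 
version in use:

<indexDefaults>
  <mergeFactor>25</mergeFactor>
  <mergePolicy>org.apache.lucene.index.BalancedSegmentMergePolicy</mergePolicy>
</indexDefaults>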

The system seems to be stable at the moment which is good but obviously 
we'd like to lower our query times if possible.

Hopefully this might be of some use to somebody out there, sometime.

Simon



Re: Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene

Posted by Jan Høydahl <ja...@cominvent.com>.
Hi,

Just curious, was there any resolution to this?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 8. feb. 2011, at 03.40, Markus Jelsma wrote:

> Do you have GC logging enabled? Tail -f the log file and you'll see what CMS is 
> telling you. Tuning the occupation fraction of the tenured generation to a 
> lower value than default and telling the JVM to only use your value to 
> initiate a collection can help a lot. The same goes for sizing the young 
> generation and sometimes the survivor ratio.
> 
> Consult the HotSpot CMS settings and young generation (or new) sizes. They are 
> very important.
> 
> If you have multiple slaves under the same load you can easily try different 
> configurations. Keeping an eye on the nodes with a tool like JConsole and at 
> the same time tailing the GC log will help a lot. Don't forget to also send 
> updates and frequent commits, or you won't be able to replay the problem. I've 
> never seen a Solr 
> instance go down under heavy load and without commits but they tend to behave 
> badly when commits occur while under heavy load with long cache warming times 
> (and heap consumption).
> 
> You might also be suffering from memory fragmentation; this is bad and can lead 
> to failure. You can configure the JVM to force a compaction before a GC, which 
> is nice but does consume CPU time.
> 
> A query of death can, in theory, also happen when you sort on a very large 
> dataset that isn't optimized; in this case the maxDoc value is too high.
> 
> Anyway, try some settings and monitor the nodes and please report your 
> findings.
> 
>> On Mon, Feb 07, 2011 at 02:06:00PM +0100, Markus Jelsma said:
>>> Heap usage can spike after a commit. Existing caches are still in use and
>>> new caches are being generated and/or auto warmed. Can you confirm this
>>> is the case?
>> 
>> We see spikes after replication which I suspect is, as you say, because
>> of the ensuing commit.
>> 
>> What we seem to have found is that when we weren't using the Concurrent
>> GC stop-the-world gc runs would kill the app. Now that we're using CMS
>> we occasionally find ourselves in situations where the app still has
>> memory "left over" but the load on the machine spikes, the GC duty cycle
>> goes to 100% and the app never recovers.
>> 
>> Restarting usually helps but sometimes we have to take the machine out
>> of the load balancer, wait for a number of minutes and then put it back
>> in.
>> 
>> We're working on two hypotheses
>> 
>> Firstly - we're CPU bound somehow and that at some point we cross some
>> threshold and GC or something else is just unable to keep up. So
>> whilst it looks like instantaneous death of the app it's actually
>> gradual resource exhaustion where the definition of 'gradual' is 'a very
>> short period of time' (as opposed to some cataclysmic infinite loop bug
>> somewhere).
>> 
>> Either that or ... Secondly - there's some sort of Query Of Death that
>> kills machines. We just haven't found it yet, even when replaying logs.
>> 
>> Or some combination of both. Or other things. It's maddeningly
>> frustrating.
>> 
>> We've also got to try deploying a custom solr.war and using the
>> MMapDirectory to see if that helps with anything.


Re: Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene

Posted by Markus Jelsma <ma...@openindex.io>.
Do you have GC logging enabled? Tail -f the log file and you'll see what CMS is 
telling you. Tuning the occupation fraction of the tenured generation to a 
lower value than default and telling the JVM to only use your value to 
initiate a collection can help a lot. The same goes for sizing the young 
generation and sometimes the survivor ratio.
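
Concretely, that usually means flags along these lines - the numbers are only 
starting points, not recommendations, and the log path is a placeholder:

-verbose:gc -Xloggc:/var/log/solr/gc.log 
-XX:+PrintGCDetails -XX:+PrintGCDateStamps 
-XX:CMSInitiatingOccupancyFraction=60 -XX:+UseCMSInitiatingOccupancyOnly 
-Xmn2g -XX:SurvivorRatio=6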

Consult the HotSpot CMS settings and young generation (or new) sizes. They are 
very important.

If you have multiple slaves under the same load you can easily try different 
configurations. Keeping an eye on the nodes with a tool like JConsole and at 
the same time tailing the GC log will help a lot. Don't forget to also send 
updates and frequent commits, or you won't be able to replay the problem. I've 
never seen a Solr 
instance go down under heavy load and without commits but they tend to behave 
badly when commits occur while under heavy load with long cache warming times 
(and heap consumption).

You might also be suffering from memory fragmentation; this is bad and can lead 
to failure. You can configure the JVM to force a compaction before a GC, which 
is nice but does consume CPU time.
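
The usual knobs for that are something like the following; whether they 
actually help depends heavily on the workload:

-XX:+UseCMSCompactAtFullCollection 
-XX:CMSFullGCsBeforeCompaction=1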

A query of death can, in theory, also happen when you sort on a very large 
dataset that isn't optimized; in this case the maxDoc value is too high.

Anyway, try some settings and monitor the nodes and please report your 
findings.

> On Mon, Feb 07, 2011 at 02:06:00PM +0100, Markus Jelsma said:
> > Heap usage can spike after a commit. Existing caches are still in use and
> > new caches are being generated and/or auto warmed. Can you confirm this
> > is the case?
> 
> We see spikes after replication which I suspect is, as you say, because
> of the ensuing commit.
> 
> What we seem to have found is that when we weren't using the Concurrent
> GC stop-the-world gc runs would kill the app. Now that we're using CMS
> we occasionally find ourselves in situations where the app still has
> memory "left over" but the load on the machine spikes, the GC duty cycle
> goes to 100% and the app never recovers.
> 
> Restarting usually helps but sometimes we have to take the machine out
> of the load balancer, wait for a number of minutes and then put it back
> in.
> 
> We're working on two hypotheses
> 
> Firstly - we're CPU bound somehow and that at some point we cross some
> threshold and GC or something else is just unable to keep up. So
> whilst it looks like instantaneous death of the app it's actually
> gradual resource exhaustion where the definition of 'gradual' is 'a very
> short period of time' (as opposed to some cataclysmic infinite loop bug
> somewhere).
> 
> Either that or ... Secondly - there's some sort of Query Of Death that
> kills machines. We just haven't found it yet, even when replaying logs.
> 
> Or some combination of both. Or other things. It's maddeningly
> frustrating.
> 
> We've also got to try deploying a custom solr.war and using the
> MMapDirectory to see if that helps with anything.

Re: Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene

Posted by Simon Wistow <si...@thegestalt.org>.
On Mon, Feb 07, 2011 at 02:06:00PM +0100, Markus Jelsma said:
> Heap usage can spike after a commit. Existing caches are still in use and new 
> caches are being generated and/or auto warmed. Can you confirm this is the 
> case?

We see spikes after replication which I suspect is, as you say, because 
of the ensuing commit.

What we seem to have found is that when we weren't using the Concurrent 
GC stop-the-world gc runs would kill the app. Now that we're using CMS 
we occasionally find ourselves in situations where the app still has 
memory "left over" but the load on the machine spikes, the GC duty cycle 
goes to 100% and the app never recovers.

Restarting usually helps but sometimes we have to take the machine out 
of the load balancer, wait for a number of minutes and then put it back 
in.

We're working on two hypotheses 

Firstly - we're CPU bound somehow and that at some point we cross some 
threshold and GC or something else is just unable to keep up. So 
whilst it looks like instantaneous death of the app it's actually 
gradual resource exhaustion where the definition of 'gradual' is 'a very 
short period of time' (as opposed to some cataclysmic infinite loop bug 
somewhere).

Either that or ... Secondly - there's some sort of Query Of Death that 
kills machines. We just haven't found it yet, even when replaying logs. 

Or some combination of both. Or other things. It's maddeningly 
frustrating.

We've also got to try deploying a custom solr.war and using the 
MMapDirectory to see if that helps with anything.






Re: Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene

Posted by Markus Jelsma <ma...@openindex.io>.
Heap usage can spike after a commit. Existing caches are still in use and new 
caches are being generated and/or auto warmed. Can you confirm this is the 
case?
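
For context, the caches in question are the ones defined in solrconfig.xml; 
the sizes below are only illustrative. While a new searcher autowarms, the old 
searcher's caches are still live, so heap usage for those caches can roughly 
double until the swap completes:

<filterCache class="solr.FastLRUCache"
             size="16384"
             initialSize="4096"
             autowarmCount="1024"/>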

On Friday 28 January 2011 00:34:42 Simon Wistow wrote:
> On Tue, Jan 25, 2011 at 01:28:16PM +0100, Markus Jelsma said:
> > Are you sure you need CMS incremental mode? It's only advised when
> > running on a machine with one or two processors. If you have more you
> > should consider disabling the incremental flags.
> 
> I'll test again but we added those to get better performance - not much
> but there did seem to be an improvement.
> 
> The problem seems to not be in average use but that occasionally there's
> a huge spike in load (there doesn't seem to be a particular "killer
> query") and Solr just never recovers.
> 
> Thanks,
> 
> Simon

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene

Posted by Simon Wistow <si...@thegestalt.org>.
On Tue, Jan 25, 2011 at 01:28:16PM +0100, Markus Jelsma said:
> Are you sure you need CMS incremental mode? It's only advised when running on 
> a machine with one or two processors. If you have more you should consider 
> disabling the incremental flags.

I'll test again but we added those to get better performance - not much 
but there did seem to be an improvement.

The problem seems to not be in average use but that occasionally there's 
a huge spike in load (there doesn't seem to be a particular "killer 
query") and Solr just never recovers.

Thanks,

Simon



Re: Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene

Posted by Markus Jelsma <ma...@openindex.io>.
Hi,

Are you sure you need CMS incremental mode? It's only advised when running on 
a machine with one or two processors. If you have more you should consider 
disabling the incremental flags.
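
On a multi-core box that would mean dropping the two incremental options and 
running with just something like:

-server -XX:+UseConcMarkSweepGC -XX:+UseParNewGC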

Cheers,

On Monday 24 January 2011 19:32:38 Simon Wistow wrote:
> We have two slaves replicating off one master every 2 minutes.
> 
> Both using the CMS + ParNew Garbage collector. Specifically
> 
> -server -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
> -XX:+CMSIncrementalMode -XX:+CMSIncrementalPacing
> 
> but periodically they both get into a GC storm and just keel over.
> 
> Looking through the GC logs the amount of memory reclaimed in each GC
> run gets less and less until we get a concurrent mode failure and then
> Solr effectively dies.
> 
> Is it possible there's a memory leak? I note that later versions of
> Lucene have fixed a few leaks. Our current versions are relatively old
> 
> 	Solr Implementation Version: 1.4.1 955763M - mark - 2010-06-17
> 18:06:42
> 
> 	Lucene Implementation Version: 2.9.3 951790 - 2010-06-06 01:30:55
> 
> so I'm wondering if upgrading to a later version of Lucene might help (of
> course it might not but I'm trying to investigate all options at this
> point). If so what's the best way to go about this? Can I just grab the
> Lucene jars and drop them somewhere (or unpack and then repack the solr
> war file?). Or should I use a nightly solr 1.4?
> 
> Or am I barking up completely the wrong tree? I'm trawling through heap
> logs and GC logs at the moment trying to see what other tuning I can
> do but any other hints, tips, tricks or cluebats gratefully received.
> Even if it's just "Yeah, we had that problem and we added more slaves
> and periodically restarted them"
> 
> thanks,
> 
> Simon

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene

Posted by Simon Wistow <si...@thegestalt.org>.
On Mon, Jan 24, 2011 at 10:55:59AM -0800, Em said:
> Could it be possible that your slaves have not finished replicating before
> the new replication process starts?
> If so, there's your OOM :).

This was one of my thoughts as well - we're currently running a slave 
which receives no queries, just to see if it exhibits similar 
behaviour.

My reasoning against it is that we're not seeing any 

	PERFORMANCE WARNING: Overlapping onDeckSearchers=x

in the logs which is something I'd expect to see.

2 minutes doesn't seem like an unreasonable period of time either - the 
docs at http://wiki.apache.org/solr/SolrReplication suggest 20 seconds.



Re: Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene

Posted by Em <ma...@yahoo.de>.
Hi Simon,

I have no experience with a distributed environment.
However, what you are talking about reminds me of another post on the
mailing list.

Could it be possible that your slaves have not finished replicating before
the new replication process starts?
If so, there's your OOM :).

Just a thought, perhaps it helps.

Regards,
Em
-- 
View this message in context: http://lucene.472066.n3.nabble.com/Possible-Memory-Leaks-Upgrading-to-a-Later-Version-of-Solr-or-Lucene-tp2321777p2321959.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene

Posted by Simon Wistow <si...@thegestalt.org>.
On Mon, Jan 24, 2011 at 08:00:53PM +0100, Markus Jelsma said:
> Are you using 3rd-party plugins?

No third-party plugins - this is actually pretty much stock Tomcat 6 + 
Solr from Ubuntu. The only difference is that we've adapted the 
directory layout to fit in with our house style.

Re: Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene

Posted by Markus Jelsma <ma...@openindex.io>.
Are you using 3rd-party plugins?

> We have two slaves replicating off one master every 2 minutes.
> 
> Both using the CMS + ParNew Garbage collector. Specifically
> 
> -server -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
> -XX:+CMSIncrementalMode -XX:+CMSIncrementalPacing
> 
> but periodically they both get into a GC storm and just keel over.
> 
> Looking through the GC logs the amount of memory reclaimed in each GC
> run gets less and less until we get a concurrent mode failure and then
> Solr effectively dies.
> 
> Is it possible there's a memory leak? I note that later versions of
> Lucene have fixed a few leaks. Our current versions are relatively old
> 
> 	Solr Implementation Version: 1.4.1 955763M - mark - 2010-06-17
> 18:06:42
> 
> 	Lucene Implementation Version: 2.9.3 951790 - 2010-06-06 01:30:55
> 
> so I'm wondering if upgrading to a later version of Lucene might help (of
> course it might not but I'm trying to investigate all options at this
> point). If so what's the best way to go about this? Can I just grab the
> Lucene jars and drop them somewhere (or unpack and then repack the solr
> war file?). Or should I use a nightly solr 1.4?
> 
> Or am I barking up completely the wrong tree? I'm trawling through heap
> logs and GC logs at the moment trying to see what other tuning I can
> do but any other hints, tips, tricks or cluebats gratefully received.
> Even if it's just "Yeah, we had that problem and we added more slaves
> and periodically restarted them"
> 
> thanks,
> 
> Simon