Posted to solr-user@lucene.apache.org by Simon Wistow <si...@thegestalt.org> on 2011/01/24 19:32:38 UTC
Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene
We have two slaves replicating off one master every 2 minutes.
Both using the CMS + ParNew Garbage collector. Specifically
-server -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
-XX:+CMSIncrementalMode -XX:+CMSIncrementalPacing
but periodically they both get into a GC storm and just keel over.
Looking through the GC logs the amount of memory reclaimed in each GC
run gets less and less until we get a concurrent mode failure and then
Solr effectively dies.
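(For reference: a concurrent mode failure shows up explicitly in the GC log if logging is enabled. On HotSpot JVMs of this era that is done with flags along these lines; the log path here is just an example:)

```
-verbose:gc
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-Xloggc:/var/log/solr/gc.log
```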
Is it possible there's a memory leak? I note that later versions of
Lucene have fixed a few leaks. Our current versions are relatively old
Solr Implementation Version: 1.4.1 955763M - mark - 2010-06-17 18:06:42
Lucene Implementation Version: 2.9.3 951790 - 2010-06-06 01:30:55
so I'm wondering if upgrading to a later version of Lucene might help (of
course it might not but I'm trying to investigate all options at this
point). If so, what's the best way to go about this? Can I just grab the
Lucene jars and drop them somewhere (or unpack and then repack the solr
war file?). Or should I use a nightly solr 1.4?
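(As an illustration of the unpack-and-repack route: a small script can swap the bundled Lucene jars inside solr.war. This is only a sketch under assumptions - the function name and paths are hypothetical, and nothing here guarantees a newer Lucene is actually compatible with Solr 1.4.1:)

```python
import os
import zipfile

def swap_lucene_jars(war_path, new_jars, out_path):
    """Copy a .war, dropping the bundled lucene-*.jar entries under
    WEB-INF/lib and adding the replacement jars instead."""
    with zipfile.ZipFile(war_path) as src, \
         zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            base = os.path.basename(item.filename)
            # Skip the old Lucene jars (e.g. WEB-INF/lib/lucene-core-2.9.3.jar)
            if item.filename.startswith("WEB-INF/lib/") and base.startswith("lucene-"):
                continue
            dst.writestr(item, src.read(item.filename))
        # Drop in the replacement jars under WEB-INF/lib
        for jar in new_jars:
            dst.write(jar, "WEB-INF/lib/" + os.path.basename(jar))
```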
Or am I barking up completely the wrong tree? I'm trawling through heap
logs and gc logs at the moment trying to see what other tuning I can
do but any other hints, tips, tricks or cluebats gratefully received.
Even if it's just "Yeah, we had that problem and we added more slaves
and periodically restarted them"
thanks,
Simon
Re: Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene
Posted by Simon Wistow <si...@thegestalt.org>.
On Wed, Apr 06, 2011 at 12:05:57AM +0200, Jan Høydahl said:
> Just curious, was there any resolution to this?
Not really.
We tuned the GC pretty aggressively - we use these options
-server
-Xmx20G -Xms20G -Xss10M
-XX:+UseConcMarkSweepGC
-XX:+UseParNewGC
-XX:+CMSIncrementalMode
-XX:+CMSIncrementalPacing
-XX:SoftRefLRUPolicyMSPerMB=10
and we've played a little with CompressOops and AggressiveOpts.
We also backported the MMapDirectory factory to 1.4.1 and that helped a
lot.
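(For anyone following along: the factory gets wired up in solrconfig.xml. In stock Solr this class only ships from 3.x onwards, so using it on 1.4.1 presumes the backport described above:)

```
<!-- solrconfig.xml -->
<directoryFactory name="DirectoryFactory" class="solr.MMapDirectoryFactory"/>
```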
We do still get spikes of long queries (5s-20s) a few times an hour
which don't appear to be caused by any kind of "Query of Death".
Occasionally (once every few days) one of the slaves will experience a
period of sustained slowness but recovers by itself in less than a
minute.
According to our GC logs we haven't had a full GC for a long time.
Currently the state of play is that we commit on our master every 5000ms
and the slaves replicate every 2 minutes. Our response times for
searches on the slaves are about 180-270ms but if we turn off
replication then we get 60-90ms. So something is clearly "up" with that.
Having talked to the good people at Lucid we're going to try playing
around with commit intervals, upping our mergeFactor from 10 to 25 and
maybe using the BalancedSegmentMergePolicy.
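(Sketched as solrconfig.xml fragments, the settings in play look roughly like this. The values are the ones mentioned above; the merge policy class name presumes the Lucene contrib misc jar is on the classpath, and the exact mergePolicy syntax varies between Solr versions:)

```
<!-- master: autocommit every 5000ms -->
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxTime>5000</maxTime>
  </autoCommit>
</updateHandler>

<!-- index settings: raise mergeFactor from the default 10 -->
<indexDefaults>
  <mergeFactor>25</mergeFactor>
  <mergePolicy>org.apache.lucene.index.BalancedSegmentMergePolicy</mergePolicy>
</indexDefaults>
```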
The system seems to be stable at the moment which is good but obviously
we'd like to lower our query times if possible.
Hopefully this might be of some use to somebody out there, sometime.
Simon
Re: Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene
Posted by Jan Høydahl <ja...@cominvent.com>.
Hi,
Just curious, was there any resolution to this?
--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
On 8. feb. 2011, at 03.40, Markus Jelsma wrote:
> Do you have GC logging enabled? Tail -f the log file and you'll see what CMS is
> telling you. Tuning the occupation fraction of the tenured generation to a
> lower value than default and telling the JVM to only use your value to
> initiate a collection can help a lot. The same goes for sizing the young
> generation and sometimes the survivor ratio.
>
> Consult the HotSpot CMS settings and young generation (or new) sizes. They are
> very important.
>
> If you have multiple slaves under the same load you can easily try different
> configurations. Keeping an eye on the nodes with a tool like JConsole and at
> the same time tailing the GC log will help a lot. Don't forget to send updates
> and frequent commits or you won't be able to replay. I've never seen a Solr
> instance go down under heavy load and without commits but they tend to behave
> badly when commits occur while under heavy load with long cache warming times
> (and heap consumption).
>
> You might also be suffering from memory fragmentation; this is bad and can lead
> to failure. You can configure the JVM to force a compaction before a GC, which
> is nice but does consume CPU time.
>
> A query of death can, in theory, also happen when you sort on a very large
> dataset that isn't optimized; in this case the maxDoc value is too high.
>
> Anyway, try some settings and monitor the nodes and please report your
> findings.
>
>> On Mon, Feb 07, 2011 at 02:06:00PM +0100, Markus Jelsma said:
>>> Heap usage can spike after a commit. Existing caches are still in use and
>>> new caches are being generated and/or auto warmed. Can you confirm this
>>> is the case?
>>
>> We see spikes after replication which I suspect is, as you say, because
>> of the ensuing commit.
>>
>> What we seem to have found is that when we weren't using the Concurrent
>> GC stop-the-world gc runs would kill the app. Now that we're using CMS
>> we occasionally find ourselves in situations where the app still has
>> memory "left over" but the load on the machine spikes, the GC duty cycle
>> goes to 100 and the app never recovers.
>> Restarting usually helps but sometimes we have to take the machine out
>> of the load balancer, wait for a number of minutes and then put it back
>> in.
>>
>> We're working on two hypotheses
>>
>> Firstly - we're CPU bound somehow and that at some point we cross some
>> threshold and GC or something else is just unable to keep up. So
>> whilst it looks like instantaneous death of the app it's actually
>> gradual resource exhaustion where the definition of 'gradual' is 'a very
>> short period of time' (as opposed to some cataclysmic infinite loop bug
>> somewhere).
>>
>> Either that or ... Secondly - there's some sort of Query Of Death that
>> kills machines. We just haven't found it yet, even when replaying logs.
>>
>> Or some combination of both. Or other things. It's maddeningly
>> frustrating.
>>
>> We've also got to try deploying a custom solr.war and using the
>> MMapDirectory to see if that helps with anything.
Re: Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene
Posted by Markus Jelsma <ma...@openindex.io>.
Do you have GC logging enabled? Tail -f the log file and you'll see what CMS is
telling you. Tuning the occupation fraction of the tenured generation to a
lower value than default and telling the JVM to only use your value to
initiate a collection can help a lot. The same goes for sizing the young
generation and sometimes the survivor ratio.
Consult the HotSpot CMS settings and young generation (or new) sizes. They are
very important.
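(Concretely, the tunings described above map to HotSpot flags like these; the numbers are purely illustrative, not recommendations:)

```
# Start CMS collections at a fixed tenured occupancy, and only there
-XX:CMSInitiatingOccupancyFraction=60
-XX:+UseCMSInitiatingOccupancyOnly
# Explicit young generation sizing and survivor ratio
-XX:NewSize=1g -XX:MaxNewSize=1g
-XX:SurvivorRatio=6
```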
If you have multiple slaves under the same load you can easily try different
configurations. Keeping an eye on the nodes with a tool like JConsole and at
the same time tailing the GC log will help a lot. Don't forget to send updates
and frequent commits or you won't be able to replay. I've never seen a Solr
instance go down under heavy load and without commits but they tend to behave
badly when commits occur while under heavy load with long cache warming times
(and heap consumption).
You might also be suffering from memory fragmentation; this is bad and can lead
to failure. You can configure the JVM to force a compaction before a GC, which
is nice but does consume CPU time.
A query of death can, in theory, also happen when you sort on a very large
dataset that isn't optimized; in this case the maxDoc value is too high.
Anyway, try some settings and monitor the nodes and please report your
findings.
> On Mon, Feb 07, 2011 at 02:06:00PM +0100, Markus Jelsma said:
> > Heap usage can spike after a commit. Existing caches are still in use and
> > new caches are being generated and/or auto warmed. Can you confirm this
> > is the case?
>
> We see spikes after replication which I suspect is, as you say, because
> of the ensuing commit.
>
> What we seem to have found is that when we weren't using the Concurrent
> GC stop-the-world gc runs would kill the app. Now that we're using CMS
> we occasionally find ourselves in situations where the app still has
> memory "left over" but the load on the machine spikes, the GC duty cycle
> goes to 100 and the app never recovers.
> Restarting usually helps but sometimes we have to take the machine out
> of the load balancer, wait for a number of minutes and then put it back
> in.
>
> We're working on two hypotheses
>
> Firstly - we're CPU bound somehow and that at some point we cross some
> threshold and GC or something else is just unable to keep up. So
> whilst it looks like instantaneous death of the app it's actually
> gradual resource exhaustion where the definition of 'gradual' is 'a very
> short period of time' (as opposed to some cataclysmic infinite loop bug
> somewhere).
>
> Either that or ... Secondly - there's some sort of Query Of Death that
> kills machines. We just haven't found it yet, even when replaying logs.
>
> Or some combination of both. Or other things. It's maddeningly
> frustrating.
>
> We've also got to try deploying a custom solr.war and using the
> MMapDirectory to see if that helps with anything.
Re: Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene
Posted by Simon Wistow <si...@thegestalt.org>.
On Mon, Feb 07, 2011 at 02:06:00PM +0100, Markus Jelsma said:
> Heap usage can spike after a commit. Existing caches are still in use and new
> caches are being generated and/or auto warmed. Can you confirm this is the
> case?
We see spikes after replication which I suspect is, as you say, because
of the ensuing commit.
What we seem to have found is that when we weren't using the Concurrent
GC stop-the-world gc runs would kill the app. Now that we're using CMS
we occasionally find ourselves in situations where the app still has
memory "left over" but the load on the machine spikes, the GC duty cycle
goes to 100 and the app never recovers.
Restarting usually helps but sometimes we have to take the machine out
of the load balancer, wait for a number of minutes and then put it back
in.
We're working on two hypotheses
Firstly - we're CPU bound somehow and that at some point we cross some
threshold and GC or something else is just unable to keep up. So
whilst it looks like instantaneous death of the app it's actually
gradual resource exhaustion where the definition of 'gradual' is 'a very
short period of time' (as opposed to some cataclysmic infinite loop bug
somewhere).
Either that or ... Secondly - there's some sort of Query Of Death that
kills machines. We just haven't found it yet, even when replaying logs.
Or some combination of both. Or other things. It's maddeningly
frustrating.
We've also got to try deploying a custom solr.war and using the
MMapDirectory to see if that helps with anything.
Re: Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene
Posted by Markus Jelsma <ma...@openindex.io>.
Heap usage can spike after a commit. Existing caches are still in use and new
caches are being generated and/or auto warmed. Can you confirm this is the
case?
On Friday 28 January 2011 00:34:42 Simon Wistow wrote:
> On Tue, Jan 25, 2011 at 01:28:16PM +0100, Markus Jelsma said:
> > Are you sure you need CMS incremental mode? It's only advised when
> > running on a machine with one or two processors. If you have more you
> > should consider disabling the incremental flags.
>
> I'll test again but we added those to get better performance - not much
> but there did seem to be an improvement.
>
> The problem doesn't seem to be with average use but that occasionally there's
> a huge spike in load (there doesn't seem to be a particular "killer
> query") and Solr just never recovers.
>
> Thanks,
>
> Simon
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
Re: Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene
Posted by Simon Wistow <si...@thegestalt.org>.
On Tue, Jan 25, 2011 at 01:28:16PM +0100, Markus Jelsma said:
> Are you sure you need CMS incremental mode? It's only advised when running on
> a machine with one or two processors. If you have more you should consider
> disabling the incremental flags.
I'll test again but we added those to get better performance - not much
but there did seem to be an improvement.
The problem doesn't seem to be with average use but that occasionally there's
a huge spike in load (there doesn't seem to be a particular "killer
query") and Solr just never recovers.
Thanks,
Simon
Re: Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene
Posted by Markus Jelsma <ma...@openindex.io>.
Hi,
Are you sure you need CMS incremental mode? It's only advised when running on
a machine with one or two processors. If you have more you should consider
disabling the incremental flags.
Cheers,
On Monday 24 January 2011 19:32:38 Simon Wistow wrote:
> We have two slaves replicating off one master every 2 minutes.
>
> Both using the CMS + ParNew Garbage collector. Specifically
>
> -server -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
> -XX:+CMSIncrementalMode -XX:+CMSIncrementalPacing
>
> but periodically they both get into a GC storm and just keel over.
>
> Looking through the GC logs the amount of memory reclaimed in each GC
> run gets less and less until we get a concurrent mode failure and then
> Solr effectively dies.
>
> Is it possible there's a memory leak? I note that later versions of
> Lucene have fixed a few leaks. Our current versions are relatively old
>
> Solr Implementation Version: 1.4.1 955763M - mark - 2010-06-17
> 18:06:42
>
> Lucene Implementation Version: 2.9.3 951790 - 2010-06-06 01:30:55
>
> so I'm wondering if upgrading to a later version of Lucene might help (of
> course it might not but I'm trying to investigate all options at this
> point). If so, what's the best way to go about this? Can I just grab the
> Lucene jars and drop them somewhere (or unpack and then repack the solr
> war file?). Or should I use a nightly solr 1.4?
>
> Or am I barking up completely the wrong tree? I'm trawling through heap
> logs and gc logs at the moment trying to see what other tuning I can
> do but any other hints, tips, tricks or cluebats gratefully received.
> Even if it's just "Yeah, we had that problem and we added more slaves
> and periodically restarted them"
>
> thanks,
>
> Simon
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
Re: Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene
Posted by Simon Wistow <si...@thegestalt.org>.
On Mon, Jan 24, 2011 at 10:55:59AM -0800, Em said:
> Could it be possible that your slaves haven't finished replicating before
> the new replication process starts?
> If so, there you got the OOM :).
This was one of my thoughts as well - we're currently running a slave
which has no queries in it just to see if that exhibits similar
behaviour.
My reasoning against it is that we're not seeing any
PERFORMANCE WARNING: Overlapping onDeckSearchers=x
in the logs which is something I'd expect to see.
2 minutes doesn't seem like an unreasonable period of time either - the
docs at http://wiki.apache.org/solr/SolrReplication suggest 20 seconds.
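(For comparison, the slave-side polling interval lives in the ReplicationHandler config in solrconfig.xml; the master hostname here is a placeholder:)

```
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master:8983/solr/replication</str>
    <str name="pollInterval">00:00:20</str>
  </lst>
</requestHandler>
```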
Re: Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene
Posted by Em <ma...@yahoo.de>.
Hi Simon,
I got no experiences with a distributed environment.
However, what you are talking about reminds me of another post on the
mailing list.
Could it be possible that your slaves haven't finished replicating before
the new replication process starts?
If so, there you got the OOM :).
Just a thought, perhaps it helps.
Regards,
Em
--
Re: Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene
Posted by Simon Wistow <si...@thegestalt.org>.
On Mon, Jan 24, 2011 at 08:00:53PM +0100, Markus Jelsma said:
> Are you using 3rd-party plugins?
No third party plugins - this is actually pretty much stock tomcat6 +
solr from Ubuntu. The only difference is that we've adapted the
directory layout to fit in with our house style.
Re: Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene
Posted by Markus Jelsma <ma...@openindex.io>.
Are you using 3rd-party plugins?
> We have two slaves replicating off one master every 2 minutes.
>
> Both using the CMS + ParNew Garbage collector. Specifically
>
> -server -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
> -XX:+CMSIncrementalMode -XX:+CMSIncrementalPacing
>
> but periodically they both get into a GC storm and just keel over.
>
> Looking through the GC logs the amount of memory reclaimed in each GC
> run gets less and less until we get a concurrent mode failure and then
> Solr effectively dies.
>
> Is it possible there's a memory leak? I note that later versions of
> Lucene have fixed a few leaks. Our current versions are relatively old
>
> Solr Implementation Version: 1.4.1 955763M - mark - 2010-06-17
> 18:06:42
>
> Lucene Implementation Version: 2.9.3 951790 - 2010-06-06 01:30:55
>
> so I'm wondering if upgrading to a later version of Lucene might help (of
> course it might not but I'm trying to investigate all options at this
> point). If so, what's the best way to go about this? Can I just grab the
> Lucene jars and drop them somewhere (or unpack and then repack the solr
> war file?). Or should I use a nightly solr 1.4?
>
> Or am I barking up completely the wrong tree? I'm trawling through heap
> logs and gc logs at the moment trying to see what other tuning I can
> do but any other hints, tips, tricks or cluebats gratefully received.
> Even if it's just "Yeah, we had that problem and we added more slaves
> and periodically restarted them"
>
> thanks,
>
> Simon