You are viewing a plain text version of this content. The canonical link for it is here.
Posted to solr-user@lucene.apache.org by Simon Wistow <si...@thegestalt.org> on 2011/01/10 19:36:23 UTC

Box occasionally pegs one cpu at 100%

I have a fairly classic master/slave set up.

Response times on the slave are generally good with blips periodically, 
apparently when replication is happening.

Occasionally however the process will have one incredibly slow query and 
will peg the CPU at 100%.

The weird thing is that it will remain that way even if we stop querying 
it and stop replication and then wait for over 20 minutes. The only way 
to fix the problem at that point is to restart tomcat.

Looking at slow queries around the time of the incident they don't look 
particularly bad - they're predominantly filter queries running under 
dismax and there doesn't seem to be anything unusual about them.

The index file is about 266G and has 30G of disk free. The machine has 
50G of RAM and is running with -Xmx35G.

Looking at the processes running it appears to be the main Java thread 
that's CPU bound, not the child threads. 

Stracing the process gives a lot of brk instructions (presumably some 
sort of wait loop) with occasional blips of: 


mprotect(0x7fc5721d9000, 4096, PROT_READ) = 0
futex(0x451c24a4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x451c24a0, 
{FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x4269dd14, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x4269dd10, 
{FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x7fbc941603b4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 
325, {1294683789, 614186000}, ffffffff) = 0
futex(0x41d19b28, FUTEX_WAKE_PRIVATE, 1) = 0
mprotect(0x7fc5721d8000, 4096, PROT_READ) = 0
mprotect(0x7fc5721d8000, 4096, PROT_READ|PROT_WRITE) = 0
futex(0x7fbc94eeb5b4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7fbc94eeb5b0, 
{FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x426a6a28, FUTEX_WAKE_PRIVATE, 1) = 1
mprotect(0x7fc5721d9000, 4096, PROT_NONE) = 0
futex(0x41cae8f4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x41cae8f0, 
{FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x41cae328, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7fbc941603b4, FUTEX_WAIT_PRIVATE, 327, NULL) = 0
futex(0x41d19b28, FUTEX_WAKE_PRIVATE, 1) = 0
mmap(0x7fc2e0230000, 121962496, PROT_NONE, 
MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 
0x7fc2e0230000
mmap(0x7fbca58e0000, 237568, PROT_NONE, 
MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 
0x7fbca58e0000

Any ideas about what's happening and if there's anyway to mitigate it? 
If the box at least recovered then I could run another slave and load 
balance between them working on the principle that the second box 
would pick up the slack whilst the first box restabilised but, as it is, 
that's not reliable.

Thanks,

Simon


Re: Box occasionally pegs one cpu at 100%

Posted by Dennis Gearon <ge...@sbcglobal.net>.
One other possiblity is that the OS or BIOS is doing that, at least on a laptop. 
There is a new feature where, if the load is low enough, non multi threaded 
applications can be assigned to one processor and that processor has it's clock 
boosted so the older software will run faster on the new processors - Otherwise 
they run SLOWER!.

My brother has a cad program that runs slower on his new quad core because the 
base clock speed is slower than a single processor CPU. The software company is 
not taking the time to rewrite their code, excpet where they add features or 
fixes. 




----- Original Message ----

From: Brian Burke <BB...@techtarget.com>
To: "solr-user@lucene.apache.org" <so...@lucene.apache.org>
Sent: Mon, January 10, 2011 10:56:27 AM
Subject: Re: Box occasionally pegs one cpu at 100%

This sounds like it could be garbage collection related, especially with a heap 
that large.  Depending on your jvm tuning, a FGC could take quite a while, 
effectively 'pausing' the JVM.

Have you looked at something like jstat -gcutil   or similar to monitor the 
garbage collection?


On Jan 10, 2011, at 1:36 PM, Simon Wistow wrote:

> I have a fairly classic master/slave set up.
> 
> Response times on the slave are generally good with blips periodically, 
> apparently when replication is happening.
> 
> Occasionally however the process will have one incredibly slow query and 
> will peg the CPU at 100%.
> 
> The weird thing is that it will remain that way even if we stop querying 
> it and stop replication and then wait for over 20 minutes. The only way 
> to fix the problem at that point is to restart tomcat.
> 
> Looking at slow queries around the time of the incident they don't look 
> particularly bad - they're predominantly filter queries running under 
> dismax and there doesn't seem to be anything unusual about them.
> 
> The index file is about 266G and has 30G of disk free. The machine has 
> 50G of RAM and is running with -Xmx35G.
> 
> Looking at the processes running it appears to be the main Java thread 
> that's CPU bound, not the child threads. 
> 
> Stracing the process gives a lot of brk instructions (presumably some 
> sort of wait loop) with occasional blips of: 
> 
> 
> mprotect(0x7fc5721d9000, 4096, PROT_READ) = 0
> futex(0x451c24a4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x451c24a0, 
> {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
> futex(0x4269dd14, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x4269dd10, 
> {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
> futex(0x7fbc941603b4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 
> 325, {1294683789, 614186000}, ffffffff) = 0
> futex(0x41d19b28, FUTEX_WAKE_PRIVATE, 1) = 0
> mprotect(0x7fc5721d8000, 4096, PROT_READ) = 0
> mprotect(0x7fc5721d8000, 4096, PROT_READ|PROT_WRITE) = 0
> futex(0x7fbc94eeb5b4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7fbc94eeb5b0, 
> {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
> futex(0x426a6a28, FUTEX_WAKE_PRIVATE, 1) = 1
> mprotect(0x7fc5721d9000, 4096, PROT_NONE) = 0
> futex(0x41cae8f4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x41cae8f0, 
> {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
> futex(0x41cae328, FUTEX_WAKE_PRIVATE, 1) = 1
> futex(0x7fbc941603b4, FUTEX_WAIT_PRIVATE, 327, NULL) = 0
> futex(0x41d19b28, FUTEX_WAKE_PRIVATE, 1) = 0
> mmap(0x7fc2e0230000, 121962496, PROT_NONE, 
> MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 
> 0x7fc2e0230000
> mmap(0x7fbca58e0000, 237568, PROT_NONE, 
> MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 
> 0x7fbca58e0000
> 
> Any ideas about what's happening and if there's anyway to mitigate it? 
> If the box at least recovered then I could run another slave and load 
> balance between them working on the principle that the second box 
> would pick up the slack whilst the first box restabilised but, as it is, 
> that's not reliable.
> 
> Thanks,
> 
> Simon
> 

Re: Box occasionally pegs one cpu at 100%

Posted by Simon Wistow <si...@thegestalt.org>.
On Mon, Jan 10, 2011 at 05:58:42PM -0500, François Schiettecatte said:
> http://www.oracle.com/technetwork/java/gc-tuning-5-138395.html	(you need to read this one)
> 
> http://java.sun.com/performance/reference/whitepapers/tuning.html	(and this one).

Yeah, I have these two pages bookmarked :)

> jstat is also very good for seeing what is going on in the JVM. I also 
> recall there was a way to trace GC in the JVM but cant recall how off 
> the top of my head, maybe it was a JVM option.

You can use -XX:+PrintGC and -XX:+PrintGCDetail (and 
-XX:+PrintGCTimeStamps) as well as -Xloggc:gc.log to log to a file.

I'm also finding NewRelic's RPM system great for monitoring Solr - the 
integration is really good, I give it two thumbs up.



Re: Box occasionally pegs one cpu at 100%

Posted by François Schiettecatte <fs...@gmail.com>.
This reminded me of a situation I ran into in the past where the JVM was being rendered useless because it was calling FGC repeatedly. Effectively what was going on is that a very large array was allocated which swamped the JVM memory and caused it to trash, much like an OS.

Here are some links which will help (at least they helped me):

http://www.oracle.com/technetwork/java/gc-tuning-5-138395.html	(you need to read this one)

http://java.sun.com/performance/reference/whitepapers/tuning.html	(and this one).

http://www.oracle.com/technetwork/java/javase/tech/index-jsp-136373.html

http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html

http://java.sun.com/performance/jvmstat/

http://blogs.sun.com/watt/resource/jvm-options-list.html

jstat is also very good for seeing what is going on in the JVM. I also recall there was a way to trace GC in the JVM but cant recall how off the top of my head, maybe it was a JVM option.

Hope this helps.

Cheers

François


On Jan 10, 2011, at 5:13 PM, Simon Wistow wrote:

> On Mon, Jan 10, 2011 at 01:56:27PM -0500, Brian Burke said:
>> This sounds like it could be garbage collection related, especially 
>> with a heap that large.  Depending on your jvm tuning, a FGC could 
>> take quite a while, effectively 'pausing' the JVM.
>> 
>> Have you looked at something like jstat -gcutil or similar to monitor 
>> the garbage collection?
> 
> I think you may have hit the nail on the head. 
> 
> Having checked the configuration again I noticed that the -server flag 
> didn't appear to be present in the options passed to Java (I'm convinced 
> it used to be there). As I understand it, this would mean that the 
> Parallel GC wouldn't be implicitly enabled.
> 
> If that's true then that's a definite strong candidate for causing the 
> root process and only the root process to peg a single CPU.
> 
> Anybody have any experience of the differences between
> 
> -XX:+UseParallelGC 
> 
> and
> 
> -XX:+UseConcMarkSweepGC with -XX:+UseParNewGC
> 
> ?
> 
> I believe -XX:+UseParallelGC  is the default with -server so I suppose 
> that's a good place to start but I'd appreciate any anecdotes or 
> experiences.
> 
> 
> 


Re: Box occasionally pegs one cpu at 100%

Posted by Simon Wistow <si...@thegestalt.org>.
On Mon, Jan 10, 2011 at 01:56:27PM -0500, Brian Burke said:
> This sounds like it could be garbage collection related, especially 
> with a heap that large.  Depending on your jvm tuning, a FGC could 
> take quite a while, effectively 'pausing' the JVM.
> 
> Have you looked at something like jstat -gcutil or similar to monitor 
> the garbage collection?

I think you may have hit the nail on the head. 

Having checked the configuration again I noticed that the -server flag 
didn't appear to be present in the options passed to Java (I'm convinced 
it used to be there). As I understand it, this would mean that the 
Parallel GC wouldn't be implicitly enabled.

If that's true then that's a definite strong candidate for causing the 
root process and only the root process to peg a single CPU.

Anybody have any experience of the differences between

-XX:+UseParallelGC 

and

-XX:+UseConcMarkSweepGC with -XX:+UseParNewGC

?

I believe -XX:+UseParallelGC  is the default with -server so I suppose 
that's a good place to start but I'd appreciate any anecdotes or 
experiences.




Re: Box occasionally pegs one cpu at 100%

Posted by Brian Burke <BB...@techtarget.com>.
This sounds like it could be garbage collection related, especially with a heap that large.  Depending on your jvm tuning, a FGC could take quite a while, effectively 'pausing' the JVM.

Have you looked at something like jstat -gcutil   or similar to monitor the garbage collection?


On Jan 10, 2011, at 1:36 PM, Simon Wistow wrote:

> I have a fairly classic master/slave set up.
> 
> Response times on the slave are generally good with blips periodically, 
> apparently when replication is happening.
> 
> Occasionally however the process will have one incredibly slow query and 
> will peg the CPU at 100%.
> 
> The weird thing is that it will remain that way even if we stop querying 
> it and stop replication and then wait for over 20 minutes. The only way 
> to fix the problem at that point is to restart tomcat.
> 
> Looking at slow queries around the time of the incident they don't look 
> particularly bad - they're predominantly filter queries running under 
> dismax and there doesn't seem to be anything unusual about them.
> 
> The index file is about 266G and has 30G of disk free. The machine has 
> 50G of RAM and is running with -Xmx35G.
> 
> Looking at the processes running it appears to be the main Java thread 
> that's CPU bound, not the child threads. 
> 
> Stracing the process gives a lot of brk instructions (presumably some 
> sort of wait loop) with occasional blips of: 
> 
> 
> mprotect(0x7fc5721d9000, 4096, PROT_READ) = 0
> futex(0x451c24a4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x451c24a0, 
> {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
> futex(0x4269dd14, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x4269dd10, 
> {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
> futex(0x7fbc941603b4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 
> 325, {1294683789, 614186000}, ffffffff) = 0
> futex(0x41d19b28, FUTEX_WAKE_PRIVATE, 1) = 0
> mprotect(0x7fc5721d8000, 4096, PROT_READ) = 0
> mprotect(0x7fc5721d8000, 4096, PROT_READ|PROT_WRITE) = 0
> futex(0x7fbc94eeb5b4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7fbc94eeb5b0, 
> {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
> futex(0x426a6a28, FUTEX_WAKE_PRIVATE, 1) = 1
> mprotect(0x7fc5721d9000, 4096, PROT_NONE) = 0
> futex(0x41cae8f4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x41cae8f0, 
> {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
> futex(0x41cae328, FUTEX_WAKE_PRIVATE, 1) = 1
> futex(0x7fbc941603b4, FUTEX_WAIT_PRIVATE, 327, NULL) = 0
> futex(0x41d19b28, FUTEX_WAKE_PRIVATE, 1) = 0
> mmap(0x7fc2e0230000, 121962496, PROT_NONE, 
> MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 
> 0x7fc2e0230000
> mmap(0x7fbca58e0000, 237568, PROT_NONE, 
> MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 
> 0x7fbca58e0000
> 
> Any ideas about what's happening and if there's anyway to mitigate it? 
> If the box at least recovered then I could run another slave and load 
> balance between them working on the principle that the second box 
> would pick up the slack whilst the first box restabilised but, as it is, 
> that's not reliable.
> 
> Thanks,
> 
> Simon
>