Posted to user@cassandra.apache.org by Dan Hendry <da...@gmail.com> on 2010/12/20 19:48:58 UTC

Severe Reliability Problems - 0.7 RC2

I have been having severe and strange reliability problems within my
Cassandra cluster. This weekend, all four of my nodes were down at once.
Even now I am losing one every few hours. I have attached output from all
the system monitoring commands I can think of.

What seems to happen is that the java process locks up, sitting at 100%
system CPU usage (but no user CPU) (there are 8 cores, so 100% = 1/8 of total
capacity). JMX freezes and the node effectively dies, but there is typically
nothing unusual in the Cassandra logs. About the only thing which seems to
be correlated is the flushing of memtables. One of the strangest
stats I am getting when in this state is memory paging: 3727168.00 pages
scanned/second (see sar -B output). Occasionally, if I leave the process
alone (~1 h) it recovers (maybe 1 in 5 times); otherwise the only way to
terminate the Cassandra process is with a kill -9. When this happens,
Cassandra memory usage (as reported by JMX before it dies) is
also reasonable (e.g. 6 GB out of a 12 GB heap and 24 GB of system memory).
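
For reference, this is roughly what I have been watching while a node is in
this state (sar is what produced the pages-scanned figure above; vmstat and
pidstat are just the obvious companions, and the PID placeholder is mine):

    # paging activity; pgscank/s and pgscand/s are the pages-scanned counters
    sar -B 5
    # overall memory / swap / run-queue picture every 5 seconds
    vmstat 5
    # per-process fault counts and resident size for the Cassandra JVM
    pidstat -r -p <cassandra-pid> 5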

This feels more like a system-level problem than a Cassandra problem, so I
have tried diversifying my cluster: one node runs Ubuntu 10.10, the other
three 10.04. One runs OpenJDK (1.6.0_20), the rest run Sun JDK (1.6.0_22).
Neither change seems to be correlated with the problem. These are pretty much
stock Ubuntu installs, so nothing special on that front.

Now this has been a relatively sudden development and I can potentially
attribute it to a few things:
1. Upgrading to RC2
2. Ever-increasing amounts of data (there is less than 100 GB per node, so
this should not be the problem).
3. Migrating from a set of machines where data + commit log directories were
on four small RAID 5 HDDs to machines with two 500 GB drives: one for data
and one for commitlog + OS. I have seen more IO wait on these new machines.
But they have the same memory and system settings.

I am just about at my wit's end on this one; any help would be appreciated.

RE: Severe Reliability Problems - 0.7 RC2

Posted by Dan Hendry <da...@gmail.com>.
Thanks for all the responses; I have included the requested information
below. As a side note, I THINK I have fixed the problem by using
"disk_access_mode: mmap_index_only". At the very least, none of the nodes has
died since setting the option.
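
For anyone else hitting this, the change is a single line in cassandra.yaml
(setting and value names as they appear in 0.7):

    # mmap only the index files; read data files through normal buffered I/O.
    # The default "auto" mmaps both data and index files on a 64-bit JVM.
    disk_access_mode: mmap_index_only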

 

> What kernel version are you running?

 

Three of the four nodes run 2.6.32-21-server; the fourth runs 2.6.35-23-server.

 

> Also, you're virtualized (given %steal), right?

 

No, %steal is 0; these are all dedicated machines.

 

> What filesystem are you using?

 

EXT4

 

> Are there any clues in /var/log/messages?

 

Nothing out of the ordinary

 

> How much swap space do you have configured?

 

2 GB of swap (with 24 GB of system memory).

 

Dan

 

From: Chris Goffinet [mailto:cg@chrisgoffinet.com] 
Sent: December-20-10 17:32
To: user@cassandra.apache.org
Subject: Re: Severe Reliability Problems - 0.7 RC2

 

What kernel version are you running? On I/O-intensive nodes running kernels
2.6.18 through 2.6.24, I have seen a kernel bug that locks up the JVM and
spins the CPU to 100%.

On Mon, Dec 20, 2010 at 1:14 PM, Brandon Williams <dr...@gmail.com> wrote:

On Mon, Dec 20, 2010 at 2:13 PM, Dan Hendry <da...@gmail.com>
wrote:

Yes, I have tried that (although only twice). Same impact as a regular kill:
nothing happens and I get no stacktrace output. It is however on my list of
things to try again the next time a node dies. I am also not able to attach
jstack to the process.

 

Kill -3 will only produce output in foreground mode; jstack will work in
either foreground or background.

 

-Brandon 

 



Re: Severe Reliability Problems - 0.7 RC2

Posted by Chris Goffinet <cg...@chrisgoffinet.com>.
What kernel version are you running? On I/O-intensive nodes running kernels
2.6.18 through 2.6.24, I have seen a kernel bug that locks up the JVM and
spins the CPU to 100%.

On Mon, Dec 20, 2010 at 1:14 PM, Brandon Williams <dr...@gmail.com> wrote:

> On Mon, Dec 20, 2010 at 2:13 PM, Dan Hendry <da...@gmail.com> wrote:
>
>> Yes, I have tried that (although only twice). Same impact as a regular
>> kill: nothing happens and I get no stacktrace output. It is however on my
>> list of things to try again the next time a node dies. I am also not able to
>> attach jstack to the process.
>>
>
> Kill -3 will only produce output in foreground mode; jstack will work in
> either foreground or background.
>
> -Brandon
>

Re: Severe Reliability Problems - 0.7 RC2

Posted by Brandon Williams <dr...@gmail.com>.
On Mon, Dec 20, 2010 at 2:13 PM, Dan Hendry <da...@gmail.com> wrote:

> Yes, I have tried that (although only twice). Same impact as a regular
> kill: nothing happens and I get no stacktrace output. It is however on my
> list of things to try again the next time a node dies. I am also not able to
> attach jstack to the process.
>

Kill -3 will only produce output in foreground mode; jstack will work in
either foreground or background.
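
Concretely, something like this (PID placeholder assumed; jstack ships with
the Sun JDK):

    # SIGQUIT asks the JVM to dump all thread stacks to its own stdout, so
    # you only see it if Cassandra was started in the foreground
    kill -3 <cassandra-pid>
    # jstack attaches to the process and prints the dump to your terminal,
    # regardless of how the JVM was started
    jstack <cassandra-pid>
    # if the JVM is wedged and refuses the attach, the forced variant
    # sometimes still works
    jstack -F <cassandra-pid>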

-Brandon

Re: Severe Reliability Problems - 0.7 RC2

Posted by Adrian Cockcroft <ac...@netflix.com>.
What filesystem are you using? You might try EXT3 or 4 vs. XFS as another area of diversity. It sounds as if the page cache or filesystem is messed up. Are there any clues in /var/log/messages? How much swap space do you have configured?
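
Quick ways to check, assuming a stock Ubuntu box (where the kernel log lands
in /var/log/kern.log rather than /var/log/messages):

    mount | grep -E 'ext[34]|xfs'      # filesystem type per mount point
    swapon -s                          # configured swap devices and usage
    # look for OOM-killer or hung-task complaints from the kernel
    grep -iE 'oom|hung_task|blocked' /var/log/kern.log /var/log/syslog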

The kernel-level debugging tools I know are all for Solaris, unfortunately…

Adrian

From: Dan Hendry <da...@gmail.com>
Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Date: Mon, 20 Dec 2010 12:13:56 -0800
To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Subject: RE: Severe Reliability Problems - 0.7 RC2

Yes, I have tried that (although only twice). Same impact as a regular kill: nothing happens and I get no stacktrace output. It is however on my list of things to try again the next time a node dies. I am also not able to attach jstack to the process.

I have also tried disabling JNA (did not help) and I have now changed disk_access_mode from auto to mmap_index_only on two of the nodes.

Dan

From: Kani [mailto:javier.canillas@gmail.com]
Sent: December-20-10 14:14
To: user@cassandra.apache.org
Subject: Re: Severe Reliability Problems - 0.7 RC2

Have you tried sending a KILL -3 to the Cassandra process before the KILL -9? That way you will see what the threads are doing (and maybe where they are blocking). Where the majority of the threads sit may point you at the problem.

I'm not much of a Linux administrator, but when something goes wrong in one of my own applications (Java running on a Linux box), I try that command to see what the application is doing or trying to do.

Kani
On Mon, Dec 20, 2010 at 3:48 PM, Dan Hendry <da...@gmail.com> wrote:
I have been having severe and strange reliability problems within my Cassandra cluster. This weekend, all four of my nodes were down at once. Even now I am losing one every few hours. I have attached output from all the system monitoring commands I can think of.

What seems to happen is that the java process locks up, sitting at 100% system CPU usage (but no user CPU) (there are 8 cores, so 100% = 1/8 of total capacity). JMX freezes and the node effectively dies, but there is typically nothing unusual in the Cassandra logs. About the only thing which seems to be correlated is the flushing of memtables. One of the strangest stats I am getting when in this state is memory paging: 3727168.00 pages scanned/second (see sar -B output). Occasionally, if I leave the process alone (~1 h) it recovers (maybe 1 in 5 times); otherwise the only way to terminate the Cassandra process is with a kill -9. When this happens, Cassandra memory usage (as reported by JMX before it dies) is also reasonable (e.g. 6 GB out of a 12 GB heap and 24 GB of system memory).

This feels more like a system-level problem than a Cassandra problem, so I have tried diversifying my cluster: one node runs Ubuntu 10.10, the other three 10.04. One runs OpenJDK (1.6.0_20), the rest run Sun JDK (1.6.0_22). Neither change seems to be correlated with the problem. These are pretty much stock Ubuntu installs, so nothing special on that front.

Now this has been a relatively sudden development and I can potentially attribute it to a few things:
1. Upgrading to RC2
2. Ever-increasing amounts of data (there is less than 100 GB per node, so this should not be the problem).
3. Migrating from a set of machines where data + commit log directories were on four small RAID 5 HDDs to machines with two 500 GB drives: one for data and one for commitlog + OS. I have seen more IO wait on these new machines. But they have the same memory and system settings.

I am just about at my wit's end on this one; any help would be appreciated.



RE: Severe Reliability Problems - 0.7 RC2

Posted by Dan Hendry <da...@gmail.com>.
Yes, I have tried that (although only twice). Same impact as a regular kill: nothing happens and I get no stacktrace output. It is however on my list of things to try again the next time a node dies. I am also not able to attach jstack to the process.

 

I have also tried disabling JNA (did not help) and I have now changed disk_access_mode from auto to mmap_index_only on two of the nodes. 

 

Dan

 

From: Kani [mailto:javier.canillas@gmail.com] 
Sent: December-20-10 14:14
To: user@cassandra.apache.org
Subject: Re: Severe Reliability Problems - 0.7 RC2

 

Have you tried sending a KILL -3 to the Cassandra process before the KILL -9? That way you will see what the threads are doing (and maybe where they are blocking). Where the majority of the threads sit may point you at the problem.

 

I'm not much of a Linux administrator, but when something goes wrong in one of my own applications (Java running on a Linux box), I try that command to see what the application is doing or trying to do.

 

Kani

On Mon, Dec 20, 2010 at 3:48 PM, Dan Hendry <da...@gmail.com> wrote:

I have been having severe and strange reliability problems within my Cassandra cluster. This weekend, all four of my nodes were down at once. Even now I am losing one every few hours. I have attached output from all the system monitoring commands I can think of.

 

What seems to happen is that the java process locks up, sitting at 100% system CPU usage (but no user CPU) (there are 8 cores, so 100% = 1/8 of total capacity). JMX freezes and the node effectively dies, but there is typically nothing unusual in the Cassandra logs. About the only thing which seems to be correlated is the flushing of memtables. One of the strangest stats I am getting when in this state is memory paging: 3727168.00 pages scanned/second (see sar -B output). Occasionally, if I leave the process alone (~1 h) it recovers (maybe 1 in 5 times); otherwise the only way to terminate the Cassandra process is with a kill -9. When this happens, Cassandra memory usage (as reported by JMX before it dies) is also reasonable (e.g. 6 GB out of a 12 GB heap and 24 GB of system memory).

 

This feels more like a system-level problem than a Cassandra problem, so I have tried diversifying my cluster: one node runs Ubuntu 10.10, the other three 10.04. One runs OpenJDK (1.6.0_20), the rest run Sun JDK (1.6.0_22). Neither change seems to be correlated with the problem. These are pretty much stock Ubuntu installs, so nothing special on that front.

 

Now this has been a relatively sudden development and I can potentially attribute it to a few things:

1. Upgrading to RC2 

2. Ever-increasing amounts of data (there is less than 100 GB per node, so this should not be the problem).

3. Migrating from a set of machines where data + commit log directories were on four small RAID 5 HDDs to machines with two 500 GB drives: one for data and one for commitlog + OS. I have seen more IO wait on these new machines. But they have the same memory and system settings.

 

I am just about at my wit's end on this one; any help would be appreciated.

 



Re: Severe Reliability Problems - 0.7 RC2

Posted by Kani <ja...@gmail.com>.
Have you tried sending a KILL -3 to the Cassandra process before the KILL
-9? That way you will see what the threads are doing (and maybe where they
are blocking). Where the majority of the threads sit may point you at the
problem.
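
A minimal sketch (the process-matching pattern and the log path are guesses
for a typical install; adjust to yours):

    # SIGQUIT makes the JVM print every thread's stack; the process keeps running
    kill -3 $(pgrep -f CassandraDaemon)
    # the dump goes to the JVM's stdout, i.e. wherever that was redirected
    tail -n 200 /var/log/cassandra/output.log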

I'm not much of a Linux administrator, but when something goes wrong in one
of my own applications (Java running on a Linux box), I try that command to
see what the application is doing or trying to do.

Kani

On Mon, Dec 20, 2010 at 3:48 PM, Dan Hendry <da...@gmail.com> wrote:

> I have been having severe and strange reliability problems within my
> Cassandra cluster. This weekend, all four of my nodes were down at once.
> Even now I am losing one every few hours. I have attached output from all
> the system monitoring commands I can think of.
>
> What seems to happen is that the java process locks up, sitting at 100%
> system CPU usage (but no user CPU) (there are 8 cores, so 100% = 1/8 of
> total capacity). JMX freezes and the node effectively dies, but there is
> typically nothing unusual in the Cassandra logs. About the only thing which
> seems to be correlated is the flushing of memtables. One of the strangest
> stats I am getting when in this state is memory paging: 3727168.00 pages
> scanned/second (see sar -B output). Occasionally, if I leave the process
> alone (~1 h) it recovers (maybe 1 in 5 times); otherwise the only way to
> terminate the Cassandra process is with a kill -9. When this happens,
> Cassandra memory usage (as reported by JMX before it dies) is
> also reasonable (e.g. 6 GB out of a 12 GB heap and 24 GB of system memory).
>
> This feels more like a system-level problem than a Cassandra problem, so I
> have tried diversifying my cluster: one node runs Ubuntu 10.10, the other
> three 10.04. One runs OpenJDK (1.6.0_20), the rest run Sun JDK (1.6.0_22).
> Neither change seems to be correlated with the problem. These are pretty much
> stock Ubuntu installs, so nothing special on that front.
>
> Now this has been a relatively sudden development and I can potentially
> attribute it to a few things:
> 1. Upgrading to RC2
> 2. Ever-increasing amounts of data (there is less than 100 GB per node, so
> this should not be the problem).
> 3. Migrating from a set of machines where data + commit log directories were
> on four small RAID 5 HDDs to machines with two 500 GB drives: one for data
> and one for commitlog + OS. I have seen more IO wait on these new machines.
> But they have the same memory and system settings.
>
> I am just about at my wit's end on this one; any help would be appreciated.
>

Re: Severe Reliability Problems - 0.7 RC2

Posted by Peter Schuller <pe...@infidyne.com>.
> There were a couple of threads on lkml recently that may be relevant,
> but I have to run so I can't find the URLs at the moment (todo later tonight).

Ok, I cannot figure out how to find the "first" message in a thread in
any of the lkml archives, but these two threads may be of interest,
especially if you can find their beginnings:

   http://lkml.indiana.edu/hypermail/linux/kernel/1011.3/00030.html

And to a lesser extent (I started that before knowing about the above one):

   http://lkml.indiana.edu/hypermail/linux/kernel/1011.3/00252.html

They don't really talk about the same symptoms, but there are some
good tips there on monitoring what's going on, and some of the techniques
(numactl interleaving, avoiding higher-order allocations) might
conceivably be useful in this case too. At least on the theory that
some kind of eviction or looking-for-free-space loop is what's
spinning (and yes, this is an assumption based on very little
evidence...).
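
A sketch of both, assuming numactl is installed (on a NUMA box the JVM heap
otherwise tends to be allocated node-local, which can leave one node starved
and the kernel spinning on reclaim):

    # free memory per NUMA node; one nearly-empty node while others have
    # plenty free is a hint that zone-local reclaim is what's spinning
    numactl --hardware
    # restart Cassandra with its memory interleaved across all nodes
    numactl --interleave=all <the java command used to start cassandra>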

Also, you're virtualized (given %steal), right? I wonder to what
extent that impacts the VM subsystem in the guest kernel (I don't
really know to what extent there is guest<->host cooperation nowadays on EC2
etc).

-- 
/ Peter Schuller

Re: Severe Reliability Problems - 0.7 RC2

Posted by Peter Schuller <pe...@infidyne.com>.
> be correlated is the flushing of memtables. One of the strangest
> stats I am getting when in this state is memory paging: 3727168.00 pages
> scanned/second (see sar -B output). Occasionally, if I leave the process
> alone (~1 h) it recovers (maybe 1 in 5 times); otherwise the only way to

Sounds to me like the Cassandra process is triggering something along
the lines of fast-path page cache eviction or something similar. The
fact that you see Cassandra at 100% system (as opposed to user) CPU
while a huge number of pages are being scanned certainly sounds like
you're hitting an edge case or bug in the kernel's virtual memory
subsystem. The JVM can't really do much about it if it's stuck in a
syscall that never returns...

There were a couple of threads on lkml recently that may be relevant,
but I have to run so I can't find the URLs at the moment (todo later tonight).

Is anyone aware of a way to get a kernel stack trace for a given
process on a running system?
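
Something along these lines may work, assuming a recent enough kernel
(/proc/<pid>/stack needs CONFIG_STACKTRACE and appeared around 2.6.29; I
have not verified it for this case):

    # kernel-side stack of one process, i.e. which syscall it is stuck in
    cat /proc/<pid>/stack
    # or, as root, dump all blocked (D-state) tasks to the kernel log
    echo w > /proc/sysrq-trigger
    dmesg | tail -n 50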

Cargo cult solution: Upgrade the kernel :)

-- 
/ Peter Schuller