Posted to user@cassandra.apache.org by Christopher Kung <ch...@gmail.com> on 2010/12/22 10:08:53 UTC

Cassandra Node Routinely Goes Down - 0.7 RC2

Hey All,

I have been having problems running 0.7RC2 where one of my two nodes
routinely goes down. Sometimes both of them go down. I am running the nodes
using Ubuntu Lucid LTS 64-bit with kernel version 2.6.32. Currently, both
nodes are running on micro instances on EC2. I will eventually migrate to
large instances...but I can't seem to get Cassandra to stay up for more than
1 day at a time.

I saw another post recently where someone else was having a similar
problem, and the solution was to change to mmap_index_only for disk access
mode rather than auto. Anyway, the machines are 64-bit, despite being
underpowered, so I don't see why that's necessary. I checked my logs and
there are no error messages. Are the nodes just running into resource issues?

Thanks.

Chris

Re: Cassandra Node Routinely Goes Down - 0.7 RC2

Posted by Peter Schuller <pe...@infidyne.com>.
> I have been having problems running 0.7RC2 where one of my two nodes
> routinely goes down. Sometimes both of them go down. I am running the nodes
> using Ubuntu Lucid LTS 64-bit with kernel version 2.6.32. Currently, both
> nodes are running on micro instances on EC2. I will eventually migrate to
> large instances...but I can't seem to get Cassandra to stay up for more than
> 1 day at a time.

We need more information. What does "go down" mean? Does the JVM get
killed? Does it stop responding to requests but remain running? If the
latter, what does it do - does it spin CPU? What *does* show up in e.g.
the cassandra system log when this happens (error messages or not)?

You're saying you're running on micro instances. Have you configured
your node appropriately, in particular memory thresholds? If using the
out-of-the-box config on a micro instance (isn't that like 512 MB?),
the max heap size will probably be 256 MB. And with out-of-the-box
cassandra.yaml settings I would not be surprised if you're dying with
an OutOfMemory error and possibly with high amounts of GC activity
before actually dying.
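
One quick way to check for this is to scan the system log for the two
symptoms mentioned above. The following is only a rough Python sketch,
not an official tool: it assumes an OutOfMemory condition shows up under
the JVM class name java.lang.OutOfMemoryError, and that GC reports are
logged by a class whose name contains "GCInspector". Both markers, and
the function name summarize_log, are assumptions made for illustration.

```python
from typing import Iterable

# Assumed markers; adjust them to whatever your system log actually contains.
OOM_MARKER = "java.lang.OutOfMemoryError"
GC_MARKER = "GCInspector"


def summarize_log(lines: Iterable[str]) -> dict:
    """Count lines that mention an OutOfMemoryError or a GC report."""
    summary = {"oom_lines": 0, "gc_lines": 0}
    for line in lines:
        if OOM_MARKER in line:
            summary["oom_lines"] += 1
        if GC_MARKER in line:
            summary["gc_lines"] += 1
    return summary
```

To use it, open the log file and pass the file object to summarize_log.
A non-zero oom_lines count, or many gc_lines shortly before the node
went away, would point towards heap exhaustion rather than a kernel
problem.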

-- 
/ Peter Schuller

Re: Cassandra Node Routinely Goes Down - 0.7 RC2

Posted by Robert Coli <rc...@digg.com>.
On Wed, Dec 22, 2010 at 7:21 AM, Dan Hendry <da...@gmail.com> wrote:
> Can one of the Cassandra devs or anybody who knows about memory mapping
> comment on this/my particular mmap situation?

Pardon me if this has been covered or if you are already aware, but if
not, you might find

https://issues.apache.org/jira/browse/CASSANDRA-1214

to be interesting background reading.

=Rob

Re: Cassandra Node Routinely Goes Down - 0.7 RC2

Posted by Peter Schuller <pe...@infidyne.com>.
> Can one of the Cassandra devs or anybody who knows about memory mapping
> comment on this/my particular mmap situation? I have been thinking about it
> and the start of my problems seemed to correlate to my active dataset and
> single sstable sizes growing beyond the amount of free system memory (12 GB,
> my nodes have 24 GB total with 12 GB for Cassandra heap). Does memory
> mapping somehow force the data to stay in memory or prevent memory from
> being reclaimed for other purposes? Google does not turn up any nice simple
> answers.

It doesn't force it to be in memory, but in some respects memory
mapped files will be treated differently than cached pages that aren't
mapped (for example, use of mmap() tends to more easily trigger
swapping out of the application, presumably because the kernel doesn't
know to treat the mmap():ed file as less important than other
mmap():ed or brk():ed data by the application for regular heap
purposes).
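
To make the "doesn't force it to be in memory" point concrete, here is
a minimal Python sketch of mapping a file and touching only a small part
of it. It is an illustration only and says nothing about how Cassandra
itself maps sstables; the helper names read_slice_via_mmap and
make_demo_file are made up for this example. As far as I understand it,
creating the mapping mainly reserves address space, and the kernel reads
pages in when they are first touched.

```python
import mmap
import os
import tempfile


def read_slice_via_mmap(path: str, offset: int, length: int) -> bytes:
    """Map a non-empty file read-only and return one slice of it."""
    with open(path, "rb") as handle:
        # A length of 0 maps the whole file. Only the pages backing the
        # requested slice need to be touched to build the return value.
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return mapped[offset:offset + length]


def make_demo_file(size: int) -> str:
    """Write a throwaway file of `size` bytes and return its path."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as handle:
        handle.write(bytes(i % 256 for i in range(size)))
    return path
```

Calling make_demo_file(8192) and then reading four bytes at offset 4096
returns those bytes without the program ever reading the rest of the
file explicitly.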

Given that your problem definitely seems to be kernel-space, I was not
surprised when you said that switching away from mmap():ing the data
files helped, just from a 'stirring the pot' perspective. But I
certainly don't know enough about the Linux VM system to offer
specific claims as to what problem you triggered.

-- 
/ Peter Schuller

Re: Cassandra Node Routinely Goes Down - 0.7 RC2

Posted by Jonathan Ellis <jb...@gmail.com>.
On Wed, Dec 22, 2010 at 9:21 AM, Dan Hendry <da...@gmail.com> wrote:

> Does memory mapping somehow force the data to stay in memory or prevent
> memory from being reclaimed for other purposes? Google does not turn up any
> nice simple answers.
>

No.

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com

RE: Cassandra Node Routinely Goes Down - 0.7 RC2

Posted by Dan Hendry <da...@gmail.com>.
The main diagnostic features of the problem I was seeing are very high system
CPU with no user CPU utilization (check with top or sar -u), vmstat showing
one process waiting for run time but never seeming to get it, a high page
scan rate, and no Cassandra error messages (although nodes dying did *seem*
to correlate with flushing memtables and compaction). I am also using a
64-bit kernel.
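
If top or sar is not handy, roughly the same numbers can be pulled from
/proc directly. This is a hedged Python sketch rather than a replacement
for those tools: it assumes the Linux /proc/stat layout (user, nice,
system, idle, iowait, irq, softirq, steal) and sums every /proc/vmstat
counter whose name starts with "pgscan", since the exact counter names
differ between kernel versions. The function names are made up for this
example.

```python
def parse_cpu_line(line: str) -> dict:
    """Parse the aggregate "cpu" line of /proc/stat into named counters."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line from /proc/stat")
    names = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]
    values = [int(v) for v in fields[1:1 + len(names)]]
    return dict(zip(names, values))


def cpu_shares(before: dict, after: dict) -> dict:
    """Return the user and system share of CPU time between two samples."""
    deltas = {name: after[name] - before[name] for name in before}
    total = sum(deltas.values())
    if total <= 0:
        return {"user": 0.0, "system": 0.0}
    return {
        "user": (deltas["user"] + deltas["nice"]) / total,
        # Interrupt time is counted as system time here.
        "system": (deltas["system"] + deltas["irq"] + deltas["softirq"]) / total,
    }


def total_page_scans(vmstat_text: str) -> int:
    """Sum every /proc/vmstat counter whose name starts with "pgscan"."""
    total = 0
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0].startswith("pgscan"):
            total += int(parts[1])
    return total
```

Take two samples a few seconds apart. A system share close to 1.0 with a
user share near zero, together with a fast-growing page scan total,
would match the symptoms described above.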

 

I was having nodes die every few hours, but ever since I switched from mmap
(auto = mmap on 64-bit) to mmap_index_only, things have been rock-solid
reliable. No downtime in 48+ hours. You haven't really provided enough
information to determine if you are having the same problem I was having, but
if you think so, I would recommend you at least try switching to
mmap_index_only.
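
For reference, the setting in question is the disk_access_mode key in
cassandra.yaml. If you want to script the change across several nodes,
something like the following Python sketch could work. It is a plain
text rewrite, not a YAML-aware edit, and the helper name
set_disk_access_mode as well as the list of accepted values (auto, mmap,
mmap_index_only, standard) are my assumptions, so double-check them
against the comments in your own cassandra.yaml.

```python
import re

# Assumed set of accepted values; verify against your cassandra.yaml.
VALID_MODES = {"auto", "mmap", "mmap_index_only", "standard"}

# Matches an uncommented "disk_access_mode: <value>" line.
_MODE_LINE = re.compile(r"^([ \t]*disk_access_mode[ \t]*:[ \t]*)\S+", re.MULTILINE)


def set_disk_access_mode(yaml_text: str, mode: str) -> str:
    """Return yaml_text with disk_access_mode set to `mode`."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown disk_access_mode: {mode!r}")
    new_text, count = _MODE_LINE.subn(lambda m: m.group(1) + mode, yaml_text, count=1)
    if count == 0:
        # Key not present: append it instead of guessing where it belongs.
        new_text = yaml_text.rstrip("\n") + f"\ndisk_access_mode: {mode}\n"
    return new_text
```

Read the file, pass its contents through set_disk_access_mode(text,
"mmap_index_only"), write the result back, and restart the node for the
change to take effect.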

 

Can one of the Cassandra devs or anybody who knows about memory mapping
comment on this/my particular mmap situation? I have been thinking about it
and the start of my problems seemed to correlate to my active dataset and
single sstable sizes growing beyond the amount of free system memory (12 GB,
my nodes have 24 GB total with 12 GB for Cassandra heap). Does memory
mapping somehow force the data to stay in memory or prevent memory from
being reclaimed for other purposes? Google does not turn up any nice simple
answers.

 

Dan

 


 
