Posted to commits@cassandra.apache.org by "Delaney Manders (JIRA)" <ji...@apache.org> on 2012/06/01 16:31:23 UTC
[jira] [Resolved] (CASSANDRA-4225) EC2 nodes randomly hard-crash the machine on newest EC2 Linux AMI

     [ https://issues.apache.org/jira/browse/CASSANDRA-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Delaney Manders resolved CASSANDRA-4225.
----------------------------------------

    Resolution: Invalid

My ticket was finally closed by AWS.

Their response:
> The Kernel team has got back to me. They say that there is a new kernel for the AMI which has some patches in the net_rx area that shows up in your traces.  
  
I've moved two machines to the new patched AMI, and they've been solid for 3 days now.  I consider this closed.
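
A minimal sketch for confirming that a node has actually left the affected kernel after moving to the patched AMI. It assumes the affected release is the one listed in the Environment field below (3.2.12-3.2.4.amzn1.x86_64). The function name is hypothetical, and the release string of the patched kernel is not given in this ticket, so the check only tests for the known-bad release.

```shell
#!/bin/sh
# Kernel release taken from the Environment field of this report.
AFFECTED_KERNEL="3.2.12-3.2.4.amzn1.x86_64"

# is_affected_kernel RELEASE (hypothetical helper):
# succeeds only when RELEASE is exactly the affected kernel release.
is_affected_kernel() {
    [ "$1" = "$AFFECTED_KERNEL" ]
}

# Compare the running kernel against the affected release.
current="$(uname -r)"
if is_affected_kernel "$current"; then
    echo "node is still on the affected kernel: $current"
else
    echo "node is running $current, not the affected release"
fi
```

Running this on each node before and after the move gives a quick way to spot instances that were missed.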
                
> EC2 nodes randomly hard-crash the machine on newest EC2 Linux AMI
> -----------------------------------------------------------------
>
>                 Key: CASSANDRA-4225
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4225
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.0
>         Environment: Amazon Linux AMI release 2012.03
> 3.2.12-3.2.4.amzn1.x86_64
> m1.xlarge
> Nodes have:
> Cassandra built and installed from source.
> Ant binary (apache-ant-1.8.3-bin.tar.gz), automake(1.11.1), autoconf(2.64), libtool(2.2.10) installed from AWS repository.
> Sun Java:
> > java -version
> java version "1.6.0_31"
> Java(TM) SE Runtime Environment (build 1.6.0_31-b04)
> Java HotSpot(TM) 64-Bit Server VM (build 20.6-b01, mixed mode)
> Only system changes are:
> echo "root soft memlock unlimited" | sudo tee -a /etc/security/limits.conf
> echo "root hard memlock unlimited" | sudo tee -a /etc/security/limits.conf
> Setup scripts available.
> Cassandra cluster has two datacenters, with DC1 having 8 nodes and DC2 having 4, DC2 being reserved for Hadoop jobs.  DC2 nodes have hard-crashed less frequently, though it has happened.
> Storage is set up with 4 ephemeral drives in RAID for the commit log and 4 EBS drives in RAID for data.
> Usage is exclusively writes.  All mutations are sent as batch mutations, each of which adds or modifies a set of columns on a single key.  There are ~2000 threads streaming batch mutations from a web edge of varying size, distributed across DC1.  Client is Hector (1.0-5) with DynamicLoadBalancing.
> In an effort to mitigate this issue, I've removed jna.jar & platform.jar from $CASSANDRA_HOME/lib, and set disk_access_mode: standard in $CASSANDRA_HOME/conf/cassandra.yaml.  Neither has seemed to help.
>            Reporter: Delaney Manders
>
> At fairly random intervals, about once/day, one of my Cassandra nodes does a hard crash (kernel panic).  
>   
> I can find no system logs (/var/log/*) which have any errors.  No cassandra logs have any errors.  
>   
> On one machine I was watching as it went down, and caught the following message:  
> > Message from syslogd@domU-12-31-38-00-64-31 at May  3 18:24:17 ...
> >  kernel:[252906.019808] Oops: 0002 [#1] SMP
> An AWS support guy found one entry in the console logs:
> > [30178.298308] Pid: 2238, comm: java Not tainted 3.2.12-3.2.4.amzn1.x86_64 #1
> I've replaced two of the nodes with new instances, but all are showing the same behaviour.
> It's very reproducible on my system, though it takes a little waiting.  Leaving it running for another day or so is no big deal; I just need to restart Cassandra every once in a while when I get alerted.  
> I'm open to any additional requested debugging steps before bailing and going back to 1.0.9.
