Posted to commits@cassandra.apache.org by "Jorge Rodriguez (JIRA)" <ji...@apache.org> on 2015/08/21 16:09:46 UTC

[jira] [Commented] (CASSANDRA-10150) Cassandra read latency potentially caused by memory leak

    [ https://issues.apache.org/jira/browse/CASSANDRA-10150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706772#comment-14706772 ] 

Jorge Rodriguez commented on CASSANDRA-10150:
---------------------------------------------

We came across this thread from Benedict on the jmx-dev mailing list yesterday and implemented the workaround he recommends there: http://mail.openjdk.java.net/pipermail/jmx-dev/2014-February/000585.html
The workaround is to enable the JVM flag -XX:+CMSClassUnloadingEnabled.

Since enabling the flag yesterday we have not seen the memory leak, and so far performance does not appear to have been impacted either.
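
For anyone who wants to double-check that the flag actually took effect (the usual place to set it is JVM_OPTS in conf/cassandra-env.sh), below is a minimal sketch, not from Benedict's post, that connects over JMX and reads the relevant HotSpot VM options. It assumes the default local JMX endpoint on port 7199 with authentication disabled, and the class name is just illustrative:

{code:java}
import java.lang.management.ManagementFactory;

import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

import com.sun.management.HotSpotDiagnosticMXBean;
import com.sun.management.VMOption;

// Sketch: confirm CMS class unloading is enabled on a running node via JMX.
public class CheckCmsClassUnloading {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url, null);
        try {
            HotSpotDiagnosticMXBean diag = ManagementFactory.newPlatformMXBeanProxy(
                    connector.getMBeanServerConnection(),
                    "com.sun.management:type=HotSpotDiagnostic",
                    HotSpotDiagnosticMXBean.class);
            // CMS only unloads classes when both the collector and the flag are on.
            for (String flag : new String[] { "UseConcMarkSweepGC", "CMSClassUnloadingEnabled" }) {
                VMOption opt = diag.getVMOption(flag);
                System.out.println(flag + " = " + opt.getValue()
                        + " (origin: " + opt.getOrigin() + ")");
            }
        } finally {
            connector.close();
        }
    }
}
{code}

If CMSClassUnloadingEnabled comes back false (or with origin DEFAULT), the option was not picked up from the command line.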

> Cassandra read latency potentially caused by memory leak
> --------------------------------------------------------
>
>                 Key: CASSANDRA-10150
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10150
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: cassandra 2.0.12
>            Reporter: Cheng Ren
>
>   We are currently migrating to a new Cassandra cluster that is multi-region on EC2.  Our previous cluster was also on EC2 but only in the east region.  In addition, we have upgraded from Cassandra 2.0.4 to 2.0.12 and from Ubuntu 12 to Ubuntu 14.
>   We are investigating a Cassandra latency problem on our new cluster.  The symptom is that over a long period of time (12-16 hours) the TP90-95 read latency degrades to the point of being well above our SLAs.  During normal operation our TP95 for a 50-key lookup is 75ms; when fully degraded, we see TP95 latencies of 300ms.  A rolling restart resolves the problem.
> We are seeing a strong correlation between old-gen heap usage (and how much of it gets freed) and the high latencies.  We are running with a max heap size of 12GB and a max new-gen size of 2GB.
> Below is a chart of the heap usage over a 24-hour period.  Right below it is a chart of TP95 latencies (from a mixed workload of 50-key and single-key lookups), and the third image shows CMS Old Gen memory usage:
> Overall heap usage over 24 hrs:
> !https://dl.dropboxusercontent.com/u/303980955/1.png|height=300,width=500!
> TP95 latencies over 24 hours:
> !https://dl.dropboxusercontent.com/u/303980955/2.png|height=300,width=500!
> OldGen memory usage over 24 hours:
> !https://dl.dropboxusercontent.com/u/303980955/3.png|height=300,width=500!
>  You can see from this that the old-gen region accounts for the majority of the heap usage.  We cannot figure out why this memory is not being collected during a full GC.  For reference, on our old Cassandra cluster a full GC clears up the majority of the heap space.  See the image below from an old production node operating normally:
> !https://dl.dropboxusercontent.com/u/303980955/4.png|height=300,width=500!
>  From a heap dump we found that most of the memory is consumed by unreachable objects. Further analysis showed that those objects are RMIConnectionImpl$CombinedClassLoader$ClassLoaderWrapper (holding 4GB of memory) and java.security.ProtectionDomain (holding 2GB). The only place we know Cassandra uses RMI is JMX, but
> does anyone have any clue where else those objects might be used? And why do they take up so much memory?
> It would also be great if someone could offer further debugging tips on the latency or GC issue.
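
A note on the RMIConnectionImpl$CombinedClassLoader objects mentioned above: as far as we can tell, every JMX client connection gets its own server-side RMIConnectionImpl and CombinedClassLoader, so connection churn from monitoring agents piles these up in the old gen when CMS is not unloading classes. The sketch below, illustrative only and not taken from the ticket, mimics that churn against the default localhost:7199 endpoint with JMX auth disabled; the class name is made up:

{code:java}
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Sketch: repeatedly open and close JMX connections, the way a polling
// monitoring agent would.  Each cycle leaves per-connection RMI objects
// behind on the server until the collector reclaims them.
public class JmxConnectionChurn {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
        for (int i = 0; i < 500; i++) {
            JMXConnector connector = JMXConnectorFactory.connect(url, null);
            try {
                // One trivial call per connection, like a metrics poll.
                connector.getMBeanServerConnection().getMBeanCount();
            } finally {
                connector.close();
            }
            Thread.sleep(100);
        }
    }
}
{code}

Watching the CMS Old Gen pool while something like this runs, with and without -XX:+CMSClassUnloadingEnabled, should make the difference easy to see.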



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)