Posted to commits@cassandra.apache.org by "Nitin (JIRA)" <ji...@apache.org> on 2013/11/13 23:59:23 UTC

[jira] [Commented] (CASSANDRA-6346) Cassandra 2.0 server node runs out of memory during writes/replications

    [ https://issues.apache.org/jira/browse/CASSANDRA-6346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13821954#comment-13821954 ] 

Nitin commented on CASSANDRA-6346:
----------------------------------

Thanks Jonathan.

I am a little confused as to how this would limit the total memory usage on the server. The client can keep enqueuing requests to the server (especially a distributed client like MapReduce). Reducing the client timeout will cause the client to time out earlier, but the server will still enqueue those requests in the LinkedBlockingQueue, which currently seems to have infinite capacity for mutations. Shouldn't the real fix be to allow specifying a maximum size for the LinkedBlockingQueue?
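
To make the suggestion concrete, here is a minimal sketch of what I mean (this is not Cassandra's actual stage/executor code, and both the worker count and the queue capacity are made-up numbers for illustration): a fixed pool fed by a bounded LinkedBlockingQueue, so memory held by pending mutations is capped and submissions fail fast once the cap is reached instead of buffering until the node OOMs.

    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    public class BoundedMutationStage {
        // Illustrative numbers only: 32 mirrors our concurrent_writes
        // setting; 10000 is an arbitrary cap on queued mutations.
        private static final int WORKERS = 32;
        private static final int MAX_QUEUED_MUTATIONS = 10000;

        private final ThreadPoolExecutor executor = new ThreadPoolExecutor(
                WORKERS, WORKERS,
                60, TimeUnit.SECONDS,
                // Bounded queue: the heap held by pending tasks is capped,
                // unlike the no-arg LinkedBlockingQueue whose capacity is
                // Integer.MAX_VALUE.
                new LinkedBlockingQueue<Runnable>(MAX_QUEUED_MUTATIONS),
                // When the queue is full, reject instead of buffering, so
                // the caller can surface an overload error to the client.
                new ThreadPoolExecutor.AbortPolicy());

        public void submitMutation(Runnable mutationTask) {
            // Throws RejectedExecutionException once all WORKERS threads are
            // busy and MAX_QUEUED_MUTATIONS tasks are already waiting.
            executor.execute(mutationTask);
        }
    }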

Please correct me if I am completely mistaken. Thanks in advance for your help.

> Cassandra 2.0 server node runs out of memory during writes/replications
> -----------------------------------------------------------------------
>
>                 Key: CASSANDRA-6346
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6346
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Nitin
>         Attachments: LinkedBlockingQ.png
>
>
> Currently we are running an 18-node Cassandra cluster with the NetworkTopologyStrategy replication strategy (d1 = 3 and d2 = 3).
> Our servers seem to crash with OOM exceptions. Our heap size is 8 GB. While a node was crashing I got hold of the hprof file and ran it through the Eclipse MAT analyzer.
> After analyzing the hprof (please see the attachment for the top offenders), I find that there is a LinkedBlockingQueue (from the mutation stage) that seems to hold about 7.3 GB of the total 8 GB of RAM.
> After deep-diving into the Cassandra 2.0 code, I see that every update/write/replication goes through stages, including the mutation stage, and the number of threads that drain this queue (I am assuming the memtable-to-SSTable write) is controlled by concurrent_writes. Ours is set to 32 concurrent writes.
> However, we observe node crashes even when there are zero writes to the node but replication requests are floating around the cluster.
> Any idea which knobs throttle the size of these queues and the maximum number of write and replication requests a node can accept? What are the recommended settings to operate a Cassandra node in a mode where it rejects requests beyond a certain queue threshold?



--
This message was sent by Atlassian JIRA
(v6.1#6144)