Posted to user@cassandra.apache.org by Carl Hu <me...@carlhu.com> on 2015/06/01 15:42:32 UTC

GC pauses affecting entire cluster.

We are running Cassandra version 2.1.5.469 on 15 nodes and are experiencing
a problem where the entire cluster slows down for 2.5 minutes when one node
experiences a 17 second stop-the-world gc. These gc's happen once every 2
hours. I did find a ticket that seems related to this:
https://issues.apache.org/jira/browse/CASSANDRA-3853, but Jonathan Ellis
has resolved this ticket.

We are running standard gc settings, but this question is not so much about
the 17 second gc on a single node (after all, we have 14 others) as about the
cascading performance problem.

We are running standard values of dynamic_snitch_badness_threshold (0.1) and
phi_convict_threshold (8). (These values are relevant for the dynamic
snitch routing requests away from the frozen node or the failure detector
marking the node as 'down').
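
(For intuition, a rough sketch of the phi accrual math behind phi_convict_threshold. This is only an illustration: it assumes heartbeat information arrives about once a second and uses the exponential inter-arrival simplification, not Cassandra's actual code.)

import math

def phi(ms_since_last_heartbeat, mean_interval_ms):
    # phi ~= -log10(probability that the heartbeat is merely late); under an
    # exponential inter-arrival assumption this collapses to a simple ratio.
    return ms_since_last_heartbeat / (mean_interval_ms * math.log(10))

mean_interval_ms = 1000.0  # assumed ~1s heartbeat interval

for silent_ms in (2_000, 10_000, 17_000, 20_000):
    p = phi(silent_ms, mean_interval_ms)
    print(f"silent for {silent_ms / 1000:>4.0f}s  phi={p:4.1f}  convicted={p > 8}")

# Under these assumptions phi only crosses the threshold of 8 after roughly
# 18s of silence, so a 17 second pause can sit just under the line: the node
# is never marked down and coordinators keep sending requests to it.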

We use the python client in default round robin mode, so all clients hit the
coordinators on all nodes in round robin. One theory is that since the
coordinator on every node must hit the frozen node at some point during the 17
seconds, each node's request queues fill up, and the entire cluster thus
freezes up. That would explain a 17 second freeze but would not explain the
2.5 minute slowdown (a 10x increase in request latency @P50).
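
(For concreteness, a sketch of the client-side routing in question and of one possible mitigation, token-aware routing, using the DataStax python driver. The hosts, keyspace and table below are placeholders, not our real names.)

# Sketch only: contact points, keyspace and table are placeholders.
# Requires the DataStax python driver (pip install cassandra-driver).
from cassandra.cluster import Cluster
from cassandra.policies import (DCAwareRoundRobinPolicy, RoundRobinPolicy,
                                TokenAwarePolicy)

# The setup described above: plain round robin, so every node, including a
# GC-frozen one, keeps being chosen as coordinator.
round_robin_cluster = Cluster(
    contact_points=["10.0.0.1", "10.0.0.2"],
    load_balancing_policy=RoundRobinPolicy(),
)

# One thing to test: token-aware routing, so each query is coordinated by a
# node that is itself a replica for the key. The paused node still owns its
# token ranges, but it stops acting as a pass-through coordinator for
# everyone else's keys.
token_aware_cluster = Cluster(
    contact_points=["10.0.0.1", "10.0.0.2"],
    load_balancing_policy=TokenAwarePolicy(DCAwareRoundRobinPolicy()),
)

session = token_aware_cluster.connect("my_keyspace")            # placeholder
for row in session.execute("SELECT * FROM my_table LIMIT 1"):   # placeholder
    print(row)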

I'd love your thoughts. I've provided the GC chart here.

Carl

[image: Inline image 1]

Re: GC pauses affecting entire cluster.

Posted by graham sanderson <gr...@vast.com>.
Yes native_objects is the way to go… you can tell if memtables are your problem because you'll see promotion failures of objects sized 131074 dwords.

If your h/w is fast enough, make your young gen as big as possible - we can always collect 8G in under a second, and this gives you the best chance of keeping transient objects (especially if you still have thrift clients) from leaking into the old gen. Moving to 2.1.x (and off heap memtables) from 2.0.x, we have reduced our old gen from 16gig to 12gig and will keep shrinking it, but have had no promotion failures yet, and it's been several months.

Note we are running a patched 2.1.3, but 2.1.5 has the equivalent important bugs fixed (bugs that might otherwise have given you memory issues).
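
(A quick way to check a GC log for that 131074-word signature, as a rough sketch: the log path is a placeholder, and it assumes GC logging is on with -XX:+PrintPromotionFailure so the size of the failing allocation gets printed.)

import re
import sys
from collections import Counter

# Sketch only: adjust the path to wherever -Xloggc points on your nodes.
GC_LOG = sys.argv[1] if len(sys.argv) > 1 else "/var/log/cassandra/gc.log"
SIZE_RE = re.compile(r"promotion failure size = (\d+)")

failures = 0
sizes = Counter()
with open(GC_LOG) as log:
    for line in log:
        if "promotion failed" not in line:
            continue
        failures += 1
        for size in SIZE_RE.findall(line):
            sizes[int(size)] += 1

print(f"promotion-failed events: {failures}")
for size, count in sizes.most_common(10):
    # Per the tip above, entries of size 131074 are the memtable slab signature.
    tag = "  <-- memtable slab?" if size == 131074 else ""
    print(f"  size {size} words x {count}{tag}")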

> On Jun 1, 2015, at 3:00 PM, Carl Hu <me...@carlhu.com> wrote:
> 
> Thank you for the suggestion. After analysis of your settings, the basic hypothesis here is that objects are promoted very quickly to Old Gen because of a rapid accumulation of heap usage due to memtables. We happen to be running on 2.1, and I thought a more conservative approach than your (quite aggressive) gc settings is to try the new memtable_allocation_type with offheap_objects and see if the memtable pressure is relieved sufficiently such that the standard gc settings can keep up.
> 
> The experiment is in progress and I will report back with the results.
> 
> On Mon, Jun 1, 2015 at 10:20 AM, Anuj Wadehra <anujw_2003@yahoo.co.in <ma...@yahoo.co.in>> wrote:
> We have a write heavy workload and used to face promotion failures/long gc pauses with Cassandra 2.0.x. I am not into the code yet, but I think that memtable and compaction related objects have medium lifetimes, and a write heavy workload is therefore not well suited to generational collection with the default settings. So we tuned the JVM to make sure that as few objects as possible are promoted to Old Gen, and achieved great success with that:
> MAX_HEAP_SIZE="12G"
> HEAP_NEWSIZE="3G"
> -XX:SurvivorRatio=2
> -XX:MaxTenuringThreshold=20
> -XX:CMSInitiatingOccupancyFraction=70
> JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=20"
> JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
> JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"
> JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"
> JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=32768"
> JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"
> JVM_OPTS="$JVM_OPTS -XX:CMSMaxAbortablePrecleanTime=30000"
> JVM_OPTS="$JVM_OPTS -XX:CMSWaitDuration=2000"
> JVM_OPTS="$JVM_OPTS -XX:+CMSEdenChunksRecordAlways"
> JVM_OPTS="$JVM_OPTS -XX:+CMSParallelInitialMarkEnabled"
> JVM_OPTS="$JVM_OPTS -XX:-UseBiasedLocking"
> We also think that the default memtable_total_space_in_mb (1/4 of the heap) is too much for write heavy loads. By default, young gen is also 1/4 of the heap. We reduced memtable_total_space_in_mb to 1000mb in order to make sure that memtable related objects don't stay in memory for too long. Combining this with SurvivorRatio=2 and MaxTenuringThreshold=20 did the job well. GC was very consistent. No Full GC observed.
> 
> Environment: 3 node cluster, with each node having 24 cores, 64G RAM and SSDs in RAID5.
> We are making around 12k writes/sec across 5 CFs (one with 4 secondary indexes) and 2300 reads/sec on each node of the 3 node cluster. 2 CFs have wide rows with max data of around 100mb per row.
> 
> Yes. Marking a node down has a cascading effect. Within seconds all nodes in our cluster are marked down.
> 
> Thanks
> Anuj Wadehra
> 
> 
> 
> On Monday, 1 June 2015 7:12 PM, Carl Hu <me@carlhu.com <ma...@carlhu.com>> wrote:
> 
> 
> We are running Cassandra version 2.1.5.469 on 15 nodes and are experiencing a problem where the entire cluster slows down for 2.5 minutes when one node experiences a 17 second stop-the-world gc. These gc's happen once every 2 hours. I did find a ticket that seems related to this: https://issues.apache.org/jira/browse/CASSANDRA-3853 <https://issues.apache.org/jira/browse/CASSANDRA-3853>, but Jonathan Ellis has resolved this ticket. 
> 
> We are running standard gc settings, but this question is not so much about the 17 second gc on a single node (after all, we have 14 others) as about the cascading performance problem.
> 
> We are running standard values of dynamic_snitch_badness_threshold (0.1) and phi_convict_threshold (8). (These values are relevant for the dynamic snitch routing requests away from the frozen node or the failure detector marking the node as 'down').
> 
> We use the python client in default round robin mode, so all clients hit the coordinators on all nodes in round robin. One theory is that since the coordinator on every node must hit the frozen node at some point during the 17 seconds, each node's request queues fill up, and the entire cluster thus freezes up. That would explain a 17 second freeze but would not explain the 2.5 minute slowdown (a 10x increase in request latency @P50).
> 
> I'd love your thoughts. I've provided the GC chart here.
> 
> Carl
> 
> [image: Inline image 1]
> 
> 
> 


Re: GC pauses affecting entire cluster.

Posted by Carl Hu <me...@carlhu.com>.
Anuj,

So I did the experiment with the default gc settings but using
memtable_allocation_type
with offheap_objects: cassandra still freezes once every two hours or so,
locking up the cluster. I will try your settings tomorrow and report back.

Let me know if anyone else has any suggestions,
Carl


On Mon, Jun 1, 2015 at 4:00 PM, Carl Hu <me...@carlhu.com> wrote:

> Thank you for the suggestion. After analysis of your settings, the basic
> hypothesis here is that objects are promoted very quickly to Old Gen because
> of a rapid accumulation of heap usage due to memtables. We happen to be
> running on 2.1, and I thought a more conservative approach than your (quite
> aggressive) gc settings is to try the new memtable_allocation_type with
> offheap_objects and see if the memtable pressure is relieved sufficiently
> such that the standard gc settings can keep up.
>
> The experiment is in progress and I will report back with the results.
>
> On Mon, Jun 1, 2015 at 10:20 AM, Anuj Wadehra <an...@yahoo.co.in>
> wrote:
>
>> We have a write heavy workload and used to face promotion failures/long gc
>> pauses with Cassandra 2.0.x. I am not into the code yet, but I think that
>> memtable and compaction related objects have medium lifetimes, and a write
>> heavy workload is therefore not well suited to generational collection with
>> the default settings. So we tuned the JVM to make sure that as few objects
>> as possible are promoted to Old Gen, and achieved great success with that:
>> MAX_HEAP_SIZE="12G"
>> HEAP_NEWSIZE="3G"
>> -XX:SurvivorRatio=2
>> -XX:MaxTenuringThreshold=20
>> -XX:CMSInitiatingOccupancyFraction=70
>> JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=20"
>> JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
>> JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"
>> JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"
>> JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=32768"
>> JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"
>> JVM_OPTS="$JVM_OPTS -XX:CMSMaxAbortablePrecleanTime=30000"
>> JVM_OPTS="$JVM_OPTS -XX:CMSWaitDuration=2000"
>> JVM_OPTS="$JVM_OPTS -XX:+CMSEdenChunksRecordAlways"
>> JVM_OPTS="$JVM_OPTS -XX:+CMSParallelInitialMarkEnabled"
>> JVM_OPTS="$JVM_OPTS -XX:-UseBiasedLocking"
>> We also think that the default memtable_total_space_in_mb (1/4 of the heap)
>> is too much for write heavy loads. By default, young gen is also 1/4 of the
>> heap. We reduced memtable_total_space_in_mb to 1000mb in order to make sure
>> that memtable related objects don't stay in memory for too long. Combining
>> this with SurvivorRatio=2 and MaxTenuringThreshold=20 did the job well. GC
>> was very consistent. No Full GC observed.
>>
>> Environment: 3 node cluster, with each node having 24 cores, 64G RAM and
>> SSDs in RAID5.
>> We are making around 12k writes/sec across 5 CFs (one with 4 secondary
>> indexes) and 2300 reads/sec on each node of the 3 node cluster. 2 CFs have
>> wide rows with max data of around 100mb per row.
>>
>> Yes. Marking a node down has a cascading effect. Within seconds all nodes
>> in our cluster are marked down.
>>
>> Thanks
>> Anuj Wadehra
>>
>>
>>
>>   On Monday, 1 June 2015 7:12 PM, Carl Hu <me...@carlhu.com> wrote:
>>
>>
>> We are running Cassandra version 2.1.5.469 on 15 nodes and are
>> experiencing a problem where the entire cluster slows down for 2.5 minutes
>> when one node experiences a 17 second stop-the-world gc. These gc's happen
>> once every 2 hours. I did find a ticket that seems related to this:
>> https://issues.apache.org/jira/browse/CASSANDRA-3853, but Jonathan Ellis
>> has resolved this ticket.
>>
>> We are running standard gc settings, but this question is not so much about
>> the 17 second gc on a single node (after all, we have 14 others) as about
>> the cascading performance problem.
>>
>> We are running standard values of dynamic_snitch_badness_threshold (0.1) and
>> phi_convict_threshold (8). (These values are relevant for the dynamic
>> snitch routing requests away from the frozen node or the failure detector
>> marking the node as 'down').
>>
>> We use the python client in default round robin mode, so all clients hit
>> the coordinators on all nodes in round robin. One theory is that since the
>> coordinator on every node must hit the frozen node at some point during the
>> 17 seconds, each node's request queues fill up, and the entire cluster thus
>> freezes up. That would explain a 17 second freeze but would not explain the
>> 2.5 minute slowdown (a 10x increase in request latency @P50).
>>
>> I'd love your thoughts. I've provided the GC chart here.
>>
>> Carl
>>
>> [image: Inline image 1]
>>
>>
>>
>

Re: GC pauses affecting entire cluster.

Posted by Carl Hu <me...@carlhu.com>.
Thank you for the suggestion. After analysis of your settings, the basic
hypothesis here is that objects are promoted very quickly to Old Gen because
of a rapid accumulation of heap usage due to memtables. We happen to be
running on 2.1, and I thought a more conservative approach than your (quite
aggressive) gc settings is to try the new memtable_allocation_type with
offheap_objects and see if the memtable pressure is relieved sufficiently
such that the standard gc settings can keep up.

The experiment is in progress and I will report back with the results.
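
(A rough sketch of the kind of client-side probe that can track P50/P99 while the experiment runs; the host, keyspace, table and query below are placeholders, and the query is assumed to be a cheap single-partition read.)

import time
from cassandra.cluster import Cluster

# Sketch only: host, keyspace, table and query are placeholders.
cluster = Cluster(contact_points=["10.0.0.1"])
session = cluster.connect("my_keyspace")
query = "SELECT * FROM my_table WHERE id = 1"

latencies_ms = []
for _ in range(1000):
    start = time.time()
    session.execute(query)
    latencies_ms.append((time.time() - start) * 1000.0)
    time.sleep(0.05)  # roughly 20 probe requests per second

latencies_ms.sort()
p50 = latencies_ms[len(latencies_ms) // 2]
p99 = latencies_ms[int(len(latencies_ms) * 0.99)]
print(f"P50={p50:.1f}ms  P99={p99:.1f}ms")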

On Mon, Jun 1, 2015 at 10:20 AM, Anuj Wadehra <an...@yahoo.co.in>
wrote:

> We have a write heavy workload and used to face promotion failures/long gc
> pauses with Cassandra 2.0.x. I am not into the code yet, but I think that
> memtable and compaction related objects have medium lifetimes, and a write
> heavy workload is therefore not well suited to generational collection with
> the default settings. So we tuned the JVM to make sure that as few objects
> as possible are promoted to Old Gen, and achieved great success with that:
> MAX_HEAP_SIZE="12G"
> HEAP_NEWSIZE="3G"
> -XX:SurvivorRatio=2
> -XX:MaxTenuringThreshold=20
> -XX:CMSInitiatingOccupancyFraction=70
> JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=20"
> JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
> JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"
> JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"
> JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=32768"
> JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"
> JVM_OPTS="$JVM_OPTS -XX:CMSMaxAbortablePrecleanTime=30000"
> JVM_OPTS="$JVM_OPTS -XX:CMSWaitDuration=2000"
> JVM_OPTS="$JVM_OPTS -XX:+CMSEdenChunksRecordAlways"
> JVM_OPTS="$JVM_OPTS -XX:+CMSParallelInitialMarkEnabled"
> JVM_OPTS="$JVM_OPTS -XX:-UseBiasedLocking"
> We also think that the default memtable_total_space_in_mb (1/4 of the heap)
> is too much for write heavy loads. By default, young gen is also 1/4 of the
> heap. We reduced memtable_total_space_in_mb to 1000mb in order to make sure
> that memtable related objects don't stay in memory for too long. Combining
> this with SurvivorRatio=2 and MaxTenuringThreshold=20 did the job well. GC
> was very consistent. No Full GC observed.
>
> Environment: 3 node cluster, with each node having 24 cores, 64G RAM and
> SSDs in RAID5.
> We are making around 12k writes/sec across 5 CFs (one with 4 secondary
> indexes) and 2300 reads/sec on each node of the 3 node cluster. 2 CFs have
> wide rows with max data of around 100mb per row.
>
> Yes. Marking a node down has a cascading effect. Within seconds all nodes in
> our cluster are marked down.
>
> Thanks
> Anuj Wadehra
>
>
>
>   On Monday, 1 June 2015 7:12 PM, Carl Hu <me...@carlhu.com> wrote:
>
>
> We are running Cassandra version 2.1.5.469 on 15 nodes and are
> experiencing a problem where the entire cluster slows down for 2.5 minutes
> when one node experiences a 17 second stop-the-world gc. These gc's happen
> once every 2 hours. I did find a ticket that seems related to this:
> https://issues.apache.org/jira/browse/CASSANDRA-3853, but Jonathan Ellis
> has resolved this ticket.
>
> We are running standard gc settings, but this question is not so much about
> the 17 second gc on a single node (after all, we have 14 others) as about
> the cascading performance problem.
>
> We are running standard values of dynamic_snitch_badness_threshold (0.1) and
> phi_convict_threshold (8). (These values are relevant for the dynamic
> snitch routing requests away from the frozen node or the failure detector
> marking the node as 'down').
>
> We use the python client in default round robin mode, so all clients hit the
> coordinators on all nodes in round robin. One theory is that since the
> coordinator on every node must hit the frozen node at some point during the
> 17 seconds, each node's request queues fill up, and the entire cluster thus
> freezes up. That would explain a 17 second freeze but would not explain the
> 2.5 minute slowdown (a 10x increase in request latency @P50).
>
> I'd love your thoughts. I've provided the GC chart here.
>
> Carl
>
> [image: Inline image 1]
>
>
>

Re: GC pauses affecting entire cluster.

Posted by Anuj Wadehra <an...@yahoo.co.in>.
We have a write heavy workload and used to face promotion failures/long gc pauses with Cassandra 2.0.x. I am not into the code yet, but I think that memtable and compaction related objects have medium lifetimes, and a write heavy workload is therefore not well suited to generational collection with the default settings. So we tuned the JVM to make sure that as few objects as possible are promoted to Old Gen, and achieved great success with that:
MAX_HEAP_SIZE="12G"
HEAP_NEWSIZE="3G"
-XX:SurvivorRatio=2
-XX:MaxTenuringThreshold=20
-XX:CMSInitiatingOccupancyFraction=70
JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=20"
JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"
JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"
JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=32768"
JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"
JVM_OPTS="$JVM_OPTS -XX:CMSMaxAbortablePrecleanTime=30000"
JVM_OPTS="$JVM_OPTS -XX:CMSWaitDuration=2000"
JVM_OPTS="$JVM_OPTS -XX:+CMSEdenChunksRecordAlways"
JVM_OPTS="$JVM_OPTS -XX:+CMSParallelInitialMarkEnabled"
JVM_OPTS="$JVM_OPTS -XX:-UseBiasedLocking"We also think that default total_memtable_space_in_mb=1/4 heap is too much for write heavy loads. By default, young gen is also 1/4 heap.We reduced it to 1000mb in order to make sure that memtable related objects dont stay in memory for too long. Combining this with SurvivorRatio=2 and MaxTenuringThreshold=20 did the job well. GC was very consistent. No Full GC observed.
Environment: 3 node cluster, with each node having 24 cores, 64G RAM and SSDs in RAID5.
We are making around 12k writes/sec across 5 CFs (one with 4 secondary indexes) and 2300 reads/sec on each node of the 3 node cluster. 2 CFs have wide rows with max data of around 100mb per row.
Yes. Marking a node down has a cascading effect. Within seconds all nodes in our cluster are marked down.

Thanks
Anuj Wadehra


     On Monday, 1 June 2015 7:12 PM, Carl Hu <me...@carlhu.com> wrote:
   

 We are running Cassandra version 2.1.5.469 on 15 nodes and are experiencing a problem where the entire cluster slows down for 2.5 minutes when one node experiences a 17 second stop-the-world gc. These gc's happen once every 2 hours. I did find a ticket that seems related to this: https://issues.apache.org/jira/browse/CASSANDRA-3853, but Jonathan Ellis has resolved this ticket. 
We are running standard gc settings, but this question is not so much about the 17 second gc on a single node (after all, we have 14 others) as about the cascading performance problem.
We are running standard values of dynamic_snitch_badness_threshold (0.1) and phi_convict_threshold (8). (These values are relevant for the dynamic snitch routing requests away from the frozen node or the failure detector marking the node as 'down').
We use the python client in default round robin mode, so all clients hit the coordinators on all nodes in round robin. One theory is that since the coordinator on every node must hit the frozen node at some point during the 17 seconds, each node's request queues fill up, and the entire cluster thus freezes up. That would explain a 17 second freeze but would not explain the 2.5 minute slowdown (a 10x increase in request latency @P50).
I'd love your thoughts. I've provided the GC chart here.
Carl