Posted to user@cassandra.apache.org by Rajsekhar Mallick <ra...@gmail.com> on 2019/02/11 04:22:13 UTC

High GC pauses leading to client seeing impact

Hello Team,

I have a cluster of 17 nodes in production (8 and 9 nodes across 2 DCs).
Cassandra version: 2.0.11
Clients connect using Thrift over port 9160
JDK version: 1.8.0_66
GC used: G1GC (16 GB heap)
Other GC settings:
MaxGCPauseMillis=200
ParallelGCThreads=32
ConcGCThreads=10
InitiatingHeapOccupancyPercent=50
Number of CPU cores per node: 40
Memory size: 185 GB
Reads/sec: 300 on each node
Writes/sec: 300 on each node
Compaction strategy used: SizeTieredCompactionStrategy
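
For reference, in cassandra-env.sh these map to the following HotSpot flags (flag names reconstructed from the settings above; the heap sizing matches our 16 GB setting):

  JVM_OPTS="$JVM_OPTS -Xms16G -Xmx16G"
  JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
  JVM_OPTS="$JVM_OPTS -XX:MaxGCPauseMillis=200"
  JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=32"
  JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=10"
  JVM_OPTS="$JVM_OPTS -XX:InitiatingHeapOccupancyPercent=50"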

Identified issues in the cluster:
1. Disk space usage across all nodes in the cluster is at 80%. We are currently working on adding more storage to each node.
2. There are 2 tables for which we keep seeing a large number of tombstones. One of the tables has read requests scanning 120 tombstone cells over the last 5 minutes compared to 4 live cells. Tombstone warnings and error messages about queries being aborted are also seen; a quick way to check this per table is shown below.
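
These numbers can be checked per table with nodetool (the keyspace and table names below are placeholders):

  nodetool cfstats <keyspace> | grep -i tombstone   # average tombstones per slice
  nodetool cfhistograms <keyspace> <table>          # read latencies and cells per partition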

Current issues seen:
1. We keep seeing GC pauses of a few minutes at random across nodes in the cluster; pauses of 120 seconds, and even 770 seconds, have been observed.
2. This stalls the affected nodes, and clients see a direct impact.
3. The GC pauses we see do not occur during any of the G1GC phases. The GC log prints "Time to stop threads took 770 seconds", so it is not the garbage collector doing any work; bringing the threads to a safepoint is what takes so long (see the safepoint logging flags after this list).
4. This issue surfaced recently after we moved all nodes in the cluster from an 8 GB heap with CMS to a 16 GB heap with G1GC.
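
To confirm that time-to-safepoint, rather than collection work, is the problem, we can enable the standard HotSpot 8 safepoint logging flags:

  JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"
  JVM_OPTS="$JVM_OPTS -XX:+PrintSafepointStatistics"
  JVM_OPTS="$JVM_OPTS -XX:PrintSafepointStatisticsCount=1"

The "spin", "block", and "sync" columns in the resulting output show where the stop-the-world time is actually going.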

Kindly help with the above issue. I am not able to tell whether the GC is wrongly tuned or whether something else is at play.

Thanks,
Rajsekhar Mallick



---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
For additional commands, e-mail: user-help@cassandra.apache.org


Re: High GC pauses leading to client seeing impact

Posted by Rahul Singh <ra...@gmail.com>.
There are a few factors: sometimes data in a fat partition clogs up the heap/memtable space, and tombstones don't help much either. This is worsened by data skew. I agree: if CMS is working for now, continue using it, then upgrade to better versions of Java / C*.

A few things you can do to see what's going on:

1. Use Hubspot's gc visualizer https://github.com/HubSpot/gc_log_visualizer
2. Look at the heap dump in a JVM explorer to see what is taking up memory.
https://docs.oracle.com/javase/8/docs/technotes/guides/visualvm/heapdump.html

The GC visualizer shows you the GC patterns, and VisualVM, if you can connect to a running JVM instance, will show you what is coming and going. Looking at a heap dump also helps you see what types of objects are in memory
-- rows, cells, and which cell types.
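
If you don't already have a heap dump, a minimal sketch (the pid and output path are placeholders):

  jmap -dump:live,format=b,file=/tmp/cassandra-heap.hprof <cassandra-pid>   # note: -dump:live forces a full GC first

Then open the .hprof file in VisualVM (or Eclipse MAT) and sort by retained size to find the dominant object types.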






Re: High GC pauses leading to client seeing impact

Posted by Elliott Sims <el...@backblaze.com>.
I would strongly suggest you consider an upgrade to 3.11.x.  I found it
decreased the space needed by about 30% in addition to significantly lowering
GC overhead.

As a first step, though, why not just revert to CMS for now if that was
working ok for you?  Then you can convert one host for diagnosis/tuning so
the cluster as a whole stays functional.
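
A rough sketch of the revert in cassandra-env.sh (these mirror the stock 2.0-era CMS defaults; the 8G sizing is what you were running before, and you'd drop the G1 flags at the same time):

  JVM_OPTS="$JVM_OPTS -Xms8G -Xmx8G"
  JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
  JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
  JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
  JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"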

That's also a pretty old version of the JDK to be using G1.  I would
definitely upgrade that to 1.8u202 and see if the problem goes away.
