Posted to user@cassandra.apache.org by Anishek Agarwal <an...@gmail.com> on 2016/03/02 08:12:36 UTC

Lot of GC on two nodes out of 7

Hello,

we have a Cassandra cluster of 7 nodes, all of them have the same JVM GC
configurations, and all our writes / reads use the TokenAware policy wrapping
a DCAware policy. All nodes are part of the same datacenter.

We are seeing that two nodes have high GC collection times. They
mostly seem to spend time in GC, about 300-600 ms per collection. This also seems to
result in higher CPU utilisation on these machines. The other 5 nodes don't
have this problem.

There is no additional repair activity going on in the cluster, and we are not
sure why this is happening.
We checked cfhistograms on the two CFs we have in the cluster and the number of
reads seems to be almost the same.

We also used cfstats to see the number of sstables on each node, and one of
the nodes with the above problem has twice the number of sstables as the
other nodes. This still does not explain why two nodes have high GC
overheads. Our GC config is as below:

JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"

JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"

JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"

JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"

JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=50"

JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=70"

JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"

JVM_OPTS="$JVM_OPTS -XX:+UseTLAB"

JVM_OPTS="$JVM_OPTS -XX:MaxPermSize=256m"

JVM_OPTS="$JVM_OPTS -XX:+AggressiveOpts"

JVM_OPTS="$JVM_OPTS -XX:+UseCompressedOops"

JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"

JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=48"

JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=48"

JVM_OPTS="$JVM_OPTS -XX:-ExplicitGCInvokesConcurrent"

JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"

JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"

JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"

# earlier value 131072 = 32768 * 4

JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=131072"

JVM_OPTS="$JVM_OPTS -XX:CMSScheduleRemarkEdenSizeThreshold=104857600"

JVM_OPTS="$JVM_OPTS -XX:CMSRescanMultiple=32768"

JVM_OPTS="$JVM_OPTS -XX:CMSConcMarkMultiple=32768"

#new

JVM_OPTS="$JVM_OPTS -XX:+CMSConcurrentMTEnabled"

We are using Cassandra 2.0.17. If anyone has any suggestion as to what
else we can look for to understand why this is happening, please do reply.
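
As a side note, a minimal sketch of GC logging flags, in the same JVM_OPTS style as above, that would capture these pauses for later analysis (the log path is just an example):

JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"

JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"

JVM_OPTS="$JVM_OPTS -XX:+PrintTenuringDistribution"

JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"

JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"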



Thanks
anishek

RE: Lot of GC on two nodes out of 7

Posted by Amit Singh F <am...@ericsson.com>.
Hi,

Could you try pasting the full GC settings portion of your cassandra-env.sh file?

Thanks
Amit Singh
From: Anishek Agarwal [mailto:anishek@gmail.com]
Sent: Monday, March 07, 2016 12:24 PM
To: user@cassandra.apache.org
Subject: Re: Lot of GC on two nodes out of 7

@Jeff I was just trying to follow some more advice given above; I personally still think a larger newgen heap size would be better.

@Jonathan I will post the whole logs. I have restarted the nodes with additional changes; most probably tomorrow or the day after I will put out the GC logs.

The problem still exists on two nodes: too much time spent in GC. Additionally, I tried to print the state of the cluster via my application to see what is happening, and I see that the node with high GC has a lot of "inflight queries" -- almost 1100, while the other nodes are all at 0.

The cfhistograms for all nodes show approximately the same number of reads -- so I am thinking the above phenomenon is happening because the node is spending its time in GC.

Also, looking at the load balancing policy on the client, it is new TokenAwarePolicy(new DCAwareRoundRobinPolicy()).

If you have any other ideas, please keep posting them.

thanks
anishek

On Sat, Mar 5, 2016 at 12:54 AM, Jonathan Haddad <jo...@jonhaddad.com> wrote:
Without looking at your GC logs (you never posted a gist), my assumption would be you're doing a lot of copying between survivor generations, and they're taking a long time.  You're probably also copying a lot of data to your old gen as a result of having full-ish survivor spaces to begin with.

On Thu, Mar 3, 2016 at 10:26 PM Jeff Jirsa <je...@crowdstrike.com> wrote:
I personally would have gone the other way – if you're seeing ParNew pressure, increasing new gen instead of decreasing it should help garbage get dropped (faster) rather than promoted to survivor/old gen (slower)?



From: Anishek Agarwal
Reply-To: "user@cassandra.apache.org"
Date: Thursday, March 3, 2016 at 8:55 PM

To: "user@cassandra.apache.org<ma...@cassandra.apache.org>"
Subject: Re: Lot of GC on two nodes out of 7

Hello,

Bryan, most of the partition sizes are under 45 KB

I have tried concurrent_compactors: 8 for one of the nodes, still no improvement.
I have tried max_heap_size: 8G, no improvement.

I will try the new heap size of 2G, though I am sure CMS cycles will take longer then.

Also, it doesn't look like I mentioned what type of GC was causing the problems. On both nodes it is the ParNew GC that is taking long for each run, and too many runs are happening in succession.

anishek


On Fri, Mar 4, 2016 at 5:36 AM, Bryan Cheng <br...@blockcypher.com> wrote:
Hi Anishek,

In addition to the good advice others have given, do you notice any abnormally large partitions? What does cfhistograms report for 99% partition size? A few huge partitions will cause very disproportionate load on your cluster, including high GC.
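
A sketch of pulling those numbers (the keyspace and column family names are placeholders; on 2.0 the cfhistograms output is raw histogram buckets rather than percentiles, so compare the largest non-empty row-size buckets across nodes):

nodetool cfhistograms my_keyspace my_cf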

--Bryan

On Wed, Mar 2, 2016 at 9:28 AM, Amit Singh F <am...@ericsson.com> wrote:
Hi Anishek,

We too faced a similar problem in 2.0.14, and after doing some research we changed a few parameters in cassandra.yaml and were able to overcome the GC pauses. Those are listed below (a sketch of the corresponding cassandra.yaml lines follows the list):


·         memtable_flush_writers : increased from 1 to 3; the tpstats output showed dropped mutations, which means writes are getting blocked, so increasing this number helps absorb them.

·         memtable_total_space_in_mb : defaults to 1/4 of the heap size; it can be lowered because large, long-lived objects create pressure on the heap, so it is better to reduce it somewhat.

·         concurrent_compactors : Alain rightly pointed this out, i.e. reduce it to 8. You should try this.
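
A sketch of the corresponding cassandra.yaml lines (the flush writer and compactor counts are the values from this thread; the memtable size is only illustrative and depends on your heap):

memtable_flush_writers: 3
memtable_total_space_in_mb: 1024
concurrent_compactors: 8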

Also, please check whether you have mutation drops on the other nodes or not.

Hope this helps in your cluster too.

Regards
Amit Singh
From: Jonathan Haddad [mailto:jon@jonhaddad.com]
Sent: Wednesday, March 02, 2016 9:33 PM
To: user@cassandra.apache.org
Subject: Re: Lot of GC on two nodes out of 7

Can you post a gist of the output of jstat -gccause (60 seconds worth)?  I think it's cool you're willing to experiment with alternative JVM settings but I've never seen anyone use max tenuring threshold of 50 either and I can't imagine it's helpful.  Keep in mind if your objects are actually reaching that threshold it means they've been copied 50x (really really slow) and also you're going to end up spilling your eden objects directly into your old gen if your survivor is full.  Considering the small amount of memory you're using for heap I'm really not surprised you're running into problems.
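
A sketch of collecting that (the pgrep pattern is an assumption; adjust it for however Cassandra is started, and run it as the user that owns the process):

CASSANDRA_PID=$(pgrep -f CassandraDaemon | head -1)
jstat -gccause "$CASSANDRA_PID" 1000 60 > gccause.txt   # one sample per second for 60 seconds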

I recommend G1GC + 12GB heap and just let it optimize itself for almost all cases with the latest JVM versions.

On Wed, Mar 2, 2016 at 6:08 AM Alain RODRIGUEZ <ar...@gmail.com>> wrote:
It looks like you are doing good work with this cluster and know a lot about the JVM, that's good :-).

our machine configurations are : 2 X 800 GB SSD , 48 cores, 64 GB RAM

That's good hardware too.

With 64 GB of RAM I would probably directly give `MAX_HEAP_SIZE=8G` a try on one of the 2 bad nodes.

Also I would probably try lowering `HEAP_NEWSIZE=2G` and using `-XX:MaxTenuringThreshold=15`, still on the canary node, to observe the effects. But that's just an idea of something I would try to see the impact; I don't think it will solve your current issues, nor make things worse for this node.
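
A sketch of those canary-node settings in cassandra-env.sh (the numbers are just the ones discussed in this thread, not universal recommendations):

MAX_HEAP_SIZE="8G"
HEAP_NEWSIZE="2G"
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=15"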

Using G1GC would allow you to use a bigger Heap size. Using C*2.1 would allow you to store the memtables off-heap. Those are 2 improvements reducing the heap pressure that you might be interested in.

I have spent time reading about all other options before including them and a similar configuration on our other prod cluster is showing good GC graphs via gcviewer.

So, let's look for another reason.

there are MUTATION and READ messages dropped in high numbers on the nodes in question, while on the other 5 nodes it varies between 1-3.

- Is Memory, CPU or disk a bottleneck? Is one of those running at the limits?

concurrent_compactors: 48

Reducing this to 8 would free some space for transactions (R&W requests). It is probably worth a try, even more so when compaction is not keeping up and compaction throughput is not throttled.

Just found an issue about that: https://issues.apache.org/jira/browse/CASSANDRA-7139

Looks like `concurrent_compactors: 8` is the new default.
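
A sketch of applying that (concurrent_compactors is a cassandra.yaml change and needs a restart; the throughput value below is illustrative and can also be set on a live node):

# cassandra.yaml
#   concurrent_compactors: 8
#   compaction_throughput_mb_per_sec: 16

nodetool setcompactionthroughput 16
nodetool compactionstats   # check whether compactions are keeping up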

C*heers,
-----------------------
Alain Rodriguez - alain@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com






2016-03-02 12:27 GMT+01:00 Anishek Agarwal <an...@gmail.com>:
Thanks a lot Alain for the details.
`HEAP_NEWSIZE=4G.` is probably far too high (try 1200M <-> 2G)
`MAX_HEAP_SIZE=6G` might be too low; how much memory is available? (You might want to keep this as is, or even reduce it, if you have less than 16 GB of native memory. Go with 8 GB if you have a lot of memory.)
`-XX:MaxTenuringThreshold=50` is the highest value I have seen in use so far. I had luck with values between 4 <--> 16 in the past. I would give 15 a try.
`-XX:CMSInitiatingOccupancyFraction=70` --> Why not use the default of 75? Using the default and then tuning from there to improve things is generally a good idea.


We have a lot of reads and writes on the system, so we keep the new size high to make sure enough is held in memory, including caches / memtables etc. (number of flush_writers: 4 for us). Similarly, we keep less in the old generation to make sure we spend less time in CMS GC; most of the data is transient in memory for us. We keep a high TenuringThreshold because we don't want objects going to the old generation; they should just die in the young generation, given we have configured large survivor spaces.
using occupancyFraction as 70 since
given the new gen is 4G (of the 6G heap)
each survivor space is : 400 MB -- 2 survivor spaces
70% of 2G (old generation) = 1.4G

so once we are just below 1.4G and we have to move a full survivor space plus some extra during a ParNew GC due to promotion, everything will still fit in the old generation, and it will trigger CMS.
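
Spelling out that arithmetic with the settings quoted elsewhere in the thread (MAX_HEAP_SIZE=6G, HEAP_NEWSIZE=4G, SurvivorRatio=8):

old generation          = 6G - 4G       = 2G
each survivor space     = 4G / (8 + 2)  = ~400 MB
CMS trigger at 70%      = 0.70 * 2G     = ~1.4G
headroom above trigger  = 2G - 1.4G     = ~600 MB, enough for a full survivor space plus some extra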

I have spent time reading about all other options before including them and a similar configuration on our other prod cluster is showing good GC graphs via gcviewer.

tpstats on all machines shows the flush writer blocked at 0.3% of total

the two nodes in question have stats almost as below

  *   specifically there are pending tasks in ReadStage, MutationStage and RequestResponseStage

Pool Name                    Active   Pending      Completed   Blocked  All time blocked

ReadStage                        21        19     2141798645         0                 0

RequestResponseStage              0         1      803242391         0                 0

MutationStage                     0         0      291813703         0                 0

ReadRepairStage                   0         0      200544344         0                 0

ReplicateOnWriteStage             0         0              0         0                 0

GossipStage                       0         0         292477         0                 0

CacheCleanupExecutor              0         0              0         0                 0

MigrationStage                    0         0              0         0                 0

MemoryMeter                       0         0           2172         0                 0

FlushWriter                       0         0           2756         0                 6

ValidationExecutor                0         0            101         0                 0

InternalResponseStage             0         0              0         0                 0

AntiEntropyStage                  0         0            202         0                 0

MemtablePostFlusher               0         0           4395         0                 0

MiscStage                         0         0              0         0                 0

PendingRangeCalculator            0         0             20         0                 0

CompactionExecutor                4         4          49323         0                 0

commitlog_archiver                0         0              0         0                 0

HintedHandoff                     0         0            116         0                 0



Message type           Dropped

RANGE_SLICE                  0

READ_REPAIR                 36

PAGED_RANGE                  0

BINARY                       0

READ                     11471

MUTATION                   898

_TRACE                       0

REQUEST_RESPONSE             0

COUNTER_MUTATION             0

all the other 5 nodes show no pending numbers.


our machine configurations are : 2 X 800 GB SSD , 48 cores, 64 GB RAM
compaction throughput is 0 MB/s
concurrent_compactors: 48
flush_writers: 4


I think Jeff is trying to spot a wide row messing with your system, so looking at the max row size on those nodes compared to the others is more relevant than the average size for this check.

I think this is what you are looking for, please correct me if I am wrong:

Compacted partition maximum bytes: 1629722
similar value on all 7 nodes.

grep -i "ERROR" /var/log/cassandra/system.log

there are MUTATION and READ messages dropped in high numbers on the nodes in question, while on the other 5 nodes it varies between 1-3.

On Wed, Mar 2, 2016 at 4:15 PM, Alain RODRIGUEZ <ar...@gmail.com> wrote:
Hi Anishek,

Even if it highly depends on your workload, here are my thoughts:

`HEAP_NEWSIZE=4G.` is probably far too high (try 1200M <-> 2G)
`MAX_HEAP_SIZE=6G` might be too low; how much memory is available? (You might want to keep this as is, or even reduce it, if you have less than 16 GB of native memory. Go with 8 GB if you have a lot of memory.)
`-XX:MaxTenuringThreshold=50` is the highest value I have seen in use so far. I had luck with values between 4 <--> 16 in the past. I would give 15 a try.
`-XX:CMSInitiatingOccupancyFraction=70` --> Why not use the default of 75? Using the default and then tuning from there to improve things is generally a good idea.

You also use a bunch of options I don't know about. If you are uncertain about them, you could try a default conf without the options you added, using just the changes above on top of the default https://github.com/apache/cassandra/blob/cassandra-2.0/conf/cassandra-env.sh. Or you might find more useful information in a nice reference on this topic, Al Tobey's blog post about tuning 2.1. Go to the 'Java Virtual Machine' part: https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html

FWIW, I also saw improvement in the past by upgrading to 2.1, Java 8 and G1GC. G1GC is supposed to be easier to configure too.

the average row size for compacted partitions is about 1640 bytes on all nodes. We have replication factor 3 but the problem is only on two nodes.

I think Jeff is trying to spot a wide row messing with your system, so looking at the max row size on those nodes compared to the others is more relevant than the average size for this check.

the only other thing that stands out in cfstats is that the read time and write time on the nodes with high GC are 5-7 times higher than on the other 5 nodes, but I think that's expected.

I would probably look at this the reverse way: I imagine that the extra GC is a consequence of something going wrong on those nodes, as the JVM / GC are configured the same way cluster-wide. GC / JVM issues are often due to Cassandra / system / hardware issues inducing extra pressure on the JVM. I would try to tune the JVM / GC only once the system is healthy. I have often seen high GC being a consequence rather than the root cause of an issue.

To explore this possibility:

Does this command show some dropped or blocked tasks? This would add pressure to heap.
nodetool tpstats

Do you have errors in logs? Always good to know when facing an issue.
grep -i "ERROR" /var/log/cassandra/system.log

How are compactions tuned (throughput + concurrent compactors)? This tuning might explain compactions not keeping up or a high GC pressure.

What are your disks / CPU? That would help us give you good values to try.

Is there some iowait? It could point to a bottleneck or bad hardware.
iostat -mx 5 100

...

Hope one of those will point you to an issue, but there are many more things you could check.
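
A minimal sketch pulling those checks together on one node (the log path and iostat interval are assumptions):

nodetool tpstats > tpstats.txt
grep -i "ERROR" /var/log/cassandra/system.log > errors.txt
nodetool compactionstats > compactions.txt
iostat -mx 5 12 > iostat.txt   # about a minute of disk stats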

Let us know how it goes,

C*heers,
-----------------------
Alain Rodriguez - alain@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com



2016-03-02 10:33 GMT+01:00 Anishek Agarwal <an...@gmail.com>:
also MAX_HEAP_SIZE=6G and HEAP_NEWSIZE=4G.

On Wed, Mar 2, 2016 at 1:40 PM, Anishek Agarwal <an...@gmail.com> wrote:
Hey Jeff,

One of the nodes with high GC has 1400 SSTables, all other nodes have about 500-900 SSTables. The other node with high GC has 636 SSTables.

The average row size for compacted partitions is about 1640 bytes on all nodes. We have replication factor 3, but the problem is only on two nodes.
The only other thing that stands out in cfstats is that the read time and write time on the nodes with high GC are 5-7 times higher than on the other 5 nodes, but I think that's expected.

thanks
anishek




On Wed, Mar 2, 2016 at 1:09 PM, Jeff Jirsa <je...@crowdstrike.com> wrote:
Compaction falling behind will likely cause additional work on reads (more sstables to merge), but I'd be surprised if it manifested in super long GC. When you say twice as many sstables, how many is that?

In cfstats, does anything stand out? Is max row size on those nodes larger than on other nodes?

What you don’t show in your JVM options is the new gen size – if you do have unusually large partitions on those two nodes (especially likely if you have rf=2 – if you have rf=3, then there’s probably a third node misbehaving you haven’t found yet), then raising new gen size can help handle the garbage created by reading large partitions without having to tolerate the promotion. Estimates for the amount of garbage vary, but it could be “gigabytes” of garbage on a very wide partition (see https://issues.apache.org/jira/browse/CASSANDRA-9754 for work in progress to help mitigate that type of pain).
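
One quick way to compare that across nodes (the host names are placeholders; assumes ssh access and nodetool on each node):

for h in node1 node2 node3 node4 node5 node6 node7; do
  echo "== $h =="
  ssh "$h" nodetool cfstats | grep "Compacted partition maximum bytes"
done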

- Jeff

From: Anishek Agarwal
Reply-To: "user@cassandra.apache.org"
Date: Tuesday, March 1, 2016 at 11:12 PM
To: "user@cassandra.apache.org<ma...@cassandra.apache.org>"
Subject: Lot of GC on two nodes out of 7

Hello,

we have a Cassandra cluster of 7 nodes, all of them have the same JVM GC configurations, and all our writes / reads use the TokenAware policy wrapping a DCAware policy. All nodes are part of the same datacenter.

We are seeing that two nodes have high GC collection times. They mostly seem to spend time in GC, about 300-600 ms per collection. This also seems to result in higher CPU utilisation on these machines. The other 5 nodes don't have this problem.

There is no additional repair activity going on in the cluster, and we are not sure why this is happening.
We checked cfhistograms on the two CFs we have in the cluster and the number of reads seems to be almost the same.

We also used cfstats to see the number of sstables on each node, and one of the nodes with the above problem has twice the number of sstables as the other nodes. This still does not explain why two nodes have high GC overheads. Our GC config is as below:

JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"

JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"

JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"

JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"

JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=50"

JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=70"

JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"

JVM_OPTS="$JVM_OPTS -XX:+UseTLAB"

JVM_OPTS="$JVM_OPTS -XX:MaxPermSize=256m"

JVM_OPTS="$JVM_OPTS -XX:+AggressiveOpts"

JVM_OPTS="$JVM_OPTS -XX:+UseCompressedOops"

JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"

JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=48"

JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=48"

JVM_OPTS="$JVM_OPTS -XX:-ExplicitGCInvokesConcurrent"

JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"

JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"

JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"

# earlier value 131072 = 32768 * 4

JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=131072"

JVM_OPTS="$JVM_OPTS -XX:CMSScheduleRemarkEdenSizeThreshold=104857600"

JVM_OPTS="$JVM_OPTS -XX:CMSRescanMultiple=32768"

JVM_OPTS="$JVM_OPTS -XX:CMSConcMarkMultiple=32768"

#new

JVM_OPTS="$JVM_OPTS -XX:+CMSConcurrentMTEnabled"

We are using Cassandra 2.0.17. If anyone has any suggestion as to what else we can look for to understand why this is happening, please do reply.



Thanks
anishek











Re: Lot of GC on two nodes out of 7

Posted by Jonathan Haddad <jo...@jonhaddad.com>.
Without looking at your GC logs (you never posted a gist), my assumption
would be you're doing a lot of copying between survivor generations, and
they're taking a long time.  You're probably also copying a lot of data to
your old gen as a result of having full-ish survivor spaces to begin with.

On Thu, Mar 3, 2016 at 10:26 PM Jeff Jirsa <je...@crowdstrike.com>
wrote:

> I’d personally would have gone the other way – if you’re seeing parnew,
> increasing new gen instead of decreasing it should help drop (faster)
> rather than promoting to sv/oldgen (slower) ?
>
>
>
> From: Anishek Agarwal
> Reply-To: "user@cassandra.apache.org"
> Date: Thursday, March 3, 2016 at 8:55 PM
>
> To: "user@cassandra.apache.org"
> Subject: Re: Lot of GC on two nodes out of 7
>
> Hello,
>
> Bryan, most of the partition sizes are under 45 KB
>
> I have tried with concurrent_compactors : 8 for one of the nodes still no
> improvement,
> I have tried max_heap_Size : 8G, no improvement.
>
> I will try the newHeapsize of 2G though i am sure CMS will be a longer
> then.
>
> Also doesn't look like i mentioned what type of GC was causing the
> problems. On both the nodes its the ParNewGC thats taking long for each run
> and too many runs are happening in succession.
>
> anishek
>
>
> On Fri, Mar 4, 2016 at 5:36 AM, Bryan Cheng <br...@blockcypher.com> wrote:
>
>> Hi Anishek,
>>
>> In addition to the good advice others have given, do you notice any
>> abnormally large partitions? What does cfhistograms report for 99%
>> partition size? A few huge partitions will cause very disproportionate load
>> on your cluster, including high GC.
>>
>> --Bryan
>>
>> On Wed, Mar 2, 2016 at 9:28 AM, Amit Singh F <am...@ericsson.com>
>> wrote:
>>
>>> Hi Anishek,
>>>
>>>
>>>
>>> We too faced similar problem in 2.0.14 and after doing some research we
>>> config few parameters in Cassandra.yaml and was able to overcome GC pauses
>>> . Those are :
>>>
>>>
>>>
>>> ·         memtable_flush_writers : increased from 1 to 3 as from
>>> tpstats output  we can see mutations dropped so it means writes are getting
>>> blocked, so increasing number will have those catered.
>>>
>>> ·         memtable_total_space_in_mb : Default (1/4 of heap size), can
>>> lowered because larger long lived objects will create pressure on HEAP, so
>>> its better to reduce some amount of size.
>>>
>>> ·         Concurrent_compactors : Alain righlty pointed out this i.e
>>> reduce it to 8. You need to try this.
>>>
>>>
>>>
>>> Also please check whether you have mutations drop in other nodes or not.
>>>
>>>
>>>
>>> Hope this helps in your cluster too.
>>>
>>>
>>>
>>> Regards
>>>
>>> Amit Singh
>>>
>>> *From:* Jonathan Haddad [mailto:jon@jonhaddad.com]
>>> *Sent:* Wednesday, March 02, 2016 9:33 PM
>>> *To:* user@cassandra.apache.org
>>> *Subject:* Re: Lot of GC on two nodes out of 7
>>>
>>>
>>>
>>> Can you post a gist of the output of jstat -gccause (60 seconds worth)?
>>> I think it's cool you're willing to experiment with alternative JVM
>>> settings but I've never seen anyone use max tenuring threshold of 50 either
>>> and I can't imagine it's helpful.  Keep in mind if your objects are
>>> actually reaching that threshold it means they've been copied 50x (really
>>> really slow) and also you're going to end up spilling your eden objects
>>> directly into your old gen if your survivor is full.  Considering the small
>>> amount of memory you're using for heap I'm really not surprised you're
>>> running into problems.
>>>
>>>
>>>
>>> I recommend G1GC + 12GB heap and just let it optimize itself for almost
>>> all cases with the latest JVM versions.
>>>
>>>
>>>
>>> On Wed, Mar 2, 2016 at 6:08 AM Alain RODRIGUEZ <ar...@gmail.com>
>>> wrote:
>>>
>>> It looks like you are doing a good work with this cluster and know a lot
>>> about JVM, that's good :-).
>>>
>>>
>>>
>>> our machine configurations are : 2 X 800 GB SSD , 48 cores, 64 GB RAM
>>>
>>>
>>>
>>> That's good hardware too.
>>>
>>>
>>>
>>> With 64 GB of ram I would probably directly give a try to
>>> `MAX_HEAP_SIZE=8G` on one of the 2 bad nodes probably.
>>>
>>>
>>>
>>> Also I would also probably try lowering `HEAP_NEWSIZE=2G.` and using
>>> `-XX:MaxTenuringThreshold=15`, still on the canary node to observe the
>>> effects. But that's just an idea of something I would try to see the
>>> impacts, I don't think it will solve your current issues or even make it
>>> worse for this node.
>>>
>>>
>>>
>>> Using G1GC would allow you to use a bigger Heap size. Using C*2.1 would
>>> allow you to store the memtables off-heap. Those are 2 improvements
>>> reducing the heap pressure that you might be interested in.
>>>
>>>
>>>
>>> I have spent time reading about all other options before including them
>>> and a similar configuration on our other prod cluster is showing good GC
>>> graphs via gcviewer.
>>>
>>>
>>>
>>> So, let's look for an other reason.
>>>
>>>
>>>
>>> there are MUTATION and READ messages dropped in high number on nodes in
>>> question and on other 5 nodes it varies between 1-3.
>>>
>>>
>>>
>>> - Is Memory, CPU or disk a bottleneck? Is one of those running at the
>>> limits?
>>>
>>>
>>>
>>> concurrent_compactors: 48
>>>
>>>
>>>
>>> Reducing this to 8 would free some space for transactions (R&W
>>> requests). It is probably worth a try, even more when compaction is not
>>> keeping up and compaction throughput is not throttled.
>>>
>>>
>>>
>>> Just found an issue about that:
>>> https://issues.apache.org/jira/browse/CASSANDRA-7139
>>>
>>>
>>>
>>> Looks like `concurrent_compactors: 8` is the new default.
>>>
>>>
>>>
>>> C*heers,
>>>
>>> -----------------------
>>>
>>> Alain Rodriguez - alain@thelastpickle.com
>>>
>>> France
>>>
>>>
>>>
>>> The Last Pickle - Apache Cassandra Consulting
>>>
>>> http://www.thelastpickle.com
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> 2016-03-02 12:27 GMT+01:00 Anishek Agarwal <an...@gmail.com>:
>>>
>>> Thanks a lot Alian for the details.
>>>
>>> `HEAP_NEWSIZE=4G.` is probably far too high (try 1200M <-> 2G)
>>> `MAX_HEAP_SIZE=6G` might be too low, how much memory is available (You
>>> might want to keep this as it or even reduce it if you have less than 16 GB
>>> of native memory. Go with 8 GB if you have a lot of memory.
>>> `-XX:MaxTenuringThreshold=50` is the highest value I have seen in use so
>>> far. I had luck with values between 4 <--> 16 in the past. I would give  a
>>> try with 15.
>>> `-XX:CMSInitiatingOccupancyFraction=70`--> Why not using default - 75 ?
>>> Using default and then tune from there to improve things is generally a
>>> good idea.
>>>
>>>
>>>
>>>
>>>
>>> we have a lot of reads and writes onto the system so keeping the high
>>> new size to make sure enough is held in memory including caches / memtables
>>> etc --number of flush_writers : 4 for us. similarly keeping less in old
>>> generation to make sure we spend less time with CMS GC most of the data is
>>> transient in memory for us. Keeping high TenuringThreshold because we don't
>>> want objects going to old generation and just die in young generation given
>>> we have configured large survivor spaces.
>>>
>>> using occupancyFraction as 70 since
>>>
>>> given heap is 4G
>>>
>>> survivor space is : 400 mb -- 2 survivor spaces
>>>
>>> 70 % of 2G (old generation) = 1.4G
>>>
>>>
>>>
>>> so once we are just below 1.4G and we have to move the full survivor +
>>> some extra during a par new gc due to promotion failure, everything will
>>> fit in old generation, and will trigger CMS.
>>>
>>>
>>>
>>> I have spent time reading about all other options before including them
>>> and a similar configuration on our other prod cluster is showing good GC
>>> graphs via gcviewer.
>>>
>>>
>>>
>>> tp stats on all machines show flush writer blocked at : 0.3% of total
>>>
>>>
>>>
>>> the two nodes in question have stats almost as below
>>>
>>>    - specifically there are pending was in readStage, MutationStage and
>>>    RequestResponseStage
>>>
>>> Pool Name                    Active   Pending      Completed   Blocked
>>> All time blocked
>>>
>>> ReadStage                        21        19     2141798645         0
>>>               0
>>>
>>> RequestResponseStage              0         1      803242391         0
>>>               0
>>>
>>> MutationStage                     0         0      291813703         0
>>>               0
>>>
>>> ReadRepairStage                   0         0      200544344         0
>>>               0
>>>
>>> ReplicateOnWriteStage             0         0              0         0
>>>               0
>>>
>>> GossipStage                       0         0         292477         0
>>>               0
>>>
>>> CacheCleanupExecutor              0         0              0         0
>>>               0
>>>
>>> MigrationStage                    0         0              0         0
>>>               0
>>>
>>> MemoryMeter                       0         0           2172         0
>>>               0
>>>
>>> FlushWriter                       0         0           2756         0
>>>               6
>>>
>>> ValidationExecutor                0         0            101         0
>>>               0
>>>
>>> InternalResponseStage             0         0              0         0
>>>               0
>>>
>>> AntiEntropyStage                  0         0            202         0
>>>               0
>>>
>>> MemtablePostFlusher               0         0           4395         0
>>>               0
>>>
>>> MiscStage                         0         0              0         0
>>>               0
>>>
>>> PendingRangeCalculator            0         0             20         0
>>>               0
>>>
>>> CompactionExecutor                4         4          49323         0
>>>               0
>>>
>>> commitlog_archiver                0         0              0         0
>>>               0
>>>
>>> HintedHandoff                     0         0            116         0
>>>               0
>>>
>>>
>>>
>>> Message type           Dropped
>>>
>>> RANGE_SLICE                  0
>>>
>>> READ_REPAIR                 36
>>>
>>> PAGED_RANGE                  0
>>>
>>> BINARY                       0
>>>
>>> READ                     11471
>>>
>>> MUTATION                   898
>>>
>>> _TRACE                       0
>>>
>>> REQUEST_RESPONSE             0
>>>
>>> COUNTER_MUTATION             0
>>>
>>>
>>>
>>> all the other 5 nodes show no pending numbers.
>>>
>>>
>>>
>>>
>>>
>>> our machine configurations are : 2 X 800 GB SSD , 48 cores, 64 GB RAM
>>>
>>> compaction throughput is 0 MB/s
>>>
>>> concurrent_compactors: 48
>>>
>>> flush_writers: 4
>>>
>>>
>>>
>>>
>>> I think Jeff is trying to spot a wide row messing with your system, so
>>> looking at the max row size on those nodes compared to other is more
>>> relevant than average size for this check.
>>>
>>>
>>>
>>> i think is what you are looking for, please correct me if i am wrong
>>>
>>> Compacted partition maximum bytes: 1629722
>>>
>>> similar value on all 7 nodes.
>>>
>>>
>>>
>>> grep -i "ERROR" /var/log/cassandra/system.log
>>>
>>>
>>>
>>> there are MUTATION and READ messages dropped in high number on nodes in
>>> question and on other 5 nodes it varies between 1-3.
>>>
>>>
>>>
>>> On Wed, Mar 2, 2016 at 4:15 PM, Alain RODRIGUEZ <ar...@gmail.com>
>>> wrote:
>>>
>>> Hi Anishek,
>>>
>>>
>>>
>>> Even if it highly depends on your workload, here are my thoughts:
>>>
>>>
>>>
>>> `HEAP_NEWSIZE=4G.` is probably far too high (try 1200M <-> 2G)
>>>
>>> `MAX_HEAP_SIZE=6G` might be too low, how much memory is available (You
>>> might want to keep this as it or even reduce it if you have less than 16 GB
>>> of native memory. Go with 8 GB if you have a lot of memory.
>>>
>>> `-XX:MaxTenuringThreshold=50` is the highest value I have seen in use so
>>> far. I had luck with values between 4 <--> 16 in the past. I would give  a
>>> try with 15.
>>>
>>> `-XX:CMSInitiatingOccupancyFraction=70`--> Why not using default - 75 ?
>>> Using default and then tune from there to improve things is generally a
>>> good idea.
>>>
>>>
>>>
>>> You also use a bunch of option I don't know about, if you are uncertain
>>> about them, you could try a default conf without the options you added and
>>> just the using the changes above from default
>>> https://github.com/apache/cassandra/blob/cassandra-2.0/conf/cassandra-env.sh.
>>> Or you might find more useful information on a nice reference about this
>>> topic which is Al Tobey's blog post about tuning 2.1. Go to the 'Java
>>> Virtual Machine' part:
>>> https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html
>>>
>>>
>>>
>>> FWIW, I also saw improvement in the past by upgrading to 2.1, Java 8 and
>>> G1GC. G1GC is supposed to be easier to configure too.
>>>
>>>
>>>
>>> the average row size for compacted partitions is about 1640 bytes on all
>>> nodes. We have replication factor 3 but the problem is only on two nodes.
>>>
>>>
>>>
>>> I think Jeff is trying to spot a wide row messing with your system, so
>>> looking at the max row size on those nodes compared to other is more
>>> relevant than average size for this check.
>>>
>>>
>>>
>>> the only other thing that stands out in cfstats is the read time and
>>> write time on the nodes with high GC is 5-7 times higher than other 5
>>> nodes, but i think thats expected.
>>>
>>>
>>>
>>> I would probably look at this the reverse way: I imagine that extra GC
>>>  is a consequence of something going wrong on those nodes as JVM / GC are
>>> configured the same way cluster-wide. GC / JVM issues are often due to
>>> Cassandra / system / hardware issues, inducing extra pressure on the JVM. I
>>> would try to tune JVM / GC only once the system is healthy. So I often saw
>>> high GC being a consequence rather than the root cause of an issue.
>>>
>>>
>>>
>>> To explore this possibility:
>>>
>>>
>>>
>>> Does this command show some dropped or blocked tasks? This would add
>>> pressure to heap.
>>>
>>> nodetool tpstats
>>>
>>>
>>>
>>> Do you have errors in logs? Always good to know when facing an issue.
>>>
>>> grep -i "ERROR" /var/log/cassandra/system.log
>>>
>>>
>>>
>>> How are compactions tuned (throughput + concurrent compactors)? This
>>> tuning might explain compactions not keeping up or a high GC pressure.
>>>
>>>
>>>
>>> What are your disks / CPU? To help us giving you good arbitrary values
>>> to try.
>>>
>>>
>>>
>>> Is there some iowait ? Could point to a bottleneck or bad hardware.
>>>
>>> iostat -mx 5 100
>>>
>>>
>>>
>>> ...
>>>
>>>
>>>
>>> Hope one of those will point you to an issue, but there are many more
>>> thing you could check.
>>>
>>>
>>>
>>> Let us know how it goes,
>>>
>>>
>>>
>>> C*heers,
>>>
>>> -----------------------
>>>
>>> Alain Rodriguez - alain@thelastpickle.com
>>>
>>> France
>>>
>>>
>>>
>>> The Last Pickle - Apache Cassandra Consulting
>>>
>>> http://www.thelastpickle.com
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> 2016-03-02 10:33 GMT+01:00 Anishek Agarwal <an...@gmail.com>:
>>>
>>> also MAX_HEAP_SIZE=6G and HEAP_NEWSIZE=4G.
>>>
>>>
>>>
>>> On Wed, Mar 2, 2016 at 1:40 PM, Anishek Agarwal <an...@gmail.com>
>>> wrote:
>>>
>>> Hey Jeff,
>>>
>>>
>>>
>>> one of the nodes with high GC has 1400 SST tables, all other nodes have
>>> about 500-900 SST tables. the other node with high GC has 636 SST tables.
>>>
>>>
>>>
>>> the average row size for compacted partitions is about 1640 bytes on all
>>> nodes. We have replication factor 3 but the problem is only on two nodes.
>>>
>>> the only other thing that stands out in cfstats is the read time and
>>> write time on the nodes with high GC is 5-7 times higher than other 5
>>> nodes, but i think thats expected.
>>>
>>>
>>>
>>> thanks
>>>
>>> anishek
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Mar 2, 2016 at 1:09 PM, Jeff Jirsa <je...@crowdstrike.com>
>>> wrote:
>>>
>>> Compaction falling behind will likely cause additional work on reads
>>> (more sstables to merge), but I’d be surprised if it manifested in super
>>> long GC. When you say twice as many sstables, how many is that?.
>>>
>>>
>>>
>>> In cfstats, does anything stand out? Is max row size on those nodes
>>> larger than on other nodes?
>>>
>>>
>>>
>>> What you don’t show in your JVM options is the new gen size – if you do
>>> have unusually large partitions on those two nodes (especially likely if
>>> you have rf=2 – if you have rf=3, then there’s probably a third node
>>> misbehaving you haven’t found yet), then raising new gen size can help
>>> handle the garbage created by reading large partitions without having to
>>> tolerate the promotion. Estimates for the amount of garbage vary, but it
>>> could be “gigabytes” of garbage on a very wide partition (see
>>> https://issues.apache.org/jira/browse/CASSANDRA-9754 for work in
>>> progress to help mitigate that type of pain).
>>>
>>>
>>>
>>> - Jeff
>>>
>>>
>>>
>>> *From: *Anishek Agarwal
>>> *Reply-To: *"user@cassandra.apache.org"
>>> *Date: *Tuesday, March 1, 2016 at 11:12 PM
>>> *To: *"user@cassandra.apache.org"
>>> *Subject: *Lot of GC on two nodes out of 7
>>>
>>>
>>>
>>> Hello,
>>>
>>>
>>>
>>> we have a cassandra cluster of 7 nodes, all of them have the same JVM GC
>>> configurations, all our writes /  reads use the TokenAware Policy wrapping
>>> a DCAware policy. All nodes are part of same Datacenter.
>>>
>>>
>>>
>>> We are seeing that two nodes are having high GC collection times. Then
>>> mostly seem to spend time in GC like about 300-600 ms. This also seems to
>>> result in higher CPU utilisation on these machines. Other  5 nodes don't
>>> have this problem.
>>>
>>>
>>>
>>> There is no additional repair activity going on the cluster, we are not
>>> sure why this is happening.
>>>
>>> we checked cfhistograms on the two CF we have in the cluster and number
>>> of reads seems to be almost same.
>>>
>>>
>>>
>>> we also used cfstats to see the number of ssttables on each node and one
>>> of the nodes with the above problem has twice the number of ssttables than
>>> other nodes. This still doesnot explain why two nodes have high GC
>>> Overheads. our GC config is as below:
>>>
>>> JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
>>>
>>> JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
>>>
>>> JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
>>>
>>> JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
>>>
>>> JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=50"
>>>
>>> JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=70"
>>>
>>> JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
>>>
>>> JVM_OPTS="$JVM_OPTS -XX:+UseTLAB"
>>>
>>> JVM_OPTS="$JVM_OPTS -XX:MaxPermSize=256m"
>>>
>>> JVM_OPTS="$JVM_OPTS -XX:+AggressiveOpts"
>>>
>>> JVM_OPTS="$JVM_OPTS -XX:+UseCompressedOops"
>>>
>>> JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"
>>>
>>> JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=48"
>>>
>>> JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=48"
>>>
>>> JVM_OPTS="$JVM_OPTS -XX:-ExplicitGCInvokesConcurrent"
>>>
>>> JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
>>>
>>> JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"
>>>
>>> JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"
>>>
>>> # earlier value 131072 = 32768 * 4
>>>
>>> JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=131072"
>>>
>>> JVM_OPTS="$JVM_OPTS -XX:CMSScheduleRemarkEdenSizeThreshold=104857600"
>>>
>>> JVM_OPTS="$JVM_OPTS -XX:CMSRescanMultiple=32768"
>>>
>>> JVM_OPTS="$JVM_OPTS -XX:CMSConcMarkMultiple=32768"
>>>
>>> #new
>>>
>>> JVM_OPTS="$JVM_OPTS -XX:+CMSConcurrentMTEnabled"
>>>
>>>
>>>
>>> We are using cassandra 2.0.17. If anyone has any suggestion as to how
>>> what else we can look for to understand why this is happening please do
>>> reply.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> Thanks
>>>
>>> anishek
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>

Re: Lot of GC on two nodes out of 7

Posted by Jeff Jirsa <je...@crowdstrike.com>.
I personally would have gone the other way: if you're seeing ParNew pressure, increasing the new gen instead of decreasing it should let garbage be dropped there (faster) rather than promoted to the survivor / old gen (slower)?
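For anyone who wants to try that direction on a canary node, a minimal sketch of what it could look like in cassandra-env.sh (values are illustrative, not something specified in this thread):

# illustrative only: grow the total heap and keep the new gen large,
# so short-lived read garbage dies in eden instead of being promoted
MAX_HEAP_SIZE="8G"
HEAP_NEWSIZE="4G"    # i.e. do not shrink this; some would even try a bit more here

Whether this or the smaller-new-gen experiment suggested further down wins depends on how much of the allocation really is short-lived, which is what the jstat output requested later in the thread should show.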



From:  Anishek Agarwal
Reply-To:  "user@cassandra.apache.org"
Date:  Thursday, March 3, 2016 at 8:55 PM
To:  "user@cassandra.apache.org"
Subject:  Re: Lot of GC on two nodes out of 7

Hello, 

Bryan, most of the partition sizes are under 45 KB

I have tried concurrent_compactors: 8 on one of the nodes, still no improvement,
I have tried MAX_HEAP_SIZE: 8G, no improvement.

I will try the new heap size of 2G, though I am sure CMS will take longer then.

Also, it doesn't look like I mentioned what type of GC was causing the problems. On both nodes it's ParNew GC that's taking long on each run, and too many runs are happening in succession.

anishek


On Fri, Mar 4, 2016 at 5:36 AM, Bryan Cheng <br...@blockcypher.com> wrote:
Hi Anishek, 

In addition to the good advice others have given, do you notice any abnormally large partitions? What does cfhistograms report for 99% partition size? A few huge partitions will cause very disproportionate load on your cluster, including high GC.

--Bryan

On Wed, Mar 2, 2016 at 9:28 AM, Amit Singh F <am...@ericsson.com> wrote:
Hi Anishek,

 

We too faced a similar problem in 2.0.14, and after doing some research we configured a few parameters in cassandra.yaml and were able to overcome the GC pauses. Those are:

- memtable_flush_writers : increased from 1 to 3. The tpstats output shows dropped mutations, which means writes are getting blocked, so more flush writers will help absorb them.

- memtable_total_space_in_mb : defaults to 1/4 of the heap size; it can be lowered, because large long-lived objects create pressure on the heap, so it is better to reduce it somewhat.

- concurrent_compactors : Alain rightly pointed this out, i.e. reduce it to 8. You need to try this. (A cassandra.yaml sketch of these settings follows a little further down.)

 

Also please check whether you have mutations drop in other nodes or not.
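A sketch of the cassandra.yaml changes from the list above (the memtable_total_space_in_mb figure is only an example; pick one that suits your heap):

# cassandra.yaml, illustrative values based on the suggestions above
memtable_flush_writers: 3          # up from the default of 1
memtable_total_space_in_mb: 1024   # example figure, lower than the 1/4-of-heap default
concurrent_compactors: 8           # down from 48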

 

Hope this helps in your cluster too.

 

Regards

Amit Singh

From: Jonathan Haddad [mailto:jon@jonhaddad.com] 
Sent: Wednesday, March 02, 2016 9:33 PM
To: user@cassandra.apache.org
Subject: Re: Lot of GC on two nodes out of 7

 

Can you post a gist of the output of jstat -gccause (60 seconds worth)?  I think it's cool you're willing to experiment with alternative JVM settings but I've never seen anyone use max tenuring threshold of 50 either and I can't imagine it's helpful.  Keep in mind if your objects are actually reaching that threshold it means they've been copied 50x (really really slow) and also you're going to end up spilling your eden objects directly into your old gen if your survivor is full.  Considering the small amount of memory you're using for heap I'm really not surprised you're running into problems.  
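For reference, one way to capture roughly 60 seconds of that output (the pgrep pattern is an assumption about how the daemon shows up in your process list):

# one sample per second, 60 samples; assumes a single Cassandra JVM on the host
jstat -gccause "$(pgrep -f CassandraDaemon)" 1000 60 > gccause.log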

 

I recommend G1GC + 12GB heap and just let it optimize itself for almost all cases with the latest JVM versions.
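A minimal sketch of that setup in cassandra-env.sh, assuming Java 8 (the G1 flags replace the CMS/ParNew block rather than being added to it; the pause target is illustrative):

MAX_HEAP_SIZE="12G"
# do not set HEAP_NEWSIZE / -Xmn with G1; let the collector size the young gen itself
JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
JVM_OPTS="$JVM_OPTS -XX:MaxGCPauseMillis=200"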

 

On Wed, Mar 2, 2016 at 6:08 AM Alain RODRIGUEZ <ar...@gmail.com> wrote:

It looks like you are doing a good work with this cluster and know a lot about JVM, that's good :-).

 

our machine configurations are : 2 X 800 GB SSD , 48 cores, 64 GB RAM

 

That's good hardware too.

 

With 64 GB of ram I would probably directly give a try to `MAX_HEAP_SIZE=8G` on one of the 2 bad nodes probably.

 

I would also probably try lowering `HEAP_NEWSIZE=2G` and using `-XX:MaxTenuringThreshold=15`, still on the canary node, to observe the effects. But that's just an idea of something I would try to see the impact; I don't think it will solve your current issues, or make them worse, for this node.
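In cassandra-env.sh terms, that canary experiment would look roughly like this (the 8G heap is taken from the earlier suggestion; apply it to one node only):

MAX_HEAP_SIZE="8G"
HEAP_NEWSIZE="2G"
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=15"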

 

Using G1GC would allow you to use a bigger Heap size. Using C*2.1 would allow you to store the memtables off-heap. Those are 2 improvements reducing the heap pressure that you might be interested in.
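If you do move to 2.1, the off-heap memtable option mentioned here is a single cassandra.yaml setting (a sketch; check the 2.1 documentation for which variant fits your workload):

# cassandra.yaml, Cassandra 2.1+
memtable_allocation_type: offheap_objects   # or offheap_buffers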

 

I have spent time reading about all other options before including them and a similar configuration on our other prod cluster is showing good GC graphs via gcviewer.

 

So, let's look for an other reason. 

 

there are MUTATION and READ messages dropped in high number on nodes in question and on other 5 nodes it varies between 1-3.

 

- Is Memory, CPU or disk a bottleneck? Is one of those running at the limits?

 

concurrent_compactors: 48

 

Reducing this to 8 would free some space for transactions (R&W requests). It is probably worth a try, even more when compaction is not keeping up and compaction throughput is not throttled.

 

Just found an issue about that: https://issues.apache.org/jira/browse/CASSANDRA-7139

 

Looks like `concurrent_compactors: 8` is the new default.
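To check and adjust the compaction side on a live node, something like this should work with 2.0-era nodetool (concurrent_compactors itself is a cassandra.yaml change and needs a restart; the throughput value is illustrative):

nodetool compactionstats               # pending compactions and what is currently running
nodetool setcompactionthroughput 16    # throttle compaction to 16 MB/s at runtime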

 

C*heers,

-----------------------

Alain Rodriguez - alain@thelastpickle.com

France

 

The Last Pickle - Apache Cassandra Consulting

http://www.thelastpickle.com

 

 

 

 

 

 

2016-03-02 12:27 GMT+01:00 Anishek Agarwal <an...@gmail.com>:

Thanks a lot Alain for the details.

`HEAP_NEWSIZE=4G.` is probably far too high (try 1200M <-> 2G)
`MAX_HEAP_SIZE=6G` might be too low, how much memory is available (You might want to keep this as it or even reduce it if you have less than 16 GB of native memory. Go with 8 GB if you have a lot of memory.
`-XX:MaxTenuringThreshold=50` is the highest value I have seen in use so far. I had luck with values between 4 <--> 16 in the past. I would give  a try with 15.
`-XX:CMSInitiatingOccupancyFraction=70`--> Why not using default - 75 ? Using default and then tune from there to improve things is generally a good idea.

 

 

we have a lot of reads and writes on the system, so we keep the new gen large to make sure enough is held in memory, including caches / memtables etc. (number of flush_writers: 4 for us). Similarly, we keep less in the old generation to make sure we spend less time in CMS GC, since most of the data is transient in memory for us. We keep a high MaxTenuringThreshold because we don't want objects going to the old generation; they should just die in the young generation, given we have configured large survivor spaces.

we use CMSInitiatingOccupancyFraction of 70 because:

given the new gen (HEAP_NEWSIZE) is 4G,

each survivor space is 400 MB (there are 2 survivor spaces),

and 70% of the 2G old generation = 1.4G,

so once the old generation is just below 1.4G and we have to move the full survivor space plus some extra into it during a ParNew GC (due to promotion failure), everything will still fit in the old generation, and CMS will be triggered.
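Spelling out where those numbers come from (SurvivorRatio=8 splits the new gen 8:1:1 between eden and the two survivor spaces):

new gen       = HEAP_NEWSIZE                 = 4G
eden          = 4G * 8/10                    = ~3.2G
each survivor = 4G * 1/10                    = ~400 MB
old gen       = MAX_HEAP_SIZE - HEAP_NEWSIZE = 6G - 4G = 2G
CMS trigger   = 70% of 2G (CMSInitiatingOccupancyFraction=70) = ~1.4G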

 

I have spent time reading about all other options before including them and a similar configuration on our other prod cluster is showing good GC graphs via gcviewer.

 

tp stats on all machines show flush writer blocked at : 0.3% of total

 

the two nodes in question have stats almost as below;

specifically, there are pending tasks in ReadStage, MutationStage and RequestResponseStage
Pool Name                    Active   Pending      Completed   Blocked  All time blocked

ReadStage                        21        19     2141798645         0                 0

RequestResponseStage              0         1      803242391         0                 0

MutationStage                     0         0      291813703         0                 0

ReadRepairStage                   0         0      200544344         0                 0

ReplicateOnWriteStage             0         0              0         0                 0

GossipStage                       0         0         292477         0                 0

CacheCleanupExecutor              0         0              0         0                 0

MigrationStage                    0         0              0         0                 0

MemoryMeter                       0         0           2172         0                 0

FlushWriter                       0         0           2756         0                 6

ValidationExecutor                0         0            101         0                 0

InternalResponseStage             0         0              0         0                 0

AntiEntropyStage                  0         0            202         0                 0

MemtablePostFlusher               0         0           4395         0                 0

MiscStage                         0         0              0         0                 0

PendingRangeCalculator            0         0             20         0                 0

CompactionExecutor                4         4          49323         0                 0

commitlog_archiver                0         0              0         0                 0

HintedHandoff                     0         0            116         0                 0

 

Message type           Dropped

RANGE_SLICE                  0

READ_REPAIR                 36

PAGED_RANGE                  0

BINARY                       0

READ                     11471

MUTATION                   898

_TRACE                       0

REQUEST_RESPONSE             0

COUNTER_MUTATION             0

 

all the other 5 nodes show no pending numbers.
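One way to keep an eye on whether those pending counts keep climbing on the two bad nodes (plain shell, assuming watch is available):

watch -n 5 'nodetool tpstats | grep -E "Pool Name|ReadStage|MutationStage|RequestResponseStage"'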

 

 

our machine configurations are : 2 X 800 GB SSD , 48 cores, 64 GB RAM

compaction throughput is 0 MB/s

concurrent_compactors: 48

flush_writers: 4 

 


I think Jeff is trying to spot a wide row messing with your system, so looking at the max row size on those nodes compared to other is more relevant than average size for this check.

 

I think this is what you are looking for, please correct me if I am wrong

Compacted partition maximum bytes: 1629722

similar value on all 7 nodes.

 

grep -i "ERROR" /var/log/cassandra/system.log

 

there are MUTATION and READ messages dropped in high number on nodes in question and on other 5 nodes it varies between 1-3.

 

On Wed, Mar 2, 2016 at 4:15 PM, Alain RODRIGUEZ <ar...@gmail.com> wrote:

Hi Anishek,

 

Even if it highly depends on your workload, here are my thoughts:

 

`HEAP_NEWSIZE=4G.` is probably far too high (try 1200M <-> 2G)

`MAX_HEAP_SIZE=6G` might be too low, how much memory is available (You might want to keep this as it or even reduce it if you have less than 16 GB of native memory. Go with 8 GB if you have a lot of memory.

`-XX:MaxTenuringThreshold=50` is the highest value I have seen in use so far. I had luck with values between 4 <--> 16 in the past. I would give  a try with 15.

`-XX:CMSInitiatingOccupancyFraction=70`--> Why not using default - 75 ? Using default and then tune from there to improve things is generally a good idea.

 

You also use a bunch of option I don't know about, if you are uncertain about them, you could try a default conf without the options you added and just the using the changes above from default https://github.com/apache/cassandra/blob/cassandra-2.0/conf/cassandra-env.sh. Or you might find more useful information on a nice reference about this topic which is Al Tobey's blog post about tuning 2.1. Go to the 'Java Virtual Machine' part: https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html

 

FWIW, I also saw improvement in the past by upgrading to 2.1, Java 8 and G1GC. G1GC is supposed to be easier to configure too.

 

the average row size for compacted partitions is about 1640 bytes on all nodes. We have replication factor 3 but the problem is only on two nodes. 

 

I think Jeff is trying to spot a wide row messing with your system, so looking at the max row size on those nodes compared to other is more relevant than average size for this check.

 

the only other thing that stands out in cfstats is the read time and write time on the nodes with high GC is 5-7 times higher than other 5 nodes, but i think thats expected.

 

I would probably look at this the reverse way: I imagine that extra GC  is a consequence of something going wrong on those nodes as JVM / GC are configured the same way cluster-wide. GC / JVM issues are often due to Cassandra / system / hardware issues, inducing extra pressure on the JVM. I would try to tune JVM / GC only once the system is healthy. So I often saw high GC being a consequence rather than the root cause of an issue.

 

To explore this possibility:

 

Does this command show some dropped or blocked tasks? This would add pressure to heap.

nodetool tpstats

 

Do you have errors in logs? Always good to know when facing an issue.

grep -i "ERROR" /var/log/cassandra/system.log

 

How are compactions tuned (throughput + concurrent compactors)? This tuning might explain compactions not keeping up or a high GC pressure.

 

What are your disks / CPU? To help us giving you good arbitrary values to try.

 

Is there some iowait ? Could point to a bottleneck or bad hardware.

iostat -mx 5 100

 

...

 

Hope one of those will point you to an issue, but there are many more thing you could check.

 

Let us know how it goes,

 

C*heers,

-----------------------

Alain Rodriguez - alain@thelastpickle.com

France

 

The Last Pickle - Apache Cassandra Consulting

http://www.thelastpickle.com

 

 

 

2016-03-02 10:33 GMT+01:00 Anishek Agarwal <an...@gmail.com>:

also MAX_HEAP_SIZE=6G and HEAP_NEWSIZE=4G.

 

On Wed, Mar 2, 2016 at 1:40 PM, Anishek Agarwal <an...@gmail.com> wrote:

Hey Jeff,

 

one of the nodes with high GC has 1400 SST tables, all other nodes have about 500-900 SST tables. the other node with high GC has 636 SST tables.

 

the average row size for compacted partitions is about 1640 bytes on all nodes. We have replication factor 3 but the problem is only on two nodes. 

the only other thing that stands out in cfstats is the read time and write time on the nodes with high GC is 5-7 times higher than other 5 nodes, but i think thats expected.

 

thanks

anishek

 

 

 

 

On Wed, Mar 2, 2016 at 1:09 PM, Jeff Jirsa <je...@crowdstrike.com> wrote:

Compaction falling behind will likely cause additional work on reads (more sstables to merge), but I’d be surprised if it manifested in super long GC. When you say twice as many sstables, how many is that?. 

 

In cfstats, does anything stand out? Is max row size on those nodes larger than on other nodes?

 

What you don’t show in your JVM options is the new gen size – if you do have unusually large partitions on those two nodes (especially likely if you have rf=2 – if you have rf=3, then there’s probably a third node misbehaving you haven’t found yet), then raising new gen size can help handle the garbage created by reading large partitions without having to tolerate the promotion. Estimates for the amount of garbage vary, but it could be “gigabytes” of garbage on a very wide partition (see https://issues.apache.org/jira/browse/CASSANDRA-9754 for work in progress to help mitigate that type of pain).
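For the wide-partition check described here, the per-node numbers come straight from nodetool (the keyspace and table names below are placeholders):

nodetool cfstats | grep -i 'Compacted partition'      # min / mean / max partition bytes per table
nodetool cfhistograms my_keyspace my_table            # percentile view of partition sizes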

 

- Jeff 

 

From: Anishek Agarwal
Reply-To: "user@cassandra.apache.org"
Date: Tuesday, March 1, 2016 at 11:12 PM
To: "user@cassandra.apache.org"
Subject: Lot of GC on two nodes out of 7

 

Hello, 

 

we have a cassandra cluster of 7 nodes, all of them have the same JVM GC configurations, all our writes /  reads use the TokenAware Policy wrapping a DCAware policy. All nodes are part of same Datacenter.

 

We are seeing that two nodes are having high GC collection times. Then mostly seem to spend time in GC like about 300-600 ms. This also seems to result in higher CPU utilisation on these machines. Other  5 nodes don't have this problem.

 

There is no additional repair activity going on the cluster, we are not sure why this is happening. 

we checked cfhistograms on the two CF we have in the cluster and number of reads seems to be almost same. 

 

we also used cfstats to see the number of ssttables on each node and one of the nodes with the above problem has twice the number of ssttables than other nodes. This still doesnot explain why two nodes have high GC Overheads. our GC config is as below:

JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"

JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"

JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"

JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"

JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=50"

JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=70"

JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"

JVM_OPTS="$JVM_OPTS -XX:+UseTLAB"

JVM_OPTS="$JVM_OPTS -XX:MaxPermSize=256m"

JVM_OPTS="$JVM_OPTS -XX:+AggressiveOpts"

JVM_OPTS="$JVM_OPTS -XX:+UseCompressedOops"

JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"

JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=48"

JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=48"

JVM_OPTS="$JVM_OPTS -XX:-ExplicitGCInvokesConcurrent"

JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"

JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"

JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"

# earlier value 131072 = 32768 * 4

JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=131072"

JVM_OPTS="$JVM_OPTS -XX:CMSScheduleRemarkEdenSizeThreshold=104857600"

JVM_OPTS="$JVM_OPTS -XX:CMSRescanMultiple=32768"

JVM_OPTS="$JVM_OPTS -XX:CMSConcMarkMultiple=32768"

#new 

JVM_OPTS="$JVM_OPTS -XX:+CMSConcurrentMTEnabled"

 

We are using cassandra 2.0.17. If anyone has any suggestion as to how what else we can look for to understand why this is happening please do reply. 

 

 

 

Thanks

anishek

 

 

 

 

 

 

 




Re: Lot of GC on two nodes out of 7

Posted by Anishek Agarwal <an...@gmail.com>.
Hello,

Bryan, most of the partition sizes are under 45 KB

I have tried concurrent_compactors: 8 on one of the nodes, still no
improvement,
I have tried MAX_HEAP_SIZE: 8G, no improvement.

I will try the new heap size of 2G, though I am sure CMS will take longer then.

Also, it doesn't look like I mentioned what type of GC was causing the
problems. On both nodes it's ParNew GC that's taking long on each run,
and too many runs are happening in succession.

anishek


On Fri, Mar 4, 2016 at 5:36 AM, Bryan Cheng <br...@blockcypher.com> wrote:

> Hi Anishek,
>
> In addition to the good advice others have given, do you notice any
> abnormally large partitions? What does cfhistograms report for 99%
> partition size? A few huge partitions will cause very disproportionate load
> on your cluster, including high GC.
>
> --Bryan
>
> On Wed, Mar 2, 2016 at 9:28 AM, Amit Singh F <am...@ericsson.com>
> wrote:
>
>> Hi Anishek,
>>
>>
>>
>> We too faced similar problem in 2.0.14 and after doing some research we
>> config few parameters in Cassandra.yaml and was able to overcome GC pauses
>> . Those are :
>>
>>
>>
>> ·         memtable_flush_writers : increased from 1 to 3 as from tpstats
>> output  we can see mutations dropped so it means writes are getting
>> blocked, so increasing number will have those catered.
>>
>> ·         memtable_total_space_in_mb : Default (1/4 of heap size), can
>> lowered because larger long lived objects will create pressure on HEAP, so
>> its better to reduce some amount of size.
>>
>> ·         Concurrent_compactors : Alain righlty pointed out this i.e
>> reduce it to 8. You need to try this.
>>
>>
>>
>> Also please check whether you have mutations drop in other nodes or not.
>>
>>
>>
>> Hope this helps in your cluster too.
>>
>>
>>
>> Regards
>>
>> Amit Singh
>>
>> *From:* Jonathan Haddad [mailto:jon@jonhaddad.com]
>> *Sent:* Wednesday, March 02, 2016 9:33 PM
>> *To:* user@cassandra.apache.org
>> *Subject:* Re: Lot of GC on two nodes out of 7
>>
>>
>>
>> Can you post a gist of the output of jstat -gccause (60 seconds worth)?
>> I think it's cool you're willing to experiment with alternative JVM
>> settings but I've never seen anyone use max tenuring threshold of 50 either
>> and I can't imagine it's helpful.  Keep in mind if your objects are
>> actually reaching that threshold it means they've been copied 50x (really
>> really slow) and also you're going to end up spilling your eden objects
>> directly into your old gen if your survivor is full.  Considering the small
>> amount of memory you're using for heap I'm really not surprised you're
>> running into problems.
>>
>>
>>
>> I recommend G1GC + 12GB heap and just let it optimize itself for almost
>> all cases with the latest JVM versions.
>>
>>
>>
>> On Wed, Mar 2, 2016 at 6:08 AM Alain RODRIGUEZ <ar...@gmail.com>
>> wrote:
>>
>> It looks like you are doing a good work with this cluster and know a lot
>> about JVM, that's good :-).
>>
>>
>>
>> our machine configurations are : 2 X 800 GB SSD , 48 cores, 64 GB RAM
>>
>>
>>
>> That's good hardware too.
>>
>>
>>
>> With 64 GB of ram I would probably directly give a try to
>> `MAX_HEAP_SIZE=8G` on one of the 2 bad nodes probably.
>>
>>
>>
>> Also I would also probably try lowering `HEAP_NEWSIZE=2G.` and using
>> `-XX:MaxTenuringThreshold=15`, still on the canary node to observe the
>> effects. But that's just an idea of something I would try to see the
>> impacts, I don't think it will solve your current issues or even make it
>> worse for this node.
>>
>>
>>
>> Using G1GC would allow you to use a bigger Heap size. Using C*2.1 would
>> allow you to store the memtables off-heap. Those are 2 improvements
>> reducing the heap pressure that you might be interested in.
>>
>>
>>
>> I have spent time reading about all other options before including them
>> and a similar configuration on our other prod cluster is showing good GC
>> graphs via gcviewer.
>>
>>
>>
>> So, let's look for an other reason.
>>
>>
>>
>> there are MUTATION and READ messages dropped in high number on nodes in
>> question and on other 5 nodes it varies between 1-3.
>>
>>
>>
>> - Is Memory, CPU or disk a bottleneck? Is one of those running at the
>> limits?
>>
>>
>>
>> concurrent_compactors: 48
>>
>>
>>
>> Reducing this to 8 would free some space for transactions (R&W requests).
>> It is probably worth a try, even more when compaction is not keeping up and
>> compaction throughput is not throttled.
>>
>>
>>
>> Just found an issue about that:
>> https://issues.apache.org/jira/browse/CASSANDRA-7139
>>
>>
>>
>> Looks like `concurrent_compactors: 8` is the new default.
>>
>>
>>
>> C*heers,
>>
>> -----------------------
>>
>> Alain Rodriguez - alain@thelastpickle.com
>>
>> France
>>
>>
>>
>> The Last Pickle - Apache Cassandra Consulting
>>
>> http://www.thelastpickle.com
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> 2016-03-02 12:27 GMT+01:00 Anishek Agarwal <an...@gmail.com>:
>>
>> Thanks a lot Alian for the details.
>>
>> `HEAP_NEWSIZE=4G.` is probably far too high (try 1200M <-> 2G)
>> `MAX_HEAP_SIZE=6G` might be too low, how much memory is available (You
>> might want to keep this as it or even reduce it if you have less than 16 GB
>> of native memory. Go with 8 GB if you have a lot of memory.
>> `-XX:MaxTenuringThreshold=50` is the highest value I have seen in use so
>> far. I had luck with values between 4 <--> 16 in the past. I would give  a
>> try with 15.
>> `-XX:CMSInitiatingOccupancyFraction=70`--> Why not using default - 75 ?
>> Using default and then tune from there to improve things is generally a
>> good idea.
>>
>>
>>
>>
>>
>> we have a lot of reads and writes onto the system so keeping the high new
>> size to make sure enough is held in memory including caches / memtables etc
>> --number of flush_writers : 4 for us. similarly keeping less in old
>> generation to make sure we spend less time with CMS GC most of the data is
>> transient in memory for us. Keeping high TenuringThreshold because we don't
>> want objects going to old generation and just die in young generation given
>> we have configured large survivor spaces.
>>
>> using occupancyFraction as 70 since
>>
>> given heap is 4G
>>
>> survivor space is : 400 mb -- 2 survivor spaces
>>
>> 70 % of 2G (old generation) = 1.4G
>>
>>
>>
>> so once we are just below 1.4G and we have to move the full survivor +
>> some extra during a par new gc due to promotion failure, everything will
>> fit in old generation, and will trigger CMS.
>>
>>
>>
>> I have spent time reading about all other options before including them
>> and a similar configuration on our other prod cluster is showing good GC
>> graphs via gcviewer.
>>
>>
>>
>> tp stats on all machines show flush writer blocked at : 0.3% of total
>>
>>
>>
>> the two nodes in question have stats almost as below
>>
>>    - specifically, there are pending tasks in ReadStage, MutationStage and RequestResponseStage
>>
>> Pool Name                    Active   Pending      Completed   Blocked  All time blocked
>>
>> ReadStage                        21        19     2141798645         0                 0
>>
>> RequestResponseStage              0         1      803242391         0                 0
>>
>> MutationStage                     0         0      291813703         0                 0
>>
>> ReadRepairStage                   0         0      200544344         0                 0
>>
>> ReplicateOnWriteStage             0         0              0         0                 0
>>
>> GossipStage                       0         0         292477         0                 0
>>
>> CacheCleanupExecutor              0         0              0         0                 0
>>
>> MigrationStage                    0         0              0         0                 0
>>
>> MemoryMeter                       0         0           2172         0                 0
>>
>> FlushWriter                       0         0           2756         0                 6
>>
>> ValidationExecutor                0         0            101         0                 0
>>
>> InternalResponseStage             0         0              0         0                 0
>>
>> AntiEntropyStage                  0         0            202         0                 0
>>
>> MemtablePostFlusher               0         0           4395         0                 0
>>
>> MiscStage                         0         0              0         0                 0
>>
>> PendingRangeCalculator            0         0             20         0                 0
>>
>> CompactionExecutor                4         4          49323         0                 0
>>
>> commitlog_archiver                0         0              0         0                 0
>>
>> HintedHandoff                     0         0            116         0                 0
>>
>>
>>
>> Message type           Dropped
>>
>> RANGE_SLICE                  0
>>
>> READ_REPAIR                 36
>>
>> PAGED_RANGE                  0
>>
>> BINARY                       0
>>
>> READ                     11471
>>
>> MUTATION                   898
>>
>> _TRACE                       0
>>
>> REQUEST_RESPONSE             0
>>
>> COUNTER_MUTATION             0
>>
>>
>>
>> all the other 5 nodes show no pending numbers.
>>
>>
>>
>>
>>
>> our machine configurations are : 2 X 800 GB SSD , 48 cores, 64 GB RAM
>>
>> compaction throughput is 0 MB/s
>>
>> concurrent_compactors: 48
>>
>> flush_writers: 4
>>
>>
>>
>>
>> I think Jeff is trying to spot a wide row messing with your system, so
>> looking at the max row size on those nodes compared to other is more
>> relevant than average size for this check.
>>
>>
>>
>> i think is what you are looking for, please correct me if i am wrong
>>
>> Compacted partition maximum bytes: 1629722
>>
>> similar value on all 7 nodes.
>>
>>
>>
>> grep -i "ERROR" /var/log/cassandra/system.log
>>
>>
>>
>> there are MUTATION and READ messages dropped in high number on nodes in
>> question and on other 5 nodes it varies between 1-3.
>>
>>
>>
>> On Wed, Mar 2, 2016 at 4:15 PM, Alain RODRIGUEZ <ar...@gmail.com>
>> wrote:
>>
>> Hi Anishek,
>>
>>
>>
>> Even if it highly depends on your workload, here are my thoughts:
>>
>>
>>
>> `HEAP_NEWSIZE=4G.` is probably far too high (try 1200M <-> 2G)
>>
>> `MAX_HEAP_SIZE=6G` might be too low, how much memory is available (You
>> might want to keep this as it or even reduce it if you have less than 16 GB
>> of native memory. Go with 8 GB if you have a lot of memory.
>>
>> `-XX:MaxTenuringThreshold=50` is the highest value I have seen in use so
>> far. I had luck with values between 4 <--> 16 in the past. I would give  a
>> try with 15.
>>
>> `-XX:CMSInitiatingOccupancyFraction=70`--> Why not using default - 75 ?
>> Using default and then tune from there to improve things is generally a
>> good idea.
>>
>>
>>
>> You also use a bunch of option I don't know about, if you are uncertain
>> about them, you could try a default conf without the options you added and
>> just the using the changes above from default
>> https://github.com/apache/cassandra/blob/cassandra-2.0/conf/cassandra-env.sh.
>> Or you might find more useful information on a nice reference about this
>> topic which is Al Tobey's blog post about tuning 2.1. Go to the 'Java
>> Virtual Machine' part:
>> https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html
>>
>>
>>
>> FWIW, I also saw improvement in the past by upgrading to 2.1, Java 8 and
>> G1GC. G1GC is supposed to be easier to configure too.
>>
>>
>>
>> the average row size for compacted partitions is about 1640 bytes on all
>> nodes. We have replication factor 3 but the problem is only on two nodes.
>>
>>
>>
>> I think Jeff is trying to spot a wide row messing with your system, so
>> looking at the max row size on those nodes compared to other is more
>> relevant than average size for this check.
>>
>>
>>
>> the only other thing that stands out in cfstats is the read time and
>> write time on the nodes with high GC is 5-7 times higher than other 5
>> nodes, but i think thats expected.
>>
>>
>>
>> I would probably look at this the reverse way: I imagine that extra GC
>>  is a consequence of something going wrong on those nodes as JVM / GC are
>> configured the same way cluster-wide. GC / JVM issues are often due to
>> Cassandra / system / hardware issues, inducing extra pressure on the JVM. I
>> would try to tune JVM / GC only once the system is healthy. So I often saw
>> high GC being a consequence rather than the root cause of an issue.
>>
>>
>>
>> To explore this possibility:
>>
>>
>>
>> Does this command show some dropped or blocked tasks? This would add
>> pressure to heap.
>>
>> nodetool tpstats
>>
>>
>>
>> Do you have errors in logs? Always good to know when facing an issue.
>>
>> grep -i "ERROR" /var/log/cassandra/system.log
>>
>>
>>
>> How are compactions tuned (throughput + concurrent compactors)? This
>> tuning might explain compactions not keeping up or a high GC pressure.
>>
>>
>>
>> What are your disks / CPU? To help us giving you good arbitrary values to
>> try.
>>
>>
>>
>> Is there some iowait ? Could point to a bottleneck or bad hardware.
>>
>> iostat -mx 5 100
>>
>>
>>
>> ...
>>
>>
>>
>> Hope one of those will point you to an issue, but there are many more
>> thing you could check.
>>
>>
>>
>> Let us know how it goes,
>>
>>
>>
>> C*heers,
>>
>> -----------------------
>>
>> Alain Rodriguez - alain@thelastpickle.com
>>
>> France
>>
>>
>>
>> The Last Pickle - Apache Cassandra Consulting
>>
>> http://www.thelastpickle.com
>>
>>
>>
>>
>>
>>
>>
>> 2016-03-02 10:33 GMT+01:00 Anishek Agarwal <an...@gmail.com>:
>>
>> also MAX_HEAP_SIZE=6G and HEAP_NEWSIZE=4G.
>>
>>
>>
>> On Wed, Mar 2, 2016 at 1:40 PM, Anishek Agarwal <an...@gmail.com>
>> wrote:
>>
>> Hey Jeff,
>>
>>
>>
>> one of the nodes with high GC has 1400 SST tables, all other nodes have
>> about 500-900 SST tables. the other node with high GC has 636 SST tables.
>>
>>
>>
>> the average row size for compacted partitions is about 1640 bytes on all
>> nodes. We have replication factor 3 but the problem is only on two nodes.
>>
>> the only other thing that stands out in cfstats is the read time and
>> write time on the nodes with high GC is 5-7 times higher than other 5
>> nodes, but i think thats expected.
>>
>>
>>
>> thanks
>>
>> anishek
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Wed, Mar 2, 2016 at 1:09 PM, Jeff Jirsa <je...@crowdstrike.com>
>> wrote:
>>
>> Compaction falling behind will likely cause additional work on reads
>> (more sstables to merge), but I’d be surprised if it manifested in super
>> long GC. When you say twice as many sstables, how many is that?.
>>
>>
>>
>> In cfstats, does anything stand out? Is max row size on those nodes
>> larger than on other nodes?
>>
>>
>>
>> What you don’t show in your JVM options is the new gen size – if you do
>> have unusually large partitions on those two nodes (especially likely if
>> you have rf=2 – if you have rf=3, then there’s probably a third node
>> misbehaving you haven’t found yet), then raising new gen size can help
>> handle the garbage created by reading large partitions without having to
>> tolerate the promotion. Estimates for the amount of garbage vary, but it
>> could be “gigabytes” of garbage on a very wide partition (see
>> https://issues.apache.org/jira/browse/CASSANDRA-9754 for work in
>> progress to help mitigate that type of pain).
>>
>>
>>
>> - Jeff
>>
>>
>>
>> *From: *Anishek Agarwal
>> *Reply-To: *"user@cassandra.apache.org"
>> *Date: *Tuesday, March 1, 2016 at 11:12 PM
>> *To: *"user@cassandra.apache.org"
>> *Subject: *Lot of GC on two nodes out of 7
>>
>>
>>
>> Hello,
>>
>>
>>
>> we have a cassandra cluster of 7 nodes, all of them have the same JVM GC
>> configurations, all our writes /  reads use the TokenAware Policy wrapping
>> a DCAware policy. All nodes are part of same Datacenter.
>>
>>
>>
>> We are seeing that two nodes are having high GC collection times. Then
>> mostly seem to spend time in GC like about 300-600 ms. This also seems to
>> result in higher CPU utilisation on these machines. Other  5 nodes don't
>> have this problem.
>>
>>
>>
>> There is no additional repair activity going on the cluster, we are not
>> sure why this is happening.
>>
>> we checked cfhistograms on the two CF we have in the cluster and number
>> of reads seems to be almost same.
>>
>>
>>
>> we also used cfstats to see the number of ssttables on each node and one
>> of the nodes with the above problem has twice the number of ssttables than
>> other nodes. This still doesnot explain why two nodes have high GC
>> Overheads. our GC config is as below:
>>
>> JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
>>
>> JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
>>
>> JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
>>
>> JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
>>
>> JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=50"
>>
>> JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=70"
>>
>> JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
>>
>> JVM_OPTS="$JVM_OPTS -XX:+UseTLAB"
>>
>> JVM_OPTS="$JVM_OPTS -XX:MaxPermSize=256m"
>>
>> JVM_OPTS="$JVM_OPTS -XX:+AggressiveOpts"
>>
>> JVM_OPTS="$JVM_OPTS -XX:+UseCompressedOops"
>>
>> JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"
>>
>> JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=48"
>>
>> JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=48"
>>
>> JVM_OPTS="$JVM_OPTS -XX:-ExplicitGCInvokesConcurrent"
>>
>> JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
>>
>> JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"
>>
>> JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"
>>
>> # earlier value 131072 = 32768 * 4
>>
>> JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=131072"
>>
>> JVM_OPTS="$JVM_OPTS -XX:CMSScheduleRemarkEdenSizeThreshold=104857600"
>>
>> JVM_OPTS="$JVM_OPTS -XX:CMSRescanMultiple=32768"
>>
>> JVM_OPTS="$JVM_OPTS -XX:CMSConcMarkMultiple=32768"
>>
>> #new
>>
>> JVM_OPTS="$JVM_OPTS -XX:+CMSConcurrentMTEnabled"
>>
>>
>>
>> We are using cassandra 2.0.17. If anyone has any suggestion as to how
>> what else we can look for to understand why this is happening please do
>> reply.
>>
>>
>>
>>
>>
>>
>>
>> Thanks
>>
>> anishek
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>

Re: Lot of GC on two nodes out of 7

Posted by Bryan Cheng <br...@blockcypher.com>.
Hi Anishek,

In addition to the good advice others have given, do you notice any
abnormally large partitions? What does cfhistograms report for 99%
partition size? A few huge partitions will cause very disproportionate load
on your cluster, including high GC.

--Bryan

On Wed, Mar 2, 2016 at 9:28 AM, Amit Singh F <am...@ericsson.com>
wrote:

> Hi Anishek,
>
>
>
> We too faced similar problem in 2.0.14 and after doing some research we
> config few parameters in Cassandra.yaml and was able to overcome GC pauses
> . Those are :
>
>
>
> ·         memtable_flush_writers : increased from 1 to 3 as from tpstats
> output  we can see mutations dropped so it means writes are getting
> blocked, so increasing number will have those catered.
>
> ·         memtable_total_space_in_mb : Default (1/4 of heap size), can
> lowered because larger long lived objects will create pressure on HEAP, so
> its better to reduce some amount of size.
>
> ·         Concurrent_compactors : Alain righlty pointed out this i.e
> reduce it to 8. You need to try this.
>
>
>
> Also please check whether you have mutations drop in other nodes or not.
>
>
>
> Hope this helps in your cluster too.
>
>
>
> Regards
>
> Amit Singh
>
> *From:* Jonathan Haddad [mailto:jon@jonhaddad.com]
> *Sent:* Wednesday, March 02, 2016 9:33 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Lot of GC on two nodes out of 7
>
>
>
> Can you post a gist of the output of jstat -gccause (60 seconds worth)?  I
> think it's cool you're willing to experiment with alternative JVM settings
> but I've never seen anyone use max tenuring threshold of 50 either and I
> can't imagine it's helpful.  Keep in mind if your objects are actually
> reaching that threshold it means they've been copied 50x (really really
> slow) and also you're going to end up spilling your eden objects directly
> into your old gen if your survivor is full.  Considering the small amount
> of memory you're using for heap I'm really not surprised you're running
> into problems.
>
>
>
> I recommend G1GC + 12GB heap and just let it optimize itself for almost
> all cases with the latest JVM versions.
>
>
>
> On Wed, Mar 2, 2016 at 6:08 AM Alain RODRIGUEZ <ar...@gmail.com> wrote:
>
> It looks like you are doing a good work with this cluster and know a lot
> about JVM, that's good :-).
>
>
>
> our machine configurations are : 2 X 800 GB SSD , 48 cores, 64 GB RAM
>
>
>
> That's good hardware too.
>
>
>
> With 64 GB of ram I would probably directly give a try to
> `MAX_HEAP_SIZE=8G` on one of the 2 bad nodes probably.
>
>
>
> Also I would also probably try lowering `HEAP_NEWSIZE=2G.` and using
> `-XX:MaxTenuringThreshold=15`, still on the canary node to observe the
> effects. But that's just an idea of something I would try to see the
> impacts, I don't think it will solve your current issues or even make it
> worse for this node.
>
>
>
> Using G1GC would allow you to use a bigger Heap size. Using C*2.1 would
> allow you to store the memtables off-heap. Those are 2 improvements
> reducing the heap pressure that you might be interested in.
>
>
>
> I have spent time reading about all other options before including them
> and a similar configuration on our other prod cluster is showing good GC
> graphs via gcviewer.
>
>
>
> So, let's look for an other reason.
>
>
>
> there are MUTATION and READ messages dropped in high number on nodes in
> question and on other 5 nodes it varies between 1-3.
>
>
>
> - Is Memory, CPU or disk a bottleneck? Is one of those running at the
> limits?
>
>
>
> concurrent_compactors: 48
>
>
>
> Reducing this to 8 would free some space for transactions (R&W requests).
> It is probably worth a try, even more when compaction is not keeping up and
> compaction throughput is not throttled.
>
>
>
> Just found an issue about that:
> https://issues.apache.org/jira/browse/CASSANDRA-7139
>
>
>
> Looks like `concurrent_compactors: 8` is the new default.
>
>
>
> C*heers,
>
> -----------------------
>
> Alain Rodriguez - alain@thelastpickle.com
>
> France
>
>
>
> The Last Pickle - Apache Cassandra Consulting
>
> http://www.thelastpickle.com
>
>
>
>
>
>
>
>
>
>
>
>
>
> 2016-03-02 12:27 GMT+01:00 Anishek Agarwal <an...@gmail.com>:
>
> Thanks a lot Alian for the details.
>
> `HEAP_NEWSIZE=4G.` is probably far too high (try 1200M <-> 2G)
> `MAX_HEAP_SIZE=6G` might be too low, how much memory is available (You
> might want to keep this as it or even reduce it if you have less than 16 GB
> of native memory. Go with 8 GB if you have a lot of memory.
> `-XX:MaxTenuringThreshold=50` is the highest value I have seen in use so
> far. I had luck with values between 4 <--> 16 in the past. I would give  a
> try with 15.
> `-XX:CMSInitiatingOccupancyFraction=70`--> Why not using default - 75 ?
> Using default and then tune from there to improve things is generally a
> good idea.
>
>
>
>
>
> we have a lot of reads and writes onto the system so keeping the high new
> size to make sure enough is held in memory including caches / memtables etc
> --number of flush_writers : 4 for us. similarly keeping less in old
> generation to make sure we spend less time with CMS GC most of the data is
> transient in memory for us. Keeping high TenuringThreshold because we don't
> want objects going to old generation and just die in young generation given
> we have configured large survivor spaces.
>
> using occupancyFraction as 70 since
>
> given heap is 4G
>
> survivor space is : 400 mb -- 2 survivor spaces
>
> 70 % of 2G (old generation) = 1.4G
>
>
>
> so once we are just below 1.4G and we have to move the full survivor +
> some extra during a par new gc due to promotion failure, everything will
> fit in old generation, and will trigger CMS.
>
>
>
> I have spent time reading about all other options before including them
> and a similar configuration on our other prod cluster is showing good GC
> graphs via gcviewer.
>
>
>
> tp stats on all machines show flush writer blocked at : 0.3% of total
>
>
>
> the two nodes in question have stats almost as below
>
>    - specifically, there are pending tasks in ReadStage, MutationStage and RequestResponseStage
>
> Pool Name                    Active   Pending      Completed   Blocked  All time blocked
>
> ReadStage                        21        19     2141798645         0                 0
>
> RequestResponseStage              0         1      803242391         0                 0
>
> MutationStage                     0         0      291813703         0                 0
>
> ReadRepairStage                   0         0      200544344         0                 0
>
> ReplicateOnWriteStage             0         0              0         0                 0
>
> GossipStage                       0         0         292477         0                 0
>
> CacheCleanupExecutor              0         0              0         0                 0
>
> MigrationStage                    0         0              0         0                 0
>
> MemoryMeter                       0         0           2172         0                 0
>
> FlushWriter                       0         0           2756         0                 6
>
> ValidationExecutor                0         0            101         0                 0
>
> InternalResponseStage             0         0              0         0                 0
>
> AntiEntropyStage                  0         0            202         0                 0
>
> MemtablePostFlusher               0         0           4395         0                 0
>
> MiscStage                         0         0              0         0                 0
>
> PendingRangeCalculator            0         0             20         0                 0
>
> CompactionExecutor                4         4          49323         0                 0
>
> commitlog_archiver                0         0              0         0                 0
>
> HintedHandoff                     0         0            116         0                 0
>
>
>
> Message type           Dropped
>
> RANGE_SLICE                  0
>
> READ_REPAIR                 36
>
> PAGED_RANGE                  0
>
> BINARY                       0
>
> READ                     11471
>
> MUTATION                   898
>
> _TRACE                       0
>
> REQUEST_RESPONSE             0
>
> COUNTER_MUTATION             0
>
>
>
> all the other 5 nodes show no pending numbers.
>
>
>
>
>
> our machine configurations are : 2 X 800 GB SSD , 48 cores, 64 GB RAM
>
> compaction throughput is 0 MB/s
>
> concurrent_compactors: 48
>
> flush_writers: 4
>
>
>
>
> I think Jeff is trying to spot a wide row messing with your system, so
> looking at the max row size on those nodes compared to other is more
> relevant than average size for this check.
>
>
>
> i think is what you are looking for, please correct me if i am wrong
>
> Compacted partition maximum bytes: 1629722
>
> similar value on all 7 nodes.
>
>
>
> grep -i "ERROR" /var/log/cassandra/system.log
>
>
>
> there are MUTATION and READ messages dropped in high number on nodes in
> question and on other 5 nodes it varies between 1-3.
>
>
>
> On Wed, Mar 2, 2016 at 4:15 PM, Alain RODRIGUEZ <ar...@gmail.com>
> wrote:
>
> Hi Anishek,
>
>
>
> Even if it highly depends on your workload, here are my thoughts:
>
>
>
> `HEAP_NEWSIZE=4G.` is probably far too high (try 1200M <-> 2G)
>
> `MAX_HEAP_SIZE=6G` might be too low, how much memory is available (You
> might want to keep this as it or even reduce it if you have less than 16 GB
> of native memory. Go with 8 GB if you have a lot of memory.
>
> `-XX:MaxTenuringThreshold=50` is the highest value I have seen in use so
> far. I had luck with values between 4 <--> 16 in the past. I would give  a
> try with 15.
>
> `-XX:CMSInitiatingOccupancyFraction=70`--> Why not using default - 75 ?
> Using default and then tune from there to improve things is generally a
> good idea.
>
>
>
> You also use a bunch of option I don't know about, if you are uncertain
> about them, you could try a default conf without the options you added and
> just the using the changes above from default
> https://github.com/apache/cassandra/blob/cassandra-2.0/conf/cassandra-env.sh.
> Or you might find more useful information on a nice reference about this
> topic which is Al Tobey's blog post about tuning 2.1. Go to the 'Java
> Virtual Machine' part:
> https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html
>
>
>
> FWIW, I also saw improvement in the past by upgrading to 2.1, Java 8 and
> G1GC. G1GC is supposed to be easier to configure too.
>
>
>
> the average row size for compacted partitions is about 1640 bytes on all
> nodes. We have replication factor 3 but the problem is only on two nodes.
>
>
>
> I think Jeff is trying to spot a wide row messing with your system, so
> looking at the max row size on those nodes compared to other is more
> relevant than average size for this check.
>
>
>
> the only other thing that stands out in cfstats is the read time and write
> time on the nodes with high GC is 5-7 times higher than other 5 nodes, but
> i think thats expected.
>
>
>
> I would probably look at this the reverse way: I imagine that extra GC  is
> a consequence of something going wrong on those nodes as JVM / GC are
> configured the same way cluster-wide. GC / JVM issues are often due to
> Cassandra / system / hardware issues, inducing extra pressure on the JVM. I
> would try to tune JVM / GC only once the system is healthy. So I often saw
> high GC being a consequence rather than the root cause of an issue.
>
>
>
> To explore this possibility:
>
>
>
> Does this command show some dropped or blocked tasks? This would add
> pressure to heap.
>
> nodetool tpstats
>
>
>
> Do you have errors in logs? Always good to know when facing an issue.
>
> grep -i "ERROR" /var/log/cassandra/system.log
>
>
>
> How are compactions tuned (throughput + concurrent compactors)? This
> tuning might explain compactions not keeping up or a high GC pressure.
>
>
>
> What are your disks / CPU? To help us giving you good arbitrary values to
> try.
>
>
>
> Is there some iowait ? Could point to a bottleneck or bad hardware.
>
> iostat -mx 5 100
>
>
>
> ...
>
>
>
> Hope one of those will point you to an issue, but there are many more
> thing you could check.
>
>
>
> Let us know how it goes,
>
>
>
> C*heers,
>
> -----------------------
>
> Alain Rodriguez - alain@thelastpickle.com
>
> France
>
>
>
> The Last Pickle - Apache Cassandra Consulting
>
> http://www.thelastpickle.com
>
>
>
>
>
>
>
> 2016-03-02 10:33 GMT+01:00 Anishek Agarwal <an...@gmail.com>:
>
> also MAX_HEAP_SIZE=6G and HEAP_NEWSIZE=4G.
>
>
>
> On Wed, Mar 2, 2016 at 1:40 PM, Anishek Agarwal <an...@gmail.com> wrote:
>
> Hey Jeff,
>
>
>
> one of the nodes with high GC has 1400 SST tables, all other nodes have
> about 500-900 SST tables. the other node with high GC has 636 SST tables.
>
>
>
> the average row size for compacted partitions is about 1640 bytes on all
> nodes. We have replication factor 3 but the problem is only on two nodes.
>
> the only other thing that stands out in cfstats is the read time and write
> time on the nodes with high GC is 5-7 times higher than other 5 nodes, but
> i think thats expected.
>
>
>
> thanks
>
> anishek
>
>
>
>
>
>
>
>
>
> On Wed, Mar 2, 2016 at 1:09 PM, Jeff Jirsa <je...@crowdstrike.com>
> wrote:
>
> Compaction falling behind will likely cause additional work on reads (more
> sstables to merge), but I’d be surprised if it manifested in super long GC.
> When you say twice as many sstables, how many is that?.
>
>
>
> In cfstats, does anything stand out? Is max row size on those nodes larger
> than on other nodes?
>
>
>
> What you don’t show in your JVM options is the new gen size – if you do
> have unusually large partitions on those two nodes (especially likely if
> you have rf=2 – if you have rf=3, then there’s probably a third node
> misbehaving you haven’t found yet), then raising new gen size can help
> handle the garbage created by reading large partitions without having to
> tolerate the promotion. Estimates for the amount of garbage vary, but it
> could be “gigabytes” of garbage on a very wide partition (see
> https://issues.apache.org/jira/browse/CASSANDRA-9754 for work in progress
> to help mitigate that type of pain).
>
>
>
> - Jeff
>
>
>
> *From: *Anishek Agarwal
> *Reply-To: *"user@cassandra.apache.org"
> *Date: *Tuesday, March 1, 2016 at 11:12 PM
> *To: *"user@cassandra.apache.org"
> *Subject: *Lot of GC on two nodes out of 7
>
>
>
> Hello,
>
>
>
> we have a cassandra cluster of 7 nodes, all of them have the same JVM GC
> configurations, all our writes /  reads use the TokenAware Policy wrapping
> a DCAware policy. All nodes are part of same Datacenter.
>
>
>
> We are seeing that two nodes are having high GC collection times. Then
> mostly seem to spend time in GC like about 300-600 ms. This also seems to
> result in higher CPU utilisation on these machines. Other  5 nodes don't
> have this problem.
>
>
>
> There is no additional repair activity going on the cluster, we are not
> sure why this is happening.
>
> we checked cfhistograms on the two CF we have in the cluster and number of
> reads seems to be almost same.
>
>
>
> we also used cfstats to see the number of ssttables on each node and one
> of the nodes with the above problem has twice the number of ssttables than
> other nodes. This still doesnot explain why two nodes have high GC
> Overheads. our GC config is as below:
>
> JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
>
> JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
>
> JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
>
> JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
>
> JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=50"
>
> JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=70"
>
> JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
>
> JVM_OPTS="$JVM_OPTS -XX:+UseTLAB"
>
> JVM_OPTS="$JVM_OPTS -XX:MaxPermSize=256m"
>
> JVM_OPTS="$JVM_OPTS -XX:+AggressiveOpts"
>
> JVM_OPTS="$JVM_OPTS -XX:+UseCompressedOops"
>
> JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"
>
> JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=48"
>
> JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=48"
>
> JVM_OPTS="$JVM_OPTS -XX:-ExplicitGCInvokesConcurrent"
>
> JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
>
> JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"
>
> JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"
>
> # earlier value 131072 = 32768 * 4
>
> JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=131072"
>
> JVM_OPTS="$JVM_OPTS -XX:CMSScheduleRemarkEdenSizeThreshold=104857600"
>
> JVM_OPTS="$JVM_OPTS -XX:CMSRescanMultiple=32768"
>
> JVM_OPTS="$JVM_OPTS -XX:CMSConcMarkMultiple=32768"
>
> #new
>
> JVM_OPTS="$JVM_OPTS -XX:+CMSConcurrentMTEnabled"
>
>
>
> We are using cassandra 2.0.17. If anyone has any suggestion as to how what
> else we can look for to understand why this is happening please do reply.
>
>
>
>
>
>
>
> Thanks
>
> anishek
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>

RE: Lot of GC on two nodes out of 7

Posted by Amit Singh F <am...@ericsson.com>.
Hi Anishek,

We too faced a similar problem in 2.0.14, and after doing some research we changed a few parameters in cassandra.yaml and were able to overcome the GC pauses. Those are:


·         memtable_flush_writers : increased from 1 to 3. The tpstats output shows dropped mutations, which means writes are getting blocked; more flush writers help absorb them.

·         memtable_total_space_in_mb : defaults to 1/4 of the heap; it can be lowered, because large long-lived memtables put pressure on the heap, so it is better to give them a bit less room.

·         concurrent_compactors : Alain rightly pointed this out, i.e. reduce it to 8. You need to try this.

Also please check whether you have dropped mutations on the other nodes or not.
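
For reference, a minimal sketch of how these cassandra.yaml changes could look; the exact values are illustrative, not a tested recommendation for this hardware:

# cassandra.yaml (illustrative values, tune per workload)
memtable_flush_writers: 3         # up from 1, so flushes keep up with writes
memtable_total_space_in_mb: 1024  # explicit cap instead of the 1/4-of-heap default
concurrent_compactors: 8          # down from 48, leaving CPU for reads/writes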

Hope this helps in your cluster too.

Regards
Amit Singh
From: Jonathan Haddad [mailto:jon@jonhaddad.com]
Sent: Wednesday, March 02, 2016 9:33 PM
To: user@cassandra.apache.org
Subject: Re: Lot of GC on two nodes out of 7

Can you post a gist of the output of jstat -gccause (60 seconds worth)?  I think it's cool you're willing to experiment with alternative JVM settings but I've never seen anyone use max tenuring threshold of 50 either and I can't imagine it's helpful.  Keep in mind if your objects are actually reaching that threshold it means they've been copied 50x (really really slow) and also you're going to end up spilling your eden objects directly into your old gen if your survivor is full.  Considering the small amount of memory you're using for heap I'm really not surprised you're running into problems.

I recommend G1GC + 12GB heap and just let it optimize itself for almost all cases with the latest JVM versions.

On Wed, Mar 2, 2016 at 6:08 AM Alain RODRIGUEZ <ar...@gmail.com>> wrote:
It looks like you are doing a good work with this cluster and know a lot about JVM, that's good :-).

our machine configurations are : 2 X 800 GB SSD , 48 cores, 64 GB RAM

That's good hardware too.

With 64 GB of ram I would probably directly give a try to `MAX_HEAP_SIZE=8G` on one of the 2 bad nodes probably.

Also I would also probably try lowering `HEAP_NEWSIZE=2G.` and using `-XX:MaxTenuringThreshold=15`, still on the canary node to observe the effects. But that's just an idea of something I would try to see the impacts, I don't think it will solve your current issues or even make it worse for this node.

Using G1GC would allow you to use a bigger Heap size. Using C*2.1 would allow you to store the memtables off-heap. Those are 2 improvements reducing the heap pressure that you might be interested in.

I have spent time reading about all other options before including them and a similar configuration on our other prod cluster is showing good GC graphs via gcviewer.

So, let's look for an other reason.

there are MUTATION and READ messages dropped in high number on nodes in question and on other 5 nodes it varies between 1-3.

- Is Memory, CPU or disk a bottleneck? Is one of those running at the limits?

concurrent_compactors: 48

Reducing this to 8 would free some space for transactions (R&W requests). It is probably worth a try, even more when compaction is not keeping up and compaction throughput is not throttled.

Just found an issue about that: https://issues.apache.org/jira/browse/CASSANDRA-7139

Looks like `concurrent_compactors: 8` is the new default.

C*heers,
-----------------------
Alain Rodriguez - alain@thelastpickle.com<ma...@thelastpickle.com>
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com






2016-03-02 12:27 GMT+01:00 Anishek Agarwal <an...@gmail.com>>:
Thanks a lot Alian for the details.
`HEAP_NEWSIZE=4G.` is probably far too high (try 1200M <-> 2G)
`MAX_HEAP_SIZE=6G` might be too low, how much memory is available (You might want to keep this as it or even reduce it if you have less than 16 GB of native memory. Go with 8 GB if you have a lot of memory.
`-XX:MaxTenuringThreshold=50` is the highest value I have seen in use so far. I had luck with values between 4 <--> 16 in the past. I would give  a try with 15.
`-XX:CMSInitiatingOccupancyFraction=70`--> Why not using default - 75 ? Using default and then tune from there to improve things is generally a good idea.


we have a lot of reads and writes onto the system so keeping the high new size to make sure enough is held in memory including caches / memtables etc --number of flush_writers : 4 for us. similarly keeping less in old generation to make sure we spend less time with CMS GC most of the data is transient in memory for us. Keeping high TenuringThreshold because we don't want objects going to old generation and just die in young generation given we have configured large survivor spaces.
using occupancyFraction as 70 since
given heap is 4G
survivor space is : 400 mb -- 2 survivor spaces
70 % of 2G (old generation) = 1.4G

so once we are just below 1.4G and we have to move the full survivor + some extra during a par new gc due to promotion failure, everything will fit in old generation, and will trigger CMS.

I have spent time reading about all other options before including them and a similar configuration on our other prod cluster is showing good GC graphs via gcviewer.

tp stats on all machines show flush writer blocked at : 0.3% of total

the two nodes in question have stats almost as below

  *   specifically there are pending was in readStage, MutationStage and RequestResponseStage

Pool Name                    Active   Pending      Completed   Blocked  All time blocked

ReadStage                        21        19     2141798645         0                 0

RequestResponseStage              0         1      803242391         0                 0

MutationStage                     0         0      291813703         0                 0

ReadRepairStage                   0         0      200544344         0                 0

ReplicateOnWriteStage             0         0              0         0                 0

GossipStage                       0         0         292477         0                 0

CacheCleanupExecutor              0         0              0         0                 0

MigrationStage                    0         0              0         0                 0

MemoryMeter                       0         0           2172         0                 0

FlushWriter                       0         0           2756         0                 6

ValidationExecutor                0         0            101         0                 0

InternalResponseStage             0         0              0         0                 0

AntiEntropyStage                  0         0            202         0                 0

MemtablePostFlusher               0         0           4395         0                 0

MiscStage                         0         0              0         0                 0

PendingRangeCalculator            0         0             20         0                 0

CompactionExecutor                4         4          49323         0                 0

commitlog_archiver                0         0              0         0                 0

HintedHandoff                     0         0            116         0                 0



Message type           Dropped

RANGE_SLICE                  0

READ_REPAIR                 36

PAGED_RANGE                  0

BINARY                       0

READ                     11471

MUTATION                   898

_TRACE                       0

REQUEST_RESPONSE             0

COUNTER_MUTATION             0

all the other 5 nodes show no pending numbers.


our machine configurations are : 2 X 800 GB SSD , 48 cores, 64 GB RAM
compaction throughput is 0 MB/s
concurrent_compactors: 48
flush_writers: 4


I think Jeff is trying to spot a wide row messing with your system, so looking at the max row size on those nodes compared to other is more relevant than average size for this check.

i think is what you are looking for, please correct me if i am wrong

Compacted partition maximum bytes: 1629722
similar value on all 7 nodes.

grep -i "ERROR" /var/log/cassandra/system.log

there are MUTATION and READ messages dropped in high number on nodes in question and on other 5 nodes it varies between 1-3.

On Wed, Mar 2, 2016 at 4:15 PM, Alain RODRIGUEZ <ar...@gmail.com>> wrote:
Hi Anishek,

Even if it highly depends on your workload, here are my thoughts:

`HEAP_NEWSIZE=4G.` is probably far too high (try 1200M <-> 2G)
`MAX_HEAP_SIZE=6G` might be too low, how much memory is available (You might want to keep this as it or even reduce it if you have less than 16 GB of native memory. Go with 8 GB if you have a lot of memory.
`-XX:MaxTenuringThreshold=50` is the highest value I have seen in use so far. I had luck with values between 4 <--> 16 in the past. I would give  a try with 15.
`-XX:CMSInitiatingOccupancyFraction=70`--> Why not using default - 75 ? Using default and then tune from there to improve things is generally a good idea.

You also use a bunch of option I don't know about, if you are uncertain about them, you could try a default conf without the options you added and just the using the changes above from default https://github.com/apache/cassandra/blob/cassandra-2.0/conf/cassandra-env.sh. Or you might find more useful information on a nice reference about this topic which is Al Tobey's blog post about tuning 2.1. Go to the 'Java Virtual Machine' part: https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html

FWIW, I also saw improvement in the past by upgrading to 2.1, Java 8 and G1GC. G1GC is supposed to be easier to configure too.

the average row size for compacted partitions is about 1640 bytes on all nodes. We have replication factor 3 but the problem is only on two nodes.

I think Jeff is trying to spot a wide row messing with your system, so looking at the max row size on those nodes compared to other is more relevant than average size for this check.

the only other thing that stands out in cfstats is the read time and write time on the nodes with high GC is 5-7 times higher than other 5 nodes, but i think thats expected.

I would probably look at this the reverse way: I imagine that extra GC  is a consequence of something going wrong on those nodes as JVM / GC are configured the same way cluster-wide. GC / JVM issues are often due to Cassandra / system / hardware issues, inducing extra pressure on the JVM. I would try to tune JVM / GC only once the system is healthy. So I often saw high GC being a consequence rather than the root cause of an issue.

To explore this possibility:

Does this command show some dropped or blocked tasks? This would add pressure to heap.
nodetool tpstats

Do you have errors in logs? Always good to know when facing an issue.
grep -i "ERROR" /var/log/cassandra/system.log

How are compactions tuned (throughput + concurrent compactors)? This tuning might explain compactions not keeping up or a high GC pressure.

What are your disks / CPU? To help us giving you good arbitrary values to try.

Is there some iowait ? Could point to a bottleneck or bad hardware.
iostat -mx 5 100

...

Hope one of those will point you to an issue, but there are many more thing you could check.

Let us know how it goes,

C*heers,
-----------------------
Alain Rodriguez - alain@thelastpickle.com<ma...@thelastpickle.com>
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com



2016-03-02 10:33 GMT+01:00 Anishek Agarwal <an...@gmail.com>>:
also MAX_HEAP_SIZE=6G and HEAP_NEWSIZE=4G.

On Wed, Mar 2, 2016 at 1:40 PM, Anishek Agarwal <an...@gmail.com>> wrote:
Hey Jeff,

one of the nodes with high GC has 1400 SST tables, all other nodes have about 500-900 SST tables. the other node with high GC has 636 SST tables.

the average row size for compacted partitions is about 1640 bytes on all nodes. We have replication factor 3 but the problem is only on two nodes.
the only other thing that stands out in cfstats is the read time and write time on the nodes with high GC is 5-7 times higher than other 5 nodes, but i think thats expected.

thanks
anishek




On Wed, Mar 2, 2016 at 1:09 PM, Jeff Jirsa <je...@crowdstrike.com>> wrote:
Compaction falling behind will likely cause additional work on reads (more sstables to merge), but I’d be surprised if it manifested in super long GC. When you say twice as many sstables, how many is that?.

In cfstats, does anything stand out? Is max row size on those nodes larger than on other nodes?

What you don’t show in your JVM options is the new gen size – if you do have unusually large partitions on those two nodes (especially likely if you have rf=2 – if you have rf=3, then there’s probably a third node misbehaving you haven’t found yet), then raising new gen size can help handle the garbage created by reading large partitions without having to tolerate the promotion. Estimates for the amount of garbage vary, but it could be “gigabytes” of garbage on a very wide partition (see https://issues.apache.org/jira/browse/CASSANDRA-9754 for work in progress to help mitigate that type of pain).

- Jeff

From: Anishek Agarwal
Reply-To: "user@cassandra.apache.org<ma...@cassandra.apache.org>"
Date: Tuesday, March 1, 2016 at 11:12 PM
To: "user@cassandra.apache.org<ma...@cassandra.apache.org>"
Subject: Lot of GC on two nodes out of 7

Hello,

we have a cassandra cluster of 7 nodes, all of them have the same JVM GC configurations, all our writes /  reads use the TokenAware Policy wrapping a DCAware policy. All nodes are part of same Datacenter.

We are seeing that two nodes are having high GC collection times. Then mostly seem to spend time in GC like about 300-600 ms. This also seems to result in higher CPU utilisation on these machines. Other  5 nodes don't have this problem.

There is no additional repair activity going on the cluster, we are not sure why this is happening.
we checked cfhistograms on the two CF we have in the cluster and number of reads seems to be almost same.

we also used cfstats to see the number of ssttables on each node and one of the nodes with the above problem has twice the number of ssttables than other nodes. This still doesnot explain why two nodes have high GC Overheads. our GC config is as below:

JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"

JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"

JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"

JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"

JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=50"

JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=70"

JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"

JVM_OPTS="$JVM_OPTS -XX:+UseTLAB"

JVM_OPTS="$JVM_OPTS -XX:MaxPermSize=256m"

JVM_OPTS="$JVM_OPTS -XX:+AggressiveOpts"

JVM_OPTS="$JVM_OPTS -XX:+UseCompressedOops"

JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"

JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=48"

JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=48"

JVM_OPTS="$JVM_OPTS -XX:-ExplicitGCInvokesConcurrent"

JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"

JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"

JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"

# earlier value 131072 = 32768 * 4

JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=131072"

JVM_OPTS="$JVM_OPTS -XX:CMSScheduleRemarkEdenSizeThreshold=104857600"

JVM_OPTS="$JVM_OPTS -XX:CMSRescanMultiple=32768"

JVM_OPTS="$JVM_OPTS -XX:CMSConcMarkMultiple=32768"

#new

JVM_OPTS="$JVM_OPTS -XX:+CMSConcurrentMTEnabled"

We are using cassandra 2.0.17. If anyone has any suggestion as to how what else we can look for to understand why this is happening please do reply.



Thanks
anishek








Re: Lot of GC on two nodes out of 7

Posted by Jonathan Haddad <jo...@jonhaddad.com>.
Can you post a gist of the output of jstat -gccause (60 seconds worth)?  I
think it's cool you're willing to experiment with alternative JVM settings
but I've never seen anyone use max tenuring threshold of 50 either and I
can't imagine it's helpful.  Keep in mind if your objects are actually
reaching that threshold it means they've been copied 50x (really really
slow) and also you're going to end up spilling your eden objects directly
into your old gen if your survivor is full.  Considering the small amount
of memory you're using for heap I'm really not surprised you're running
into problems.
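
One way to capture that, as a sketch (the pgrep-based pid lookup is an assumption; adjust it to however Cassandra runs on these hosts):

# sample GC cause statistics once per second for 60 seconds
CASS_PID=$(pgrep -f CassandraDaemon)       # assumes a single Cassandra JVM per host
jstat -gccause "$CASS_PID" 1000 60 | tee gccause.txt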

I recommend G1GC + 12GB heap and just let it optimize itself for almost all
cases with the latest JVM versions.
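
A rough sketch of that change in cassandra-env.sh, assuming the ParNew/CMS flags listed earlier are removed first (the values are a starting point, not a tuned recommendation):

MAX_HEAP_SIZE="12G"
JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
JVM_OPTS="$JVM_OPTS -XX:MaxGCPauseMillis=200"   # optional pause-time goal
# drop -XX:+UseParNewGC, -XX:+UseConcMarkSweepGC and the fixed new-gen sizing;
# G1 sizes its young generation on its own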

On Wed, Mar 2, 2016 at 6:08 AM Alain RODRIGUEZ <ar...@gmail.com> wrote:

> It looks like you are doing a good work with this cluster and know a lot
> about JVM, that's good :-).
>
> our machine configurations are : 2 X 800 GB SSD , 48 cores, 64 GB RAM
>
>
> That's good hardware too.
>
> With 64 GB of ram I would probably directly give a try to
> `MAX_HEAP_SIZE=8G` on one of the 2 bad nodes probably.
>
> Also I would also probably try lowering `HEAP_NEWSIZE=2G.` and using
> `-XX:MaxTenuringThreshold=15`, still on the canary node to observe the
> effects. But that's just an idea of something I would try to see the
> impacts, I don't think it will solve your current issues or even make it
> worse for this node.
>
> Using G1GC would allow you to use a bigger Heap size. Using C*2.1 would
> allow you to store the memtables off-heap. Those are 2 improvements
> reducing the heap pressure that you might be interested in.
>
> I have spent time reading about all other options before including them
>> and a similar configuration on our other prod cluster is showing good GC
>> graphs via gcviewer.
>
>
> So, let's look for an other reason.
>
> there are MUTATION and READ messages dropped in high number on nodes in
>> question and on other 5 nodes it varies between 1-3.
>
>
> - Is Memory, CPU or disk a bottleneck? Is one of those running at the
> limits?
>
> concurrent_compactors: 48
>
>
> Reducing this to 8 would free some space for transactions (R&W requests).
> It is probably worth a try, even more when compaction is not keeping up and
> compaction throughput is not throttled.
>
> Just found an issue about that:
> https://issues.apache.org/jira/browse/CASSANDRA-7139
>
> Looks like `concurrent_compactors: 8` is the new default.
>
> C*heers,
> -----------------------
> Alain Rodriguez - alain@thelastpickle.com
> France
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
>
>
>
>
>
> 2016-03-02 12:27 GMT+01:00 Anishek Agarwal <an...@gmail.com>:
>
>> Thanks a lot Alian for the details.
>>
>>> `HEAP_NEWSIZE=4G.` is probably far too high (try 1200M <-> 2G)
>>> `MAX_HEAP_SIZE=6G` might be too low, how much memory is available (You
>>> might want to keep this as it or even reduce it if you have less than 16 GB
>>> of native memory. Go with 8 GB if you have a lot of memory.
>>> `-XX:MaxTenuringThreshold=50` is the highest value I have seen in use so
>>> far. I had luck with values between 4 <--> 16 in the past. I would give  a
>>> try with 15.
>>> `-XX:CMSInitiatingOccupancyFraction=70`--> Why not using default - 75 ?
>>> Using default and then tune from there to improve things is generally a
>>> good idea.
>>
>>
>>
>> we have a lot of reads and writes onto the system so keeping the high new
>> size to make sure enough is held in memory including caches / memtables etc
>> --number of flush_writers : 4 for us. similarly keeping less in old
>> generation to make sure we spend less time with CMS GC most of the data is
>> transient in memory for us. Keeping high TenuringThreshold because we don't
>> want objects going to old generation and just die in young generation given
>> we have configured large survivor spaces.
>> using occupancyFraction as 70 since
>> given heap is 4G
>> survivor space is : 400 mb -- 2 survivor spaces
>> 70 % of 2G (old generation) = 1.4G
>>
>> so once we are just below 1.4G and we have to move the full survivor +
>> some extra during a par new gc due to promotion failure, everything will
>> fit in old generation, and will trigger CMS.
>>
>> I have spent time reading about all other options before including them
>> and a similar configuration on our other prod cluster is showing good GC
>> graphs via gcviewer.
>>
>> tp stats on all machines show flush writer blocked at : 0.3% of total
>>
>> the two nodes in question have stats almost as below
>>
>>    - specifically there are pending was in readStage, MutationStage and
>>    RequestResponseStage
>>
>> Pool Name                    Active   Pending      Completed   Blocked
>> All time blocked
>>
>> ReadStage                        21        19     2141798645         0
>>               0
>>
>> RequestResponseStage              0         1      803242391         0
>>               0
>>
>> MutationStage                     0         0      291813703         0
>>               0
>>
>> ReadRepairStage                   0         0      200544344         0
>>               0
>>
>> ReplicateOnWriteStage             0         0              0         0
>>               0
>>
>> GossipStage                       0         0         292477         0
>>               0
>>
>> CacheCleanupExecutor              0         0              0         0
>>               0
>>
>> MigrationStage                    0         0              0         0
>>               0
>>
>> MemoryMeter                       0         0           2172         0
>>               0
>>
>> FlushWriter                       0         0           2756         0
>>               6
>>
>> ValidationExecutor                0         0            101         0
>>               0
>>
>> InternalResponseStage             0         0              0         0
>>               0
>>
>> AntiEntropyStage                  0         0            202         0
>>               0
>>
>> MemtablePostFlusher               0         0           4395         0
>>               0
>>
>> MiscStage                         0         0              0         0
>>               0
>>
>> PendingRangeCalculator            0         0             20         0
>>               0
>>
>> CompactionExecutor                4         4          49323         0
>>               0
>>
>> commitlog_archiver                0         0              0         0
>>               0
>>
>> HintedHandoff                     0         0            116         0
>>               0
>>
>>
>> Message type           Dropped
>>
>> RANGE_SLICE                  0
>>
>> READ_REPAIR                 36
>>
>> PAGED_RANGE                  0
>>
>> BINARY                       0
>>
>> READ                     11471
>>
>> MUTATION                   898
>>
>> _TRACE                       0
>>
>> REQUEST_RESPONSE             0
>>
>> COUNTER_MUTATION             0
>>
>> all the other 5 nodes show no pending numbers.
>>
>>
>> our machine configurations are : 2 X 800 GB SSD , 48 cores, 64 GB RAM
>> compaction throughput is 0 MB/s
>> concurrent_compactors: 48
>> flush_writers: 4
>>
>>
>>> I think Jeff is trying to spot a wide row messing with your system, so
>>> looking at the max row size on those nodes compared to other is more
>>> relevant than average size for this check.
>>
>>
>> i think is what you are looking for, please correct me if i am wrong
>>
>> Compacted partition maximum bytes: 1629722
>> similar value on all 7 nodes.
>>
>> grep -i "ERROR" /var/log/cassandra/system.log
>>
>>
>> there are MUTATION and READ messages dropped in high number on nodes in
>> question and on other 5 nodes it varies between 1-3.
>>
>> On Wed, Mar 2, 2016 at 4:15 PM, Alain RODRIGUEZ <ar...@gmail.com>
>> wrote:
>>
>>> Hi Anishek,
>>>
>>> Even if it highly depends on your workload, here are my thoughts:
>>>
>>> `HEAP_NEWSIZE=4G.` is probably far too high (try 1200M <-> 2G)
>>> `MAX_HEAP_SIZE=6G` might be too low, how much memory is available (You
>>> might want to keep this as it or even reduce it if you have less than 16 GB
>>> of native memory. Go with 8 GB if you have a lot of memory.
>>> `-XX:MaxTenuringThreshold=50` is the highest value I have seen in use so
>>> far. I had luck with values between 4 <--> 16 in the past. I would give  a
>>> try with 15.
>>> `-XX:CMSInitiatingOccupancyFraction=70`--> Why not using default - 75 ?
>>> Using default and then tune from there to improve things is generally a
>>> good idea.
>>>
>>> You also use a bunch of option I don't know about, if you are uncertain
>>> about them, you could try a default conf without the options you added and
>>> just the using the changes above from default
>>> https://github.com/apache/cassandra/blob/cassandra-2.0/conf/cassandra-env.sh.
>>> Or you might find more useful information on a nice reference about
>>> this topic which is Al Tobey's blog post about tuning 2.1. Go to the 'Java
>>> Virtual Machine' part:
>>> https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html
>>>
>>> FWIW, I also saw improvement in the past by upgrading to 2.1, Java 8 and
>>> G1GC. G1GC is supposed to be easier to configure too.
>>>
>>> the average row size for compacted partitions is about 1640 bytes on all
>>>> nodes. We have replication factor 3 but the problem is only on two nodes.
>>>>
>>>
>>> I think Jeff is trying to spot a wide row messing with your system, so
>>> looking at the max row size on those nodes compared to other is more
>>> relevant than average size for this check.
>>>
>>> the only other thing that stands out in cfstats is the read time and
>>>> write time on the nodes with high GC is 5-7 times higher than other 5
>>>> nodes, but i think thats expected.
>>>
>>>
>>> I would probably look at this the reverse way: I imagine that extra GC
>>>  is a consequence of something going wrong on those nodes as JVM / GC are
>>> configured the same way cluster-wide. GC / JVM issues are often due to
>>> Cassandra / system / hardware issues, inducing extra pressure on the JVM. I
>>> would try to tune JVM / GC only once the system is healthy. So I often saw
>>> high GC being a consequence rather than the root cause of an issue.
>>>
>>> To explore this possibility:
>>>
>>> Does this command show some dropped or blocked tasks? This would add
>>> pressure to heap.
>>> nodetool tpstats
>>>
>>> Do you have errors in logs? Always good to know when facing an issue.
>>> grep -i "ERROR" /var/log/cassandra/system.log
>>>
>>> How are compactions tuned (throughput + concurrent compactors)? This
>>> tuning might explain compactions not keeping up or a high GC pressure.
>>>
>>> What are your disks / CPU? To help us giving you good arbitrary values
>>> to try.
>>>
>>> Is there some iowait ? Could point to a bottleneck or bad hardware.
>>> iostat -mx 5 100
>>>
>>> ...
>>>
>>> Hope one of those will point you to an issue, but there are many more
>>> thing you could check.
>>>
>>> Let us know how it goes,
>>>
>>> C*heers,
>>> -----------------------
>>> Alain Rodriguez - alain@thelastpickle.com
>>> France
>>>
>>> The Last Pickle - Apache Cassandra Consulting
>>> http://www.thelastpickle.com
>>>
>>>
>>>
>>> 2016-03-02 10:33 GMT+01:00 Anishek Agarwal <an...@gmail.com>:
>>>
>>>> also MAX_HEAP_SIZE=6G and HEAP_NEWSIZE=4G.
>>>>
>>>> On Wed, Mar 2, 2016 at 1:40 PM, Anishek Agarwal <an...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hey Jeff,
>>>>>
>>>>> one of the nodes with high GC has 1400 SST tables, all other nodes
>>>>> have about 500-900 SST tables. the other node with high GC has 636 SST
>>>>> tables.
>>>>>
>>>>> the average row size for compacted partitions is about 1640 bytes on
>>>>> all nodes. We have replication factor 3 but the problem is only on two
>>>>> nodes.
>>>>> the only other thing that stands out in cfstats is the read time and
>>>>> write time on the nodes with high GC is 5-7 times higher than other 5
>>>>> nodes, but i think thats expected.
>>>>>
>>>>> thanks
>>>>> anishek
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Mar 2, 2016 at 1:09 PM, Jeff Jirsa <jeff.jirsa@crowdstrike.com
>>>>> > wrote:
>>>>>
>>>>>> Compaction falling behind will likely cause additional work on reads
>>>>>> (more sstables to merge), but I’d be surprised if it manifested in super
>>>>>> long GC. When you say twice as many sstables, how many is that?.
>>>>>>
>>>>>> In cfstats, does anything stand out? Is max row size on those nodes
>>>>>> larger than on other nodes?
>>>>>>
>>>>>> What you don’t show in your JVM options is the new gen size – if you
>>>>>> do have unusually large partitions on those two nodes (especially likely if
>>>>>> you have rf=2 – if you have rf=3, then there’s probably a third node
>>>>>> misbehaving you haven’t found yet), then raising new gen size can help
>>>>>> handle the garbage created by reading large partitions without having to
>>>>>> tolerate the promotion. Estimates for the amount of garbage vary, but it
>>>>>> could be “gigabytes” of garbage on a very wide partition (see
>>>>>> https://issues.apache.org/jira/browse/CASSANDRA-9754 for work in
>>>>>> progress to help mitigate that type of pain).
>>>>>>
>>>>>> - Jeff
>>>>>>
>>>>>> From: Anishek Agarwal
>>>>>> Reply-To: "user@cassandra.apache.org"
>>>>>> Date: Tuesday, March 1, 2016 at 11:12 PM
>>>>>> To: "user@cassandra.apache.org"
>>>>>> Subject: Lot of GC on two nodes out of 7
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> we have a cassandra cluster of 7 nodes, all of them have the same JVM
>>>>>> GC configurations, all our writes /  reads use the TokenAware Policy
>>>>>> wrapping a DCAware policy. All nodes are part of same Datacenter.
>>>>>>
>>>>>> We are seeing that two nodes are having high GC collection times.
>>>>>> Then mostly seem to spend time in GC like about 300-600 ms. This also seems
>>>>>> to result in higher CPU utilisation on these machines. Other  5 nodes don't
>>>>>> have this problem.
>>>>>>
>>>>>> There is no additional repair activity going on the cluster, we are
>>>>>> not sure why this is happening.
>>>>>> we checked cfhistograms on the two CF we have in the cluster and
>>>>>> number of reads seems to be almost same.
>>>>>>
>>>>>> we also used cfstats to see the number of ssttables on each node and
>>>>>> one of the nodes with the above problem has twice the number of ssttables
>>>>>> than other nodes. This still doesnot explain why two nodes have high GC
>>>>>> Overheads. our GC config is as below:
>>>>>>
>>>>>> JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
>>>>>>
>>>>>> JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
>>>>>>
>>>>>> JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
>>>>>>
>>>>>> JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
>>>>>>
>>>>>> JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=50"
>>>>>>
>>>>>> JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=70"
>>>>>>
>>>>>> JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
>>>>>>
>>>>>> JVM_OPTS="$JVM_OPTS -XX:+UseTLAB"
>>>>>>
>>>>>> JVM_OPTS="$JVM_OPTS -XX:MaxPermSize=256m"
>>>>>>
>>>>>> JVM_OPTS="$JVM_OPTS -XX:+AggressiveOpts"
>>>>>>
>>>>>> JVM_OPTS="$JVM_OPTS -XX:+UseCompressedOops"
>>>>>>
>>>>>> JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"
>>>>>>
>>>>>> JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=48"
>>>>>>
>>>>>> JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=48"
>>>>>>
>>>>>> JVM_OPTS="$JVM_OPTS -XX:-ExplicitGCInvokesConcurrent"
>>>>>>
>>>>>> JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
>>>>>>
>>>>>> JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"
>>>>>>
>>>>>> JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"
>>>>>>
>>>>>> # earlier value 131072 = 32768 * 4
>>>>>>
>>>>>> JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=131072"
>>>>>>
>>>>>> JVM_OPTS="$JVM_OPTS -XX:CMSScheduleRemarkEdenSizeThreshold=104857600"
>>>>>>
>>>>>> JVM_OPTS="$JVM_OPTS -XX:CMSRescanMultiple=32768"
>>>>>>
>>>>>> JVM_OPTS="$JVM_OPTS -XX:CMSConcMarkMultiple=32768"
>>>>>>
>>>>>> #new
>>>>>>
>>>>>> JVM_OPTS="$JVM_OPTS -XX:+CMSConcurrentMTEnabled"
>>>>>>
>>>>>> We are using cassandra 2.0.17. If anyone has any suggestion as to how
>>>>>> what else we can look for to understand why this is happening please do
>>>>>> reply.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks
>>>>>> anishek
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Lot of GC on two nodes out of 7

Posted by Alain RODRIGUEZ <ar...@gmail.com>.
It looks like you are doing good work with this cluster and know a lot
about the JVM, that's good :-).

our machine configurations are : 2 X 800 GB SSD , 48 cores, 64 GB RAM


That's good hardware too.

With 64 GB of RAM I would probably give `MAX_HEAP_SIZE=8G` a try directly,
on one of the 2 bad nodes first.

I would also probably try lowering the new gen to `HEAP_NEWSIZE=2G` and using
`-XX:MaxTenuringThreshold=15`, still on the canary node, to observe the
effects. But that's just an idea of something I would try to see the
impact; I don't think it will solve your current issues, though it
shouldn't make things worse for this node either.
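
As a sketch, the canary node's cassandra-env.sh would then carry something like the following, using the values suggested above (to be reverted if the node behaves worse):

MAX_HEAP_SIZE="8G"
HEAP_NEWSIZE="2G"
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=15"   # instead of 50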

Using G1GC would allow you to use a bigger Heap size. Using C*2.1 would
allow you to store the memtables off-heap. Those are 2 improvements
reducing the heap pressure that you might be interested in.
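
For the off-heap memtables, on 2.1+ this is a cassandra.yaml setting; a minimal example, assuming the 2.1 defaults otherwise:

# cassandra.yaml, Cassandra 2.1+ only
memtable_allocation_type: offheap_objects   # keep memtable cell data off the Java heap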

I have spent time reading about all other options before including them and
> a similar configuration on our other prod cluster is showing good GC graphs
> via gcviewer.


So, let's look for another reason.

there are MUTATION and READ messages dropped in high number on nodes in
> question and on other 5 nodes it varies between 1-3.


- Is Memory, CPU or disk a bottleneck? Is one of those running at the
limits?
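
A quick way to check, as a sketch (standard sysstat/procps tools, a few 5-second samples on the two hot nodes):

iostat -mx 5 3    # disk utilisation, await, MB/s
vmstat 5 3        # run queue, memory, swap, CPU breakdown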

concurrent_compactors: 48


Reducing this to 8 would free some room for transactions (R&W requests).
It is probably worth a try, even more so given that compaction is not
keeping up and compaction throughput is not throttled.

Just found an issue about that:
https://issues.apache.org/jira/browse/CASSANDRA-7139

Looks like `concurrent_compactors: 8` is the new default.
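
A sketch of the corresponding change, plus a way to limit compaction at runtime and observe the effect before touching the yaml (16 MB/s is only an example value):

# cassandra.yaml
concurrent_compactors: 8
# or, without a restart, throttle compaction I/O to see the impact first
nodetool setcompactionthroughput 16    # MB/s; 0 currently means unthrottled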

C*heers,
-----------------------
Alain Rodriguez - alain@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com






2016-03-02 12:27 GMT+01:00 Anishek Agarwal <an...@gmail.com>:

> Thanks a lot Alian for the details.
>
>> `HEAP_NEWSIZE=4G.` is probably far too high (try 1200M <-> 2G)
>> `MAX_HEAP_SIZE=6G` might be too low, how much memory is available (You
>> might want to keep this as it or even reduce it if you have less than 16 GB
>> of native memory. Go with 8 GB if you have a lot of memory.
>> `-XX:MaxTenuringThreshold=50` is the highest value I have seen in use so
>> far. I had luck with values between 4 <--> 16 in the past. I would give  a
>> try with 15.
>> `-XX:CMSInitiatingOccupancyFraction=70`--> Why not using default - 75 ?
>> Using default and then tune from there to improve things is generally a
>> good idea.
>
>
>
> we have a lot of reads and writes onto the system so keeping the high new
> size to make sure enough is held in memory including caches / memtables etc
> --number of flush_writers : 4 for us. similarly keeping less in old
> generation to make sure we spend less time with CMS GC most of the data is
> transient in memory for us. Keeping high TenuringThreshold because we don't
> want objects going to old generation and just die in young generation given
> we have configured large survivor spaces.
> using occupancyFraction as 70 since
> given heap is 4G
> survivor space is : 400 mb -- 2 survivor spaces
> 70 % of 2G (old generation) = 1.4G
>
> so once we are just below 1.4G and we have to move the full survivor +
> some extra during a par new gc due to promotion failure, everything will
> fit in old generation, and will trigger CMS.
>
> I have spent time reading about all other options before including them
> and a similar configuration on our other prod cluster is showing good GC
> graphs via gcviewer.
>
> tp stats on all machines show flush writer blocked at : 0.3% of total
>
> the two nodes in question have stats almost as below
>
>    - specifically there are pending was in readStage, MutationStage and
>    RequestResponseStage
>
> Pool Name                    Active   Pending      Completed   Blocked
> All time blocked
>
> ReadStage                        21        19     2141798645         0
>             0
>
> RequestResponseStage              0         1      803242391         0
>             0
>
> MutationStage                     0         0      291813703         0
>             0
>
> ReadRepairStage                   0         0      200544344         0
>             0
>
> ReplicateOnWriteStage             0         0              0         0
>             0
>
> GossipStage                       0         0         292477         0
>             0
>
> CacheCleanupExecutor              0         0              0         0
>             0
>
> MigrationStage                    0         0              0         0
>             0
>
> MemoryMeter                       0         0           2172         0
>             0
>
> FlushWriter                       0         0           2756         0
>             6
>
> ValidationExecutor                0         0            101         0
>             0
>
> InternalResponseStage             0         0              0         0
>             0
>
> AntiEntropyStage                  0         0            202         0
>             0
>
> MemtablePostFlusher               0         0           4395         0
>             0
>
> MiscStage                         0         0              0         0
>             0
>
> PendingRangeCalculator            0         0             20         0
>             0
>
> CompactionExecutor                4         4          49323         0
>             0
>
> commitlog_archiver                0         0              0         0
>             0
>
> HintedHandoff                     0         0            116         0
>             0
>
>
> Message type           Dropped
>
> RANGE_SLICE                  0
>
> READ_REPAIR                 36
>
> PAGED_RANGE                  0
>
> BINARY                       0
>
> READ                     11471
>
> MUTATION                   898
>
> _TRACE                       0
>
> REQUEST_RESPONSE             0
>
> COUNTER_MUTATION             0
>
> all the other 5 nodes show no pending numbers.
>
>
> our machine configurations are : 2 X 800 GB SSD , 48 cores, 64 GB RAM
> compaction throughput is 0 MB/s
> concurrent_compactors: 48
> flush_writers: 4
>
>
>> I think Jeff is trying to spot a wide row messing with your system, so
>> looking at the max row size on those nodes compared to other is more
>> relevant than average size for this check.
>
>
> i think is what you are looking for, please correct me if i am wrong
>
> Compacted partition maximum bytes: 1629722
> similar value on all 7 nodes.
>
> grep -i "ERROR" /var/log/cassandra/system.log
>
>
> there are MUTATION and READ messages dropped in high number on nodes in
> question and on other 5 nodes it varies between 1-3.
>
> On Wed, Mar 2, 2016 at 4:15 PM, Alain RODRIGUEZ <ar...@gmail.com>
> wrote:
>
>> Hi Anishek,
>>
>> Even if it highly depends on your workload, here are my thoughts:
>>
>> `HEAP_NEWSIZE=4G.` is probably far too high (try 1200M <-> 2G)
>> `MAX_HEAP_SIZE=6G` might be too low, how much memory is available (You
>> might want to keep this as it or even reduce it if you have less than 16 GB
>> of native memory. Go with 8 GB if you have a lot of memory.
>> `-XX:MaxTenuringThreshold=50` is the highest value I have seen in use so
>> far. I had luck with values between 4 <--> 16 in the past. I would give  a
>> try with 15.
>> `-XX:CMSInitiatingOccupancyFraction=70`--> Why not using default - 75 ?
>> Using default and then tune from there to improve things is generally a
>> good idea.
>>
>> You also use a bunch of option I don't know about, if you are uncertain
>> about them, you could try a default conf without the options you added and
>> just the using the changes above from default
>> https://github.com/apache/cassandra/blob/cassandra-2.0/conf/cassandra-env.sh.
>> Or you might find more useful information on a nice reference about this
>> topic which is Al Tobey's blog post about tuning 2.1. Go to the 'Java
>> Virtual Machine' part:
>> https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html
>>
>> FWIW, I also saw improvement in the past by upgrading to 2.1, Java 8 and
>> G1GC. G1GC is supposed to be easier to configure too.
>>
>> the average row size for compacted partitions is about 1640 bytes on all
>>> nodes. We have replication factor 3 but the problem is only on two nodes.
>>>
>>
>> I think Jeff is trying to spot a wide row messing with your system, so
>> looking at the max row size on those nodes compared to other is more
>> relevant than average size for this check.
>>
>> the only other thing that stands out in cfstats is the read time and
>>> write time on the nodes with high GC is 5-7 times higher than other 5
>>> nodes, but i think thats expected.
>>
>>
>> I would probably look at this the reverse way: I imagine that extra GC
>>  is a consequence of something going wrong on those nodes as JVM / GC are
>> configured the same way cluster-wide. GC / JVM issues are often due to
>> Cassandra / system / hardware issues, inducing extra pressure on the JVM. I
>> would try to tune JVM / GC only once the system is healthy. So I often saw
>> high GC being a consequence rather than the root cause of an issue.
>>
>> To explore this possibility:
>>
>> Does this command show some dropped or blocked tasks? This would add
>> pressure to heap.
>> nodetool tpstats
>>
>> Do you have errors in logs? Always good to know when facing an issue.
>> grep -i "ERROR" /var/log/cassandra/system.log
>>
>> How are compactions tuned (throughput + concurrent compactors)? This
>> tuning might explain compactions not keeping up or a high GC pressure.
>>
>> What are your disks / CPU? To help us giving you good arbitrary values to
>> try.
>>
>> Is there some iowait ? Could point to a bottleneck or bad hardware.
>> iostat -mx 5 100
>>
>> ...
>>
>> Hope one of those will point you to an issue, but there are many more
>> thing you could check.
>>
>> Let us know how it goes,
>>
>> C*heers,
>> -----------------------
>> Alain Rodriguez - alain@thelastpickle.com
>> France
>>
>> The Last Pickle - Apache Cassandra Consulting
>> http://www.thelastpickle.com
>>
>>
>>
>> 2016-03-02 10:33 GMT+01:00 Anishek Agarwal <an...@gmail.com>:
>>
>>> also MAX_HEAP_SIZE=6G and HEAP_NEWSIZE=4G.
>>>
>>> On Wed, Mar 2, 2016 at 1:40 PM, Anishek Agarwal <an...@gmail.com>
>>> wrote:
>>>
>>>> Hey Jeff,
>>>>
>>>> one of the nodes with high GC has 1400 SST tables, all other nodes have
>>>> about 500-900 SST tables. the other node with high GC has 636 SST tables.
>>>>
>>>> the average row size for compacted partitions is about 1640 bytes on
>>>> all nodes. We have replication factor 3 but the problem is only on two
>>>> nodes.
>>>> the only other thing that stands out in cfstats is the read time and
>>>> write time on the nodes with high GC is 5-7 times higher than other 5
>>>> nodes, but i think thats expected.
>>>>
>>>> thanks
>>>> anishek
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, Mar 2, 2016 at 1:09 PM, Jeff Jirsa <je...@crowdstrike.com>
>>>> wrote:
>>>>
>>>>> Compaction falling behind will likely cause additional work on reads
>>>>> (more sstables to merge), but I’d be surprised if it manifested in super
>>>>> long GC. When you say twice as many sstables, how many is that?.
>>>>>
>>>>> In cfstats, does anything stand out? Is max row size on those nodes
>>>>> larger than on other nodes?
>>>>>
>>>>> What you don’t show in your JVM options is the new gen size – if you
>>>>> do have unusually large partitions on those two nodes (especially likely if
>>>>> you have rf=2 – if you have rf=3, then there’s probably a third node
>>>>> misbehaving you haven’t found yet), then raising new gen size can help
>>>>> handle the garbage created by reading large partitions without having to
>>>>> tolerate the promotion. Estimates for the amount of garbage vary, but it
>>>>> could be “gigabytes” of garbage on a very wide partition (see
>>>>> https://issues.apache.org/jira/browse/CASSANDRA-9754 for work in
>>>>> progress to help mitigate that type of pain).
>>>>>
>>>>> - Jeff
>>>>>
>>>>> From: Anishek Agarwal
>>>>> Reply-To: "user@cassandra.apache.org"
>>>>> Date: Tuesday, March 1, 2016 at 11:12 PM
>>>>> To: "user@cassandra.apache.org"
>>>>> Subject: Lot of GC on two nodes out of 7
>>>>>
>>>>> Hello,
>>>>>
>>>>> we have a cassandra cluster of 7 nodes, all of them have the same JVM
>>>>> GC configurations, all our writes /  reads use the TokenAware Policy
>>>>> wrapping a DCAware policy. All nodes are part of same Datacenter.
>>>>>
>>>>> We are seeing that two nodes are having high GC collection times. Then
>>>>> mostly seem to spend time in GC like about 300-600 ms. This also seems to
>>>>> result in higher CPU utilisation on these machines. Other  5 nodes don't
>>>>> have this problem.
>>>>>
>>>>> There is no additional repair activity going on the cluster, we are
>>>>> not sure why this is happening.
>>>>> we checked cfhistograms on the two CF we have in the cluster and
>>>>> number of reads seems to be almost same.
>>>>>
>>>>> we also used cfstats to see the number of ssttables on each node and
>>>>> one of the nodes with the above problem has twice the number of ssttables
>>>>> than other nodes. This still doesnot explain why two nodes have high GC
>>>>> Overheads. our GC config is as below:
>>>>>
>>>>> JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
>>>>>
>>>>> JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
>>>>>
>>>>> JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
>>>>>
>>>>> JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
>>>>>
>>>>> JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=50"
>>>>>
>>>>> JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=70"
>>>>>
>>>>> JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
>>>>>
>>>>> JVM_OPTS="$JVM_OPTS -XX:+UseTLAB"
>>>>>
>>>>> JVM_OPTS="$JVM_OPTS -XX:MaxPermSize=256m"
>>>>>
>>>>> JVM_OPTS="$JVM_OPTS -XX:+AggressiveOpts"
>>>>>
>>>>> JVM_OPTS="$JVM_OPTS -XX:+UseCompressedOops"
>>>>>
>>>>> JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"
>>>>>
>>>>> JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=48"
>>>>>
>>>>> JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=48"
>>>>>
>>>>> JVM_OPTS="$JVM_OPTS -XX:-ExplicitGCInvokesConcurrent"
>>>>>
>>>>> JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
>>>>>
>>>>> JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"
>>>>>
>>>>> JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"
>>>>>
>>>>> # earlier value 131072 = 32768 * 4
>>>>>
>>>>> JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=131072"
>>>>>
>>>>> JVM_OPTS="$JVM_OPTS -XX:CMSScheduleRemarkEdenSizeThreshold=104857600"
>>>>>
>>>>> JVM_OPTS="$JVM_OPTS -XX:CMSRescanMultiple=32768"
>>>>>
>>>>> JVM_OPTS="$JVM_OPTS -XX:CMSConcMarkMultiple=32768"
>>>>>
>>>>> #new
>>>>>
>>>>> JVM_OPTS="$JVM_OPTS -XX:+CMSConcurrentMTEnabled"
>>>>>
>>>>> We are using cassandra 2.0.17. If anyone has any suggestion as to how
>>>>> what else we can look for to understand why this is happening please do
>>>>> reply.
>>>>>
>>>>>
>>>>>
>>>>> Thanks
>>>>> anishek
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Lot of GC on two nodes out of 7

Posted by Anishek Agarwal <an...@gmail.com>.
Thanks a lot Alain for the details.

> `HEAP_NEWSIZE=4G.` is probably far too high (try 1200M <-> 2G)
> `MAX_HEAP_SIZE=6G` might be too low, how much memory is available (You
> might want to keep this as it or even reduce it if you have less than 16 GB
> of native memory. Go with 8 GB if you have a lot of memory.
> `-XX:MaxTenuringThreshold=50` is the highest value I have seen in use so
> far. I had luck with values between 4 <--> 16 in the past. I would give  a
> try with 15.
> `-XX:CMSInitiatingOccupancyFraction=70`--> Why not using default - 75 ?
> Using default and then tune from there to improve things is generally a
> good idea.



We have a lot of reads and writes on the system, so we keep the new gen
large to make sure enough is held in memory, including caches / memtables
etc. (number of flush_writers is 4 for us). Similarly, we keep the old
generation small so that we spend less time in CMS GC, since most of our
data is transient in memory. We keep a high TenuringThreshold because we
don't want objects promoted to the old generation; we want them to just die
in the young generation, given we have configured large survivor spaces.
We use an occupancyFraction of 70 because:
the new gen is 4G, so each of the 2 survivor spaces is about 400 MB
70% of the 2G old generation = 1.4G

So when the old generation is just below 1.4G and a ParNew GC has to
promote the full survivor space plus some extra, everything will still fit
in the old generation, and CMS will be triggered.
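
Spelling that arithmetic out as a quick back-of-the-envelope check (based
only on the flags above, not on measured numbers):

# HEAP_NEWSIZE=4G with -XX:SurvivorRatio=8  =>  eden:s0:s1 = 8:1:1
#   each survivor space ~= 4096 MB / (8 + 2) ~= 400 MB
# old generation = MAX_HEAP_SIZE - HEAP_NEWSIZE = 6G - 4G = 2G
# CMSInitiatingOccupancyFraction=70  =>  CMS starts at ~0.70 * 2G ~= 1.4G
# headroom when CMS starts = 2G - 1.4G = 0.6G, more than one survivor space (400 MB)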

I have spent time reading about all the other options before including
them, and a similar configuration on our other prod cluster is showing good
GC graphs in gcviewer.
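
For anyone who wants to compare the good and bad nodes the same way, a
minimal sketch of GC logging flags that gcviewer can read (the log path is
only an example, adjust to your layout):

JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
JVM_OPTS="$JVM_OPTS -XX:+PrintTenuringDistribution"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"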

tpstats on all machines shows the flush writer blocked at about 0.3% of the
total.

The two nodes in question have stats roughly as below:

   - specifically, there are pending tasks in ReadStage, MutationStage and
   RequestResponseStage

Pool Name                    Active   Pending      Completed   Blocked  All time blocked

ReadStage                        21        19     2141798645         0                 0
RequestResponseStage              0         1      803242391         0                 0
MutationStage                     0         0      291813703         0                 0
ReadRepairStage                   0         0      200544344         0                 0
ReplicateOnWriteStage             0         0              0         0                 0
GossipStage                       0         0         292477         0                 0
CacheCleanupExecutor              0         0              0         0                 0
MigrationStage                    0         0              0         0                 0
MemoryMeter                       0         0           2172         0                 0
FlushWriter                       0         0           2756         0                 6
ValidationExecutor                0         0            101         0                 0
InternalResponseStage             0         0              0         0                 0
AntiEntropyStage                  0         0            202         0                 0
MemtablePostFlusher               0         0           4395         0                 0
MiscStage                         0         0              0         0                 0
PendingRangeCalculator            0         0             20         0                 0
CompactionExecutor                4         4          49323         0                 0
commitlog_archiver                0         0              0         0                 0
HintedHandoff                     0         0            116         0                 0


Message type           Dropped

RANGE_SLICE                  0

READ_REPAIR                 36

PAGED_RANGE                  0

BINARY                       0

READ                     11471

MUTATION                   898

_TRACE                       0

REQUEST_RESPONSE             0

COUNTER_MUTATION             0

all the other 5 nodes show no pending numbers.


Our machine configuration is: 2 x 800 GB SSD, 48 cores, 64 GB RAM.
compaction throughput is 0 MB/s (i.e. unthrottled)
concurrent_compactors: 48
flush_writers: 4
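
For reference, those last three knobs live in cassandra.yaml (the names
below are the yaml option names), and compaction throughput can also be
changed on a live node with nodetool; the 16 MB/s is purely an illustrative
value, not a recommendation:

# cassandra.yaml
compaction_throughput_mb_per_sec: 0    # 0 = unthrottled
concurrent_compactors: 48
memtable_flush_writers: 4

# adjust compaction throughput at runtime, no restart needed
nodetool setcompactionthroughput 16    # illustrative value, in MB/s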


> I think Jeff is trying to spot a wide row messing with your system, so
> looking at the max row size on those nodes compared to other is more
> relevant than average size for this check.


I think this is what you are looking for; please correct me if I am wrong:

Compacted partition maximum bytes: 1629722
similar value on all 7 nodes.

grep -i "ERROR" /var/log/cassandra/system.log


There are MUTATION and READ messages dropped in high numbers on the nodes
in question; on the other 5 nodes the count varies between 1 and 3.
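
A quick way to line the suspect nodes up against a healthy one is something
like the sketch below (hostnames are placeholders; the grep patterns just
match the tpstats / cfstats labels quoted above):

for h in badnode1 badnode2 goodnode1; do
  echo "== $h =="
  ssh "$h" 'nodetool tpstats | grep -A 12 "Message type"'
  ssh "$h" 'nodetool cfstats | egrep "SSTable count|Compacted partition maximum bytes"'
done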

On Wed, Mar 2, 2016 at 4:15 PM, Alain RODRIGUEZ <ar...@gmail.com> wrote:

> Hi Anishek,
>
> Even if it highly depends on your workload, here are my thoughts:
>
> `HEAP_NEWSIZE=4G.` is probably far too high (try 1200M <-> 2G)
> `MAX_HEAP_SIZE=6G` might be too low, how much memory is available (You
> might want to keep this as it or even reduce it if you have less than 16 GB
> of native memory. Go with 8 GB if you have a lot of memory.
> `-XX:MaxTenuringThreshold=50` is the highest value I have seen in use so
> far. I had luck with values between 4 <--> 16 in the past. I would give  a
> try with 15.
> `-XX:CMSInitiatingOccupancyFraction=70`--> Why not using default - 75 ?
> Using default and then tune from there to improve things is generally a
> good idea.
>
> You also use a bunch of option I don't know about, if you are uncertain
> about them, you could try a default conf without the options you added and
> just the using the changes above from default
> https://github.com/apache/cassandra/blob/cassandra-2.0/conf/cassandra-env.sh.
> Or you might find more useful information on a nice reference about this
> topic which is Al Tobey's blog post about tuning 2.1. Go to the 'Java
> Virtual Machine' part:
> https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html
>
> FWIW, I also saw improvement in the past by upgrading to 2.1, Java 8 and
> G1GC. G1GC is supposed to be easier to configure too.
>
> the average row size for compacted partitions is about 1640 bytes on all
>> nodes. We have replication factor 3 but the problem is only on two nodes.
>>
>
> I think Jeff is trying to spot a wide row messing with your system, so
> looking at the max row size on those nodes compared to other is more
> relevant than average size for this check.
>
> the only other thing that stands out in cfstats is the read time and write
>> time on the nodes with high GC is 5-7 times higher than other 5 nodes, but
>> i think thats expected.
>
>
> I would probably look at this the reverse way: I imagine that extra GC  is
> a consequence of something going wrong on those nodes as JVM / GC are
> configured the same way cluster-wide. GC / JVM issues are often due to
> Cassandra / system / hardware issues, inducing extra pressure on the JVM. I
> would try to tune JVM / GC only once the system is healthy. So I often saw
> high GC being a consequence rather than the root cause of an issue.
>
> To explore this possibility:
>
> Does this command show some dropped or blocked tasks? This would add
> pressure to heap.
> nodetool tpstats
>
> Do you have errors in logs? Always good to know when facing an issue.
> grep -i "ERROR" /var/log/cassandra/system.log
>
> How are compactions tuned (throughput + concurrent compactors)? This
> tuning might explain compactions not keeping up or a high GC pressure.
>
> What are your disks / CPU? To help us giving you good arbitrary values to
> try.
>
> Is there some iowait ? Could point to a bottleneck or bad hardware.
> iostat -mx 5 100
>
> ...
>
> Hope one of those will point you to an issue, but there are many more
> thing you could check.
>
> Let us know how it goes,
>
> C*heers,
> -----------------------
> Alain Rodriguez - alain@thelastpickle.com
> France
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
>
>
> 2016-03-02 10:33 GMT+01:00 Anishek Agarwal <an...@gmail.com>:
>
>> also MAX_HEAP_SIZE=6G and HEAP_NEWSIZE=4G.
>>
>> On Wed, Mar 2, 2016 at 1:40 PM, Anishek Agarwal <an...@gmail.com>
>> wrote:
>>
>>> Hey Jeff,
>>>
>>> one of the nodes with high GC has 1400 SST tables, all other nodes have
>>> about 500-900 SST tables. the other node with high GC has 636 SST tables.
>>>
>>> the average row size for compacted partitions is about 1640 bytes on all
>>> nodes. We have replication factor 3 but the problem is only on two nodes.
>>> the only other thing that stands out in cfstats is the read time and
>>> write time on the nodes with high GC is 5-7 times higher than other 5
>>> nodes, but i think thats expected.
>>>
>>> thanks
>>> anishek
>>>
>>>
>>>
>>>
>>> On Wed, Mar 2, 2016 at 1:09 PM, Jeff Jirsa <je...@crowdstrike.com>
>>> wrote:
>>>
>>>> Compaction falling behind will likely cause additional work on reads
>>>> (more sstables to merge), but I’d be surprised if it manifested in super
>>>> long GC. When you say twice as many sstables, how many is that?.
>>>>
>>>> In cfstats, does anything stand out? Is max row size on those nodes
>>>> larger than on other nodes?
>>>>
>>>> What you don’t show in your JVM options is the new gen size – if you do
>>>> have unusually large partitions on those two nodes (especially likely if
>>>> you have rf=2 – if you have rf=3, then there’s probably a third node
>>>> misbehaving you haven’t found yet), then raising new gen size can help
>>>> handle the garbage created by reading large partitions without having to
>>>> tolerate the promotion. Estimates for the amount of garbage vary, but it
>>>> could be “gigabytes” of garbage on a very wide partition (see
>>>> https://issues.apache.org/jira/browse/CASSANDRA-9754 for work in
>>>> progress to help mitigate that type of pain).
>>>>
>>>> - Jeff
>>>>
>>>> From: Anishek Agarwal
>>>> Reply-To: "user@cassandra.apache.org"
>>>> Date: Tuesday, March 1, 2016 at 11:12 PM
>>>> To: "user@cassandra.apache.org"
>>>> Subject: Lot of GC on two nodes out of 7
>>>>
>>>> Hello,
>>>>
>>>> we have a cassandra cluster of 7 nodes, all of them have the same JVM
>>>> GC configurations, all our writes /  reads use the TokenAware Policy
>>>> wrapping a DCAware policy. All nodes are part of same Datacenter.
>>>>
>>>> We are seeing that two nodes are having high GC collection times. Then
>>>> mostly seem to spend time in GC like about 300-600 ms. This also seems to
>>>> result in higher CPU utilisation on these machines. Other  5 nodes don't
>>>> have this problem.
>>>>
>>>> There is no additional repair activity going on the cluster, we are not
>>>> sure why this is happening.
>>>> we checked cfhistograms on the two CF we have in the cluster and number
>>>> of reads seems to be almost same.
>>>>
>>>> we also used cfstats to see the number of ssttables on each node and
>>>> one of the nodes with the above problem has twice the number of ssttables
>>>> than other nodes. This still doesnot explain why two nodes have high GC
>>>> Overheads. our GC config is as below:
>>>>
>>>> JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
>>>>
>>>> JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
>>>>
>>>> JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
>>>>
>>>> JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
>>>>
>>>> JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=50"
>>>>
>>>> JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=70"
>>>>
>>>> JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
>>>>
>>>> JVM_OPTS="$JVM_OPTS -XX:+UseTLAB"
>>>>
>>>> JVM_OPTS="$JVM_OPTS -XX:MaxPermSize=256m"
>>>>
>>>> JVM_OPTS="$JVM_OPTS -XX:+AggressiveOpts"
>>>>
>>>> JVM_OPTS="$JVM_OPTS -XX:+UseCompressedOops"
>>>>
>>>> JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"
>>>>
>>>> JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=48"
>>>>
>>>> JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=48"
>>>>
>>>> JVM_OPTS="$JVM_OPTS -XX:-ExplicitGCInvokesConcurrent"
>>>>
>>>> JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
>>>>
>>>> JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"
>>>>
>>>> JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"
>>>>
>>>> # earlier value 131072 = 32768 * 4
>>>>
>>>> JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=131072"
>>>>
>>>> JVM_OPTS="$JVM_OPTS -XX:CMSScheduleRemarkEdenSizeThreshold=104857600"
>>>>
>>>> JVM_OPTS="$JVM_OPTS -XX:CMSRescanMultiple=32768"
>>>>
>>>> JVM_OPTS="$JVM_OPTS -XX:CMSConcMarkMultiple=32768"
>>>>
>>>> #new
>>>>
>>>> JVM_OPTS="$JVM_OPTS -XX:+CMSConcurrentMTEnabled"
>>>>
>>>> We are using cassandra 2.0.17. If anyone has any suggestion as to how
>>>> what else we can look for to understand why this is happening please do
>>>> reply.
>>>>
>>>>
>>>>
>>>> Thanks
>>>> anishek
>>>>
>>>>
>>>>
>>>
>>
>

Re: Lot of GC on two nodes out of 7

Posted by Alain RODRIGUEZ <ar...@gmail.com>.
Hi Anishek,

Even if it highly depends on your workload, here are my thoughts:

`HEAP_NEWSIZE=4G` is probably far too high (try 1200M <-> 2G).
`MAX_HEAP_SIZE=6G` might be too low; how much memory is available? You
might want to keep this as is, or even reduce it, if you have less than 16 GB
of native memory. Go with 8 GB if you have a lot of memory.
`-XX:MaxTenuringThreshold=50` is the highest value I have seen in use so
far. I had luck with values between 4 <--> 16 in the past. I would give a
try with 15.
`-XX:CMSInitiatingOccupancyFraction=70` --> why not use the default of 75?
Using the default and then tuning from there to improve things is generally
a good idea.
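
Concretely, that would look something like this in cassandra-env.sh
(numbers taken straight from the ranges above; treat them as a starting
point to validate against your GC logs, not as definitive values):

MAX_HEAP_SIZE="8G"    # or keep 6G, or lower, depending on available memory
HEAP_NEWSIZE="2G"     # somewhere in the 1200M <-> 2G range
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=15"
JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"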

You also use a bunch of options I don't know about. If you are uncertain
about them, you could try a default conf without the options you added,
just applying the changes above on top of the default
https://github.com/apache/cassandra/blob/cassandra-2.0/conf/cassandra-env.sh.
Or you might find more useful information in a nice reference on this
topic, Al Tobey's blog post about tuning 2.1. Go to the 'Java
Virtual Machine' part:
https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html

FWIW, I also saw improvement in the past by upgrading to 2.1, Java 8 and
G1GC. G1GC is supposed to be easier to configure too.
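
If you do go that route (2.1 + Java 8), the G1 starting point can be as
simple as the sketch below, tuned afterwards from GC logs:

JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
JVM_OPTS="$JVM_OPTS -XX:MaxGCPauseMillis=500"
# and drop the ParNew / CMS specific flags discussed above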

> the average row size for compacted partitions is about 1640 bytes on all
> nodes. We have replication factor 3 but the problem is only on two nodes.
>

I think Jeff is trying to spot a wide row messing with your system, so
looking at the max row size on those nodes compared to other is more
relevant than average size for this check.

> the only other thing that stands out in cfstats is the read time and write
> time on the nodes with high GC is 5-7 times higher than other 5 nodes, but
> i think thats expected.


I would probably look at this the reverse way: I imagine that the extra GC
is a consequence of something going wrong on those nodes, since JVM / GC
are configured the same way cluster-wide. GC / JVM issues are often due to
Cassandra / system / hardware issues, inducing extra pressure on the JVM. I
would try to tune the JVM / GC only once the system is healthy. I have
often seen high GC be a consequence rather than the root cause of an issue.

To explore this possibility:

Does this command show some dropped or blocked tasks? Those would add
pressure to the heap.
nodetool tpstats
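
For instance, to see whether the pending / dropped counters keep growing
rather than looking at a single snapshot (one possible way, assuming
`watch` is available on the node):

watch -d -n 10 nodetool tpstats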

Do you have errors in logs? Always good to know when facing an issue.
grep -i "ERROR" /var/log/cassandra/system.log

How are compactions tuned (throughput + concurrent compactors)? This tuning
might explain compactions not keeping up or high GC pressure.

What are your disks / CPU? That will help us give you good starting values
to try.

Is there some iowait? It could point to a bottleneck or bad hardware.
iostat -mx 5 100

...

Hope one of those will point you to an issue, but there are many more things
you could check.

Let us know how it goes,

C*heers,
-----------------------
Alain Rodriguez - alain@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com



2016-03-02 10:33 GMT+01:00 Anishek Agarwal <an...@gmail.com>:

> also MAX_HEAP_SIZE=6G and HEAP_NEWSIZE=4G.
>
> On Wed, Mar 2, 2016 at 1:40 PM, Anishek Agarwal <an...@gmail.com> wrote:
>
>> Hey Jeff,
>>
>> one of the nodes with high GC has 1400 SST tables, all other nodes have
>> about 500-900 SST tables. the other node with high GC has 636 SST tables.
>>
>> the average row size for compacted partitions is about 1640 bytes on all
>> nodes. We have replication factor 3 but the problem is only on two nodes.
>> the only other thing that stands out in cfstats is the read time and
>> write time on the nodes with high GC is 5-7 times higher than other 5
>> nodes, but i think thats expected.
>>
>> thanks
>> anishek
>>
>>
>>
>>
>> On Wed, Mar 2, 2016 at 1:09 PM, Jeff Jirsa <je...@crowdstrike.com>
>> wrote:
>>
>>> Compaction falling behind will likely cause additional work on reads
>>> (more sstables to merge), but I’d be surprised if it manifested in super
>>> long GC. When you say twice as many sstables, how many is that?.
>>>
>>> In cfstats, does anything stand out? Is max row size on those nodes
>>> larger than on other nodes?
>>>
>>> What you don’t show in your JVM options is the new gen size – if you do
>>> have unusually large partitions on those two nodes (especially likely if
>>> you have rf=2 – if you have rf=3, then there’s probably a third node
>>> misbehaving you haven’t found yet), then raising new gen size can help
>>> handle the garbage created by reading large partitions without having to
>>> tolerate the promotion. Estimates for the amount of garbage vary, but it
>>> could be “gigabytes” of garbage on a very wide partition (see
>>> https://issues.apache.org/jira/browse/CASSANDRA-9754 for work in
>>> progress to help mitigate that type of pain).
>>>
>>> - Jeff
>>>
>>> From: Anishek Agarwal
>>> Reply-To: "user@cassandra.apache.org"
>>> Date: Tuesday, March 1, 2016 at 11:12 PM
>>> To: "user@cassandra.apache.org"
>>> Subject: Lot of GC on two nodes out of 7
>>>
>>> Hello,
>>>
>>> we have a cassandra cluster of 7 nodes, all of them have the same JVM GC
>>> configurations, all our writes /  reads use the TokenAware Policy wrapping
>>> a DCAware policy. All nodes are part of same Datacenter.
>>>
>>> We are seeing that two nodes are having high GC collection times. Then
>>> mostly seem to spend time in GC like about 300-600 ms. This also seems to
>>> result in higher CPU utilisation on these machines. Other  5 nodes don't
>>> have this problem.
>>>
>>> There is no additional repair activity going on the cluster, we are not
>>> sure why this is happening.
>>> we checked cfhistograms on the two CF we have in the cluster and number
>>> of reads seems to be almost same.
>>>
>>> we also used cfstats to see the number of ssttables on each node and one
>>> of the nodes with the above problem has twice the number of ssttables than
>>> other nodes. This still doesnot explain why two nodes have high GC
>>> Overheads. our GC config is as below:
>>>
>>> JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
>>>
>>> JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
>>>
>>> JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
>>>
>>> JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
>>>
>>> JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=50"
>>>
>>> JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=70"
>>>
>>> JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
>>>
>>> JVM_OPTS="$JVM_OPTS -XX:+UseTLAB"
>>>
>>> JVM_OPTS="$JVM_OPTS -XX:MaxPermSize=256m"
>>>
>>> JVM_OPTS="$JVM_OPTS -XX:+AggressiveOpts"
>>>
>>> JVM_OPTS="$JVM_OPTS -XX:+UseCompressedOops"
>>>
>>> JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"
>>>
>>> JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=48"
>>>
>>> JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=48"
>>>
>>> JVM_OPTS="$JVM_OPTS -XX:-ExplicitGCInvokesConcurrent"
>>>
>>> JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
>>>
>>> JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"
>>>
>>> JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"
>>>
>>> # earlier value 131072 = 32768 * 4
>>>
>>> JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=131072"
>>>
>>> JVM_OPTS="$JVM_OPTS -XX:CMSScheduleRemarkEdenSizeThreshold=104857600"
>>>
>>> JVM_OPTS="$JVM_OPTS -XX:CMSRescanMultiple=32768"
>>>
>>> JVM_OPTS="$JVM_OPTS -XX:CMSConcMarkMultiple=32768"
>>>
>>> #new
>>>
>>> JVM_OPTS="$JVM_OPTS -XX:+CMSConcurrentMTEnabled"
>>>
>>> We are using cassandra 2.0.17. If anyone has any suggestion as to how
>>> what else we can look for to understand why this is happening please do
>>> reply.
>>>
>>>
>>>
>>> Thanks
>>> anishek
>>>
>>>
>>>
>>
>

Re: Lot of GC on two nodes out of 7

Posted by Anishek Agarwal <an...@gmail.com>.
also MAX_HEAP_SIZE=6G and HEAP_NEWSIZE=4G.

On Wed, Mar 2, 2016 at 1:40 PM, Anishek Agarwal <an...@gmail.com> wrote:

> Hey Jeff,
>
> one of the nodes with high GC has 1400 SST tables, all other nodes have
> about 500-900 SST tables. the other node with high GC has 636 SST tables.
>
> the average row size for compacted partitions is about 1640 bytes on all
> nodes. We have replication factor 3 but the problem is only on two nodes.
> the only other thing that stands out in cfstats is the read time and write
> time on the nodes with high GC is 5-7 times higher than other 5 nodes, but
> i think thats expected.
>
> thanks
> anishek
>
>
>
>
> On Wed, Mar 2, 2016 at 1:09 PM, Jeff Jirsa <je...@crowdstrike.com>
> wrote:
>
>> Compaction falling behind will likely cause additional work on reads
>> (more sstables to merge), but I’d be surprised if it manifested in super
>> long GC. When you say twice as many sstables, how many is that?.
>>
>> In cfstats, does anything stand out? Is max row size on those nodes
>> larger than on other nodes?
>>
>> What you don’t show in your JVM options is the new gen size – if you do
>> have unusually large partitions on those two nodes (especially likely if
>> you have rf=2 – if you have rf=3, then there’s probably a third node
>> misbehaving you haven’t found yet), then raising new gen size can help
>> handle the garbage created by reading large partitions without having to
>> tolerate the promotion. Estimates for the amount of garbage vary, but it
>> could be “gigabytes” of garbage on a very wide partition (see
>> https://issues.apache.org/jira/browse/CASSANDRA-9754 for work in
>> progress to help mitigate that type of pain).
>>
>> - Jeff
>>
>> From: Anishek Agarwal
>> Reply-To: "user@cassandra.apache.org"
>> Date: Tuesday, March 1, 2016 at 11:12 PM
>> To: "user@cassandra.apache.org"
>> Subject: Lot of GC on two nodes out of 7
>>
>> Hello,
>>
>> we have a cassandra cluster of 7 nodes, all of them have the same JVM GC
>> configurations, all our writes /  reads use the TokenAware Policy wrapping
>> a DCAware policy. All nodes are part of same Datacenter.
>>
>> We are seeing that two nodes are having high GC collection times. Then
>> mostly seem to spend time in GC like about 300-600 ms. This also seems to
>> result in higher CPU utilisation on these machines. Other  5 nodes don't
>> have this problem.
>>
>> There is no additional repair activity going on the cluster, we are not
>> sure why this is happening.
>> we checked cfhistograms on the two CF we have in the cluster and number
>> of reads seems to be almost same.
>>
>> we also used cfstats to see the number of ssttables on each node and one
>> of the nodes with the above problem has twice the number of ssttables than
>> other nodes. This still doesnot explain why two nodes have high GC
>> Overheads. our GC config is as below:
>>
>> JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
>>
>> JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
>>
>> JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
>>
>> JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
>>
>> JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=50"
>>
>> JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=70"
>>
>> JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
>>
>> JVM_OPTS="$JVM_OPTS -XX:+UseTLAB"
>>
>> JVM_OPTS="$JVM_OPTS -XX:MaxPermSize=256m"
>>
>> JVM_OPTS="$JVM_OPTS -XX:+AggressiveOpts"
>>
>> JVM_OPTS="$JVM_OPTS -XX:+UseCompressedOops"
>>
>> JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"
>>
>> JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=48"
>>
>> JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=48"
>>
>> JVM_OPTS="$JVM_OPTS -XX:-ExplicitGCInvokesConcurrent"
>>
>> JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
>>
>> JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"
>>
>> JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"
>>
>> # earlier value 131072 = 32768 * 4
>>
>> JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=131072"
>>
>> JVM_OPTS="$JVM_OPTS -XX:CMSScheduleRemarkEdenSizeThreshold=104857600"
>>
>> JVM_OPTS="$JVM_OPTS -XX:CMSRescanMultiple=32768"
>>
>> JVM_OPTS="$JVM_OPTS -XX:CMSConcMarkMultiple=32768"
>>
>> #new
>>
>> JVM_OPTS="$JVM_OPTS -XX:+CMSConcurrentMTEnabled"
>>
>> We are using cassandra 2.0.17. If anyone has any suggestion as to how
>> what else we can look for to understand why this is happening please do
>> reply.
>>
>>
>>
>> Thanks
>> anishek
>>
>>
>>
>

Re: Lot of GC on two nodes out of 7

Posted by Anishek Agarwal <an...@gmail.com>.
Hey Jeff,

One of the nodes with high GC has 1400 SSTables, all other nodes have
about 500-900 SSTables. The other node with high GC has 636 SSTables.
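
(Those counts come from "SSTable count" in nodetool cfstats; counting data
files on disk gives a rough cross-check. The path and keyspace/table names
below are placeholders for the default layout:)

nodetool cfstats | grep "SSTable count"
ls /var/lib/cassandra/data/my_keyspace/my_table/*-Data.db | wc -l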

The average row size for compacted partitions is about 1640 bytes on all
nodes. We have replication factor 3, but the problem is only on two nodes.
The only other thing that stands out in cfstats is that the read and write
times on the nodes with high GC are 5-7 times higher than on the other 5
nodes, but I think that's expected.

thanks
anishek




On Wed, Mar 2, 2016 at 1:09 PM, Jeff Jirsa <je...@crowdstrike.com>
wrote:

> Compaction falling behind will likely cause additional work on reads (more
> sstables to merge), but I’d be surprised if it manifested in super long GC.
> When you say twice as many sstables, how many is that?.
>
> In cfstats, does anything stand out? Is max row size on those nodes larger
> than on other nodes?
>
> What you don’t show in your JVM options is the new gen size – if you do
> have unusually large partitions on those two nodes (especially likely if
> you have rf=2 – if you have rf=3, then there’s probably a third node
> misbehaving you haven’t found yet), then raising new gen size can help
> handle the garbage created by reading large partitions without having to
> tolerate the promotion. Estimates for the amount of garbage vary, but it
> could be “gigabytes” of garbage on a very wide partition (see
> https://issues.apache.org/jira/browse/CASSANDRA-9754 for work in progress
> to help mitigate that type of pain).
>
> - Jeff
>
> From: Anishek Agarwal
> Reply-To: "user@cassandra.apache.org"
> Date: Tuesday, March 1, 2016 at 11:12 PM
> To: "user@cassandra.apache.org"
> Subject: Lot of GC on two nodes out of 7
>
> Hello,
>
> we have a cassandra cluster of 7 nodes, all of them have the same JVM GC
> configurations, all our writes /  reads use the TokenAware Policy wrapping
> a DCAware policy. All nodes are part of same Datacenter.
>
> We are seeing that two nodes are having high GC collection times. Then
> mostly seem to spend time in GC like about 300-600 ms. This also seems to
> result in higher CPU utilisation on these machines. Other  5 nodes don't
> have this problem.
>
> There is no additional repair activity going on the cluster, we are not
> sure why this is happening.
> we checked cfhistograms on the two CF we have in the cluster and number of
> reads seems to be almost same.
>
> we also used cfstats to see the number of ssttables on each node and one
> of the nodes with the above problem has twice the number of ssttables than
> other nodes. This still doesnot explain why two nodes have high GC
> Overheads. our GC config is as below:
>
> JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
>
> JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
>
> JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
>
> JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
>
> JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=50"
>
> JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=70"
>
> JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
>
> JVM_OPTS="$JVM_OPTS -XX:+UseTLAB"
>
> JVM_OPTS="$JVM_OPTS -XX:MaxPermSize=256m"
>
> JVM_OPTS="$JVM_OPTS -XX:+AggressiveOpts"
>
> JVM_OPTS="$JVM_OPTS -XX:+UseCompressedOops"
>
> JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"
>
> JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=48"
>
> JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=48"
>
> JVM_OPTS="$JVM_OPTS -XX:-ExplicitGCInvokesConcurrent"
>
> JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
>
> JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"
>
> JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"
>
> # earlier value 131072 = 32768 * 4
>
> JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=131072"
>
> JVM_OPTS="$JVM_OPTS -XX:CMSScheduleRemarkEdenSizeThreshold=104857600"
>
> JVM_OPTS="$JVM_OPTS -XX:CMSRescanMultiple=32768"
>
> JVM_OPTS="$JVM_OPTS -XX:CMSConcMarkMultiple=32768"
>
> #new
>
> JVM_OPTS="$JVM_OPTS -XX:+CMSConcurrentMTEnabled"
>
> We are using cassandra 2.0.17. If anyone has any suggestion as to how what
> else we can look for to understand why this is happening please do reply.
>
>
>
> Thanks
> anishek
>
>
>

Re: Lot of GC on two nodes out of 7

Posted by Jeff Jirsa <je...@crowdstrike.com>.
Compaction falling behind will likely cause additional work on reads (more sstables to merge), but I’d be surprised if it manifested in super long GC. When you say twice as many sstables, how many is that?

In cfstats, does anything stand out? Is max row size on those nodes larger than on other nodes?

What you don’t show in your JVM options is the new gen size – if you do have unusually large partitions on those two nodes (especially likely if you have rf=2 – if you have rf=3, then there’s probably a third node misbehaving you haven’t found yet), then raising new gen size can help handle the garbage created by reading large partitions without having to tolerate the promotion. Estimates for the amount of garbage vary, but it could be “gigabytes” of garbage on a very wide partition (see https://issues.apache.org/jira/browse/CASSANDRA-9754 for work in progress to help mitigate that type of pain).
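
(One quick way to compare that per node is something along these lines;
my_keyspace / my_table are placeholders:)

nodetool cfstats | grep -i "maximum"
nodetool cfhistograms my_keyspace my_table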

- Jeff 

From:  Anishek Agarwal
Reply-To:  "user@cassandra.apache.org"
Date:  Tuesday, March 1, 2016 at 11:12 PM
To:  "user@cassandra.apache.org"
Subject:  Lot of GC on two nodes out of 7

Hello, 

we have a cassandra cluster of 7 nodes, all of them have the same JVM GC configurations, all our writes /  reads use the TokenAware Policy wrapping a DCAware policy. All nodes are part of same Datacenter.

We are seeing that two nodes are having high GC collection times. Then mostly seem to spend time in GC like about 300-600 ms. This also seems to result in higher CPU utilisation on these machines. Other  5 nodes don't have this problem.

There is no additional repair activity going on the cluster, we are not sure why this is happening. 
we checked cfhistograms on the two CF we have in the cluster and number of reads seems to be almost same. 

we also used cfstats to see the number of ssttables on each node and one of the nodes with the above problem has twice the number of ssttables than other nodes. This still doesnot explain why two nodes have high GC Overheads. our GC config is as below:
JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"

JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"

JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"

JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"

JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=50"

JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=70"

JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"

JVM_OPTS="$JVM_OPTS -XX:+UseTLAB"

JVM_OPTS="$JVM_OPTS -XX:MaxPermSize=256m"

JVM_OPTS="$JVM_OPTS -XX:+AggressiveOpts"

JVM_OPTS="$JVM_OPTS -XX:+UseCompressedOops"

JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"

JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=48"

JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=48"

JVM_OPTS="$JVM_OPTS -XX:-ExplicitGCInvokesConcurrent"

JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"

JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"

JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"

# earlier value 131072 = 32768 * 4

JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=131072"

JVM_OPTS="$JVM_OPTS -XX:CMSScheduleRemarkEdenSizeThreshold=104857600"

JVM_OPTS="$JVM_OPTS -XX:CMSRescanMultiple=32768"

JVM_OPTS="$JVM_OPTS -XX:CMSConcMarkMultiple=32768"

#new 

JVM_OPTS="$JVM_OPTS -XX:+CMSConcurrentMTEnabled"


We are using cassandra 2.0.17. If anyone has any suggestion as to how what else we can look for to understand why this is happening please do reply. 



Thanks
anishek