
Posted to user@cassandra.apache.org by shalom sagges <sh...@gmail.com> on 2018/04/18 16:37:18 UTC

Dropped Mutations

Hi All,

I have a 44 node cluster (22 nodes on each DC).
Each node has 24 cores and 130 GB RAM, 3 TB HDDs.
Version 2.0.14 (soon to be upgraded)
~10K writes per second per node.
Heap size: 8 GB max, 2.4 GB newgen

I deployed Reaper and GC activity started to increase rapidly. I'm not sure
if it's because there was a lot of inconsistency in the data, but I decided
to increase the heap to 16 GB and the new gen to 6 GB. I also increased the
max tenuring threshold from 1 to 5.

I tested on a canary node and everything was fine but when I changed the
entire DC, I suddenly saw a lot of dropped mutations in the logs on most of
the nodes. (Reaper was not running on the cluster yet but a manual repair
was running).

Can the heap increase cause lots of dropped mutations?
When is a mutation considered dropped? Is it during flush? Is it during
the write to the commit log or memtable?

Thanks!

Re: Dropped Mutations

Posted by Shalom Sagges <sh...@liveperson.com>.

Thanks a lot Hitesh!

I'll try to re-tune the heap to a lower size.


Shalom Sagges
DBA


On Thu, Apr 19, 2018 at 12:42 AM, hitesh dua <hi...@gmail.com> wrote:

> Hi,
>
> I recommend tuning your heap size further (preferably lower), as a large
> heap can lead to long garbage collection pauses, also known as
> stop-the-world events. A pause occurs when a region of memory is full and
> the JVM needs to make space to continue. During a pause all operations are
> suspended. Because a pause affects networking, the node can appear down to
> other nodes in the cluster. Additionally, any SELECT and INSERT statements
> will wait, which increases read and write latencies.
>
> Any pause of more than a second, or multiple pauses within a second that
> add up to a large fraction of that second, should be avoided. The basic
> cause of the problem is that data is stored in memory faster than it can
> be removed.
>
> MUTATION: A mutation is counted as dropped when a replica gets to a write
> message only after its timeout (write_request_timeout_in_ms) has passed. By
> then the coordinator has either sent a failure to the client or the write
> already met its requested consistency level; in the latter case the cluster
> relies on hinted handoff and read repair to apply the missed mutation.
>
> Another possible cause of the issue could be your HDDs, as they too could
> be a bottleneck.
>
> *MAX_HEAP_SIZE*
> The recommended maximum heap size depends on which GC is used:
>
> Hardware setup                                            | Recommended MAX_HEAP_SIZE
> Older computers                                           | Typically 8 GB
> CMS for newer computers (8+ cores) with up to 256 GB RAM  | No more than 14 GB
>
>
> Thanks,
> Hitesh dua
> hiteshdua1@gmail.com


Re: Dropped Mutations

Posted by hitesh dua <hi...@gmail.com>.

Hi,

I recommend tuning your heap size further (preferably lower), as a large
heap can lead to long garbage collection pauses, also known as
stop-the-world events. A pause occurs when a region of memory is full and
the JVM needs to make space to continue. During a pause all operations are
suspended. Because a pause affects networking, the node can appear down to
other nodes in the cluster. Additionally, any SELECT and INSERT statements
will wait, which increases read and write latencies.

Any pause of more than a second, or multiple pauses within a second that
add up to a large fraction of that second, should be avoided. The basic
cause of the problem is that data is stored in memory faster than it can
be removed.
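
As a rough illustration of that rule of thumb, here is a minimal Python
sketch. It assumes the GC pauses have already been extracted from the GC log
as (start_seconds, duration_seconds) pairs. The function name, the 1-second
limit and the 0.5 "large fraction" threshold are illustrative assumptions,
not Cassandra or JVM settings.

```python
import math
from collections import defaultdict


def flag_gc_pauses(pauses, max_pause_s=1.0, max_fraction=0.5):
    """Flag problematic GC pauses.

    pauses: iterable of (start_s, duration_s) tuples (assumed input format).
    Returns (long_pauses, busy_seconds):
      long_pauses  - pauses longer than max_pause_s
      busy_seconds - wall-clock seconds in which the total paused time
                     reached max_fraction of that second
    """
    pauses = list(pauses)
    long_pauses = [(s, d) for s, d in pauses if d > max_pause_s]

    # Spread each pause over the wall-clock seconds it overlaps.
    paused_per_second = defaultdict(float)
    for start, duration in pauses:
        end = start + duration
        t = start
        while t < end:
            second = math.floor(t)
            chunk_end = min(end, second + 1)
            paused_per_second[second] += chunk_end - t
            t = chunk_end

    busy_seconds = sorted(
        sec for sec, total in paused_per_second.items() if total >= max_fraction
    )
    return long_pauses, busy_seconds


if __name__ == "__main__":
    sample = [(10.0, 0.2), (10.3, 0.2), (10.6, 0.2), (20.0, 1.75), (30.0, 0.1)]
    long_pauses, busy_seconds = flag_gc_pauses(sample)
    print("pauses over 1 s:", long_pauses)
    print("seconds mostly spent in GC:", busy_seconds)
```

In the sample, three short pauses inside second 10 together count as a
problem even though none of them is long on its own.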

MUTATION: A mutation is counted as dropped when a replica gets to a write
message only after its timeout (write_request_timeout_in_ms) has passed. By
then the coordinator has either sent a failure to the client or the write
already met its requested consistency level; in the latter case the cluster
relies on hinted handoff and read repair to apply the missed mutation.
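
The timing rule can be illustrated with a toy Python model. This is not
Cassandra's code: Mutation, process_queue and the 2000 ms timeout are
hypothetical names and an assumed value chosen for illustration (use the
write_request_timeout_in_ms from your own cassandra.yaml). It only shows how
a stall, such as a long GC pause, can let queued writes age past the timeout
before they are processed.

```python
from dataclasses import dataclass

WRITE_REQUEST_TIMEOUT_MS = 2000  # assumed value, for illustration only


@dataclass
class Mutation:
    key: str
    enqueued_at_ms: int  # when the write message arrived at the replica


def process_queue(queue, now_ms, service_time_ms,
                  timeout_ms=WRITE_REQUEST_TIMEOUT_MS):
    """Process queued mutations in order, starting at now_ms.

    A mutation that has waited longer than timeout_ms by the time it is
    reached is dropped instead of applied.
    Returns (applied_keys, dropped_keys).
    """
    applied, dropped = [], []
    clock = now_ms
    for m in queue:
        waited = clock - m.enqueued_at_ms
        if waited > timeout_ms:
            dropped.append(m.key)  # too old: skipped, costs no time here
        else:
            applied.append(m.key)
            clock += service_time_ms  # applying a write takes some time
    return applied, dropped


if __name__ == "__main__":
    # Writes arrive while the node is stalled; processing resumes at 3000 ms.
    backlog = [
        Mutation("a", 0),
        Mutation("b", 500),
        Mutation("c", 1500),
        Mutation("d", 2500),
    ]
    applied, dropped = process_queue(backlog, now_ms=3000, service_time_ms=1)
    print("applied:", applied)
    print("dropped:", dropped)
```

In this model the two oldest writes are dropped because they waited more
than the timeout during the stall, while the newer ones are still applied.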

Another possible cause of the issue could be your HDDs, as they too could be
a bottleneck.

*MAX_HEAP_SIZE*
The recommended maximum heap size depends on which GC is used:

Hardware setup                                            | Recommended MAX_HEAP_SIZE
Older computers                                           | Typically 8 GB
CMS for newer computers (8+ cores) with up to 256 GB RAM  | No more than 14 GB

Thanks,
Hitesh dua
hiteshdua1@gmail.com
