Posted to user@cassandra.apache.org by Alexander Shutyaev <sh...@gmail.com> on 2013/12/19 13:12:13 UTC

MUTATION messages dropped

Hi all!

We've had a problem with Cassandra recently. There were two one-minute
periods when we got a lot of timeouts on the client side (the only timeouts
in the 9 days we have been running Cassandra in production). In the logs we
found corresponding messages about dropped MUTATION messages.

Now, the official FAQ [1] says that this is an indicator that the load is
too high. We checked our monitoring and found that the 1-minute average
CPU load had a local peak at the time of the problem, but it was about 0.8
against the usual 0.2, which I guess is nothing for a 2-core virtual
machine. We also checked the Java threads: there was no peak there, and
their count was reasonable at ~240-250.

Can anyone give us a hint: what should we monitor to see this "high load",
and what should we tune to make it acceptable?

Thanks in advance,
Alexander

[1] http://wiki.apache.org/cassandra/FAQ#dropped_messages

Re: MUTATION messages dropped

Posted by Aaron Morton <aa...@thelastpickle.com>.
> I ended up changing memtable_flush_queue_size to be large enough to contain the biggest flood I saw.
As part of the flush process the "switch lock" is taken to synchronise around the commit log. This is a reentrant read/write lock: the flush path takes the write lock and the write path takes the read lock. When flushing a CF, the write lock is taken, the commit log is updated, and the memtable is added to the flush queue. If the queue is full, the write lock stays held, blocking the write threads from taking the read lock.
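
In code terms, the interaction looks roughly like this. This is a simplified toy model using java.util.concurrent, not Cassandra's actual classes; the names (FlushModel, flushQueue) are invented for illustration:

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    // Toy model of the switch lock, for illustration only.
    public class FlushModel {
        private final ReentrantReadWriteLock switchLock = new ReentrantReadWriteLock();
        // Bounded queue standing in for memtable_flush_queue_size.
        private final BlockingQueue<String> flushQueue = new ArrayBlockingQueue<>(4);

        // Write path: takes the read lock, so many writers proceed concurrently.
        public void write(String mutation) {
            switchLock.readLock().lock();
            try {
                // ... append to commit log, apply to memtable ...
            } finally {
                switchLock.readLock().unlock();
            }
        }

        // Flush path: takes the write lock, excluding all writers.
        public void flush(String memtable) throws InterruptedException {
            switchLock.writeLock().lock();
            try {
                // If the queue is full, put() blocks *while the write lock
                // is held*, so every write() stalls until a flush drains it.
                flushQueue.put(memtable);
            } finally {
                switchLock.writeLock().unlock();
            }
        }
    }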

There are a few reasons why the queue may be full. The simple one is that the disk IO is not fast enough. Others are that the commit log segments are too small, there are lots of CFs and/or lots of secondary indexes, or nodetool flush is called frequently.

Increasing the size of the queue is a good workaround, and the correct approach if you have a lot of CFs and/or secondary indexes.

Hope that helps.


-----------------
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: MUTATION messages dropped

Posted by Ken Hancock <ke...@schange.com>.
I ended up changing memtable_flush_queue_size to be large enough to contain
the biggest flood I saw.

I monitored tpstats over time using a collection script and an analysis
script that I wrote to figure out what my largest peaks were. In my case,
all my mutation drops correlated with hitting the maximum
memtable_flush_queue_size, and the drops stopped as soon as the queue size
fell below the max.

I threw the scripts up on github in case they're useful...

https://github.com/hancockks/tpstats
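
(Not those scripts, but for a feel of the approach: below is a minimal Java
sketch that polls the FlushWriter pool and the dropped-MUTATION counter over
JMX and emits CSV for later correlation. The MBean and attribute names are
my assumption about what nodetool tpstats reads; verify them against your
version with jconsole before relying on this.)

    import java.util.Date;
    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    // Hypothetical poller, not the scripts linked above. Samples the
    // FlushWriter pool and the dropped-MUTATION counter every 5 seconds.
    public class TpstatsPoller {
        public static void main(String[] args) throws Exception {
            String host = args.length > 0 ? args[0] : "localhost";
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://" + host + ":7199/jmxrmi");
            try (JMXConnector jmxc = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
                // Assumed MBean names; check them in jconsole.
                ObjectName flushWriter = new ObjectName(
                        "org.apache.cassandra.internal:type=FlushWriter");
                ObjectName droppedMutations = new ObjectName(
                        "org.apache.cassandra.metrics:type=DroppedMessage,scope=MUTATION,name=Dropped");
                System.out.println("timestamp,pending,currently_blocked,all_time_blocked,mutations_dropped");
                while (true) {
                    System.out.printf("%tF %<tT,%s,%s,%s,%s%n",
                            new Date(),
                            mbs.getAttribute(flushWriter, "PendingTasks"),
                            mbs.getAttribute(flushWriter, "CurrentlyBlockedTasks"),
                            mbs.getAttribute(flushWriter, "TotalBlockedTasks"),
                            mbs.getAttribute(droppedMutations, "Count"));
                    Thread.sleep(5000);
                }
            }
        }
    }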





-- 
*Ken Hancock* | System Architect, Advanced Advertising
SeaChange International
50 Nagog Park
Acton, Massachusetts 01720
ken.hancock@schange.com | www.schange.com | NASDAQ:SEAC
Office: +1 (978) 889-3329 | Skype: hancockks | Yahoo IM: hancockks
LinkedIn: http://www.linkedin.com/in/kenhancock

Re: MUTATION messages dropped

Posted by Jared Biel <ja...@bolderthinking.com>.
I can't comment on your specific issue, but I don't know if running 2.0.0
in production is a good idea. At the very least I'd try upgrading to the
latest 2.0.x (currently 2.0.3).

https://engineering.eventbrite.com/what-version-of-cassandra-should-i-run/


Re: MUTATION messages dropped

Posted by Alexander Shutyaev <sh...@gmail.com>.
Thanks for your answers.

*srmore*,

We are using v2.0.0. As for GC, I guess it does not correlate in our case,
because we had Cassandra running for 9 days under production load with no
dropped messages, and I guess that during this time there were a lot of GCs.

*Ken*,

I've checked the values you indicated. Here they are:

node1     6498
node2     6476
node3     6642

I guess this is not good :) What can we do to fix this problem?



Re: MUTATION messages dropped

Posted by Ken Hancock <ke...@schange.com>.
We had issues where the flushes of a number of column families would align
and then block writes for a very brief period. If that happened when a
bunch of writes came in, we'd see a spike in MUTATION drops.

Check nodetool tpstats for FlushWriter all time blocked.
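
If you'd rather read that counter programmatically than parse nodetool
output, something like this should work. A sketch only: the MBean and
attribute names are assumptions about what tpstats reports, so confirm
them in jconsole first.

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    // One-shot check of the FlushWriter pool, roughly the
    // "FlushWriter ... All time blocked" line from nodetool tpstats.
    public class FlushWriterCheck {
        public static void main(String[] args) throws Exception {
            String host = args.length > 0 ? args[0] : "localhost";
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://" + host + ":7199/jmxrmi");
            try (JMXConnector jmxc = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
                ObjectName fw = new ObjectName(
                        "org.apache.cassandra.internal:type=FlushWriter");
                System.out.println("Pending:           " + mbs.getAttribute(fw, "PendingTasks"));
                System.out.println("Currently blocked: " + mbs.getAttribute(fw, "CurrentlyBlockedTasks"));
                System.out.println("All time blocked:  " + mbs.getAttribute(fw, "TotalBlockedTasks"));
            }
        }
    }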



Re: MUTATION messages dropped

Posted by srmore <co...@gmail.com>.
What version of Cassandra are you running? I used to see these a lot with
1.2.9. I could correlate the dropped messages with the heap usage almost
every time, so check in the logs whether you are getting GC'd. In this
respect 1.2.12 appears to be more stable; moving to 1.2.12 took care of
this for us.
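
A quick way to eyeball the GC side is to sample the standard java.lang
GarbageCollector MBeans over Cassandra's JMX port. A minimal sketch (these
MBeans are standard JVM ones, not Cassandra-specific; sample twice and diff
the counters around an incident):

    import java.util.Set;
    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    // Dumps cumulative GC counts and times from a Cassandra node using
    // the standard java.lang GarbageCollector MBeans.
    public class GcCheck {
        public static void main(String[] args) throws Exception {
            String host = args.length > 0 ? args[0] : "localhost";
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://" + host + ":7199/jmxrmi");
            try (JMXConnector jmxc = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
                Set<ObjectName> gcs = mbs.queryNames(
                        new ObjectName("java.lang:type=GarbageCollector,*"), null);
                for (ObjectName gc : gcs) {
                    System.out.printf("%s: collections=%s, total_time_ms=%s%n",
                            gc.getKeyProperty("name"),
                            mbs.getAttribute(gc, "CollectionCount"),
                            mbs.getAttribute(gc, "CollectionTime"));
                }
            }
        }
    }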

Thanks,
Sandeep

