Posted to user@cassandra.apache.org by Robert Wille <rw...@fold3.com> on 2015/06/13 02:52:21 UTC

Dropped mutation messages

I am preparing to migrate a large amount of data to Cassandra. In order to test my migration code, I’ve been doing some dry runs to a test cluster. My test cluster is 2.0.15, 3 nodes, RF=1 and CL=QUORUM. I know RF=1 and CL=QUORUM is a weird combination, but my production cluster that will eventually receive this data is RF=3. I am running with RF=1 so it’s faster while I work out the kinks in the migration.
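
As a quick check on the arithmetic behind that combination: a quorum is floor(RF / 2) + 1 replicas, so QUORUM waits for one acknowledgement at RF=1 and two at RF=3. A minimal sketch (the function name is only for illustration):

```python
def quorum(replication_factor: int) -> int:
    """Number of replica acknowledgements a QUORUM operation waits for."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return replication_factor // 2 + 1


for rf in (1, 3):
    print(f"RF={rf}: QUORUM needs {quorum(rf)} replica response(s)")
```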

A few things have puzzled me after writing several tens of millions of records to my test cluster.

My main concern is that I have a few tens of thousands of dropped mutation messages. I’m overloading my cluster. I never have more than about 10% CPU utilization (even my I/O wait is negligible). A curious thing about that is that the driver hasn’t thrown any exceptions, even though mutations have been dropped. I’ve seen dropped mutation messages on my production cluster, but like this, I’ve never gotten errors back from the client. I had always assumed that one node dropped mutation messages, but the other two did not, and so quorum was satisfied. With RF=1, I don’t understand how mutation messages are being dropped and the client doesn’t tell me about it. Does this mean my cluster is missing data, and I have no idea?

Each node has a couple dozen all-time blocked FlushWriters. Is that bad?

I have around 100 dropped counter mutations, which is very weird because I don’t write any counters. I have counters in my schema for tracking view counts, but the migration code doesn’t write them. How could I get dropped counter mutation messages when I don’t modify them?
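
The counters mentioned above (dropped MUTATION and COUNTER_MUTATION messages, all-time blocked FlushWriter tasks) are the ones reported by `nodetool tpstats`. A rough sketch of pulling them out of that output follows. The sample text and its numbers are made up for illustration and only approximate the 2.0-era column layout; the helper name is hypothetical.

```python
# Illustrative sample only: the layout approximates `nodetool tpstats`
# output and the numbers are invented, not taken from the thread.
SAMPLE_TPSTATS = """\
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
MutationStage                     0         0       48213377         0                 0
FlushWriter                       0         0           1532         0                24

Message type           Dropped
MUTATION                 31250
COUNTER_MUTATION           100
READ                         0
"""


def parse_tpstats(text):
    """Return (all_time_blocked_by_pool, dropped_by_message_type)."""
    blocked, dropped = {}, {}
    section = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if line.startswith("Pool Name"):
            section = "pools"
        elif line.startswith("Message type"):
            section = "dropped"
        elif section == "pools" and len(fields) >= 6:
            # The last column is "All time blocked".
            blocked[fields[0]] = int(fields[-1])
        elif section == "dropped" and len(fields) == 2:
            dropped[fields[0]] = int(fields[1])
    return blocked, dropped


blocked, dropped = parse_tpstats(SAMPLE_TPSTATS)
print("FlushWriter all-time blocked:", blocked.get("FlushWriter"))
print("Dropped MUTATION:", dropped.get("MUTATION"))
print("Dropped COUNTER_MUTATION:", dropped.get("COUNTER_MUTATION"))
```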

Any insights would be appreciated. Thanks in advance.

Robert

Re: Dropped mutation messages

Posted by Robert Wille <rw...@fold3.com>.

My primary concern isn’t so much that I am dropping mutation messages, but the fact that I don’t get errors when it happens. According to my understanding, a dropped mutation message should always result in an error thrown back to the client when RF=1.

I looked at my logs to see what was going on when the dropped mutation messages occurred. Two nodes in my cluster dropped a bunch of messages at roughly the same time, and a third node dropped messages about two minutes later. No GC messages in the logs at that time on any node. Nothing else of interest in the logs around that time.

The logs showed the dropped mutation messages to be much older than I assumed they would be. I assumed they would be from when I was doing heavy migration, but they are from when I first started using the cluster.

For unknown reasons, my writes spiked heavily at the time the messages were dropped:

[inline image showing the write spike; not preserved in this plain-text version]

My data migration tool is heavily threaded, and perhaps there was a bug in the code that limits the concurrency, and I fixed it without realizing it. I really have no reasonable explanation for that spike. Whatever the cause, the spike caused a lot of GC, but not enough to produce any GC events in the logs. My guess is that there were so many writes that they simply took too long and timed out. But again, it is very disturbing that the client never knew about it.

The good news is that I haven’t dropped mutation messages since I began migrating data in earnest, and I’ve imported close to a billion records so far. I should probably crank up the threading on my migration code to force errors on the cluster and see what happens. If I can reproduce the dropped messages without getting a timeout in the client, then perhaps I can file a jira.

Thanks to those that responded, and hopefully the information from my logs and OpsCenter will help people that are following this thread, or that may stumble across it in the future.

Robert

On Jun 13, 2015, at 12:09 PM, Anuj Wadehra <an...@yahoo.co.in> wrote:

You said RF=1; I missed that. So I'm not sure eventual consistency is what is causing the issue.

Thanks
Anuj Wadehra

________________________________
From: "Anuj Wadehra" <an...@yahoo.co.in>
Date: Sat, 13 Jun, 2015 at 11:31 pm
Subject: Re: Dropped mutation messages

I think the dropped messages are the asynchronous ones required to maintain eventual consistency. The client may not be complaining because the data gets committed to one node synchronously but is dropped when sent to the other nodes asynchronously.

We resolved a similar issue in our cluster by increasing memtable_flush_writers from 1 to 3 (we were writing to multiple column families simultaneously).

We also fixed GC issues and reduced the total memtable space (memtable_total_space_in_mb) so that most memtables are flushed early under heavy write loads.

Thanks
Anuj Wadehra

________________________________
From: "Robert Wille" <rw...@fold3.com>
Date: Sat, 13 Jun, 2015 at 8:29 pm
Subject: Re: Dropped mutation messages

> Internode messages that are received by a node but are not processed within rpc_timeout are dropped rather than processed, because the coordinator node will no longer be waiting for a response. If the coordinator node does not receive enough responses to satisfy the consistency level before rpc_timeout expires, it returns a TimedOutException to the client.

I understand that, but that’s where this makes no sense. I’m running with RF=1, and CL=QUORUM, which means each update goes to one node, and I need one response for a success. I have many thousands of dropped mutation messages, but no TimedOutExceptions thrown back to the client. If I have GC problems, or other issues that are making my cluster unresponsive, I can deal with that. But having writes that fail and no error is clearly not acceptable. How is it possible to be getting errors and not be informed about them?

Thanks

Robert

Re: Dropped mutation messages

Posted by Anuj Wadehra <an...@yahoo.co.in>.

You said RF=1; I missed that. So I'm not sure eventual consistency is what is causing the issue.

Thanks

Anuj Wadehra

Re: Dropped mutation messages

Posted by Anuj Wadehra <an...@yahoo.co.in>.

We resolved a similar issue in our cluster by increasing memtable_flush_writers from 1 to 3 (we were writing to multiple column families simultaneously).

We also fixed GC issues and reduced the total memtable space (memtable_total_space_in_mb) so that most memtables are flushed early under heavy write loads.

Thanks

Anuj Wadehra

Re: Dropped mutation messages

Posted by Robert Wille <rw...@fold3.com>.

> Internode messages that are received by a node but are not processed within rpc_timeout are dropped rather than processed, because the coordinator node will no longer be waiting for a response. If the coordinator node does not receive enough responses to satisfy the consistency level before rpc_timeout expires, it returns a TimedOutException to the client.

I understand that, but that’s where this makes no sense. I’m running with RF=1, and CL=QUORUM, which means each update goes to one node, and I need one response for a success. I have many thousands of dropped mutation messages, but no TimedOutExceptions thrown back to the client. If I have GC problems, or other issues that are making my cluster unresponsive, I can deal with that. But having writes that fail and no error is clearly not acceptable. How is it possible to be getting errors and not be informed about them?

Thanks

Robert
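
The behaviour described in the quoted explanation can be sketched as a toy model. This is not Cassandra's code: the class and function names, the timeout value, and the fixed queueing delay are all assumptions made for illustration. A replica drops a mutation that has waited longer than the timeout, and the coordinator reports a timeout when it collects fewer acknowledgements than the consistency level requires.

```python
from dataclasses import dataclass

RPC_TIMEOUT_MS = 10_000  # assumed value for the sketch


class TimedOut(Exception):
    """Stands in for the TimedOutException returned to the client."""


@dataclass
class Replica:
    queue_delay_ms: int  # how long a mutation waits before being handled
    dropped: int = 0
    applied: int = 0

    def handle(self) -> bool:
        """Apply the mutation, or drop it if it waited past the timeout."""
        if self.queue_delay_ms > RPC_TIMEOUT_MS:
            self.dropped += 1
            return False
        self.applied += 1
        return True


def coordinate_write(replicas, required_acks: int) -> None:
    """Send to every replica; raise TimedOut if too few acknowledge."""
    acks = sum(1 for replica in replicas if replica.handle())
    if acks < required_acks:
        raise TimedOut(f"{acks} of {required_acks} required responses")


# RF=1 with QUORUM: one replica, one required acknowledgement.
slow = Replica(queue_delay_ms=12_000)
try:
    coordinate_write([slow], required_acks=1)
    outcome = "success"
except TimedOut:
    outcome = "timed out"
print(outcome, "dropped:", slow.dropped)
```

In this model a drop at RF=1 always surfaces as a timeout, which is the expectation stated in the message above. The thread does not establish why the real cluster behaved differently.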

Re: Dropped mutation messages

Posted by Robert Wille <rw...@fold3.com>.

I meant to say I’m *not* overloading my cluster.

On Jun 12, 2015, at 6:52 PM, Robert Wille <rw...@fold3.com> wrote:

> I am preparing to migrate a large amount of data to Cassandra. In order to test my migration code, I’ve been doing some dry runs to a test cluster. My test cluster is 2.0.15, 3 nodes, RF=1 and CL=QUORUM. I know RF=1 and CL=QUORUM is a weird combination, but my production cluster that will eventually receive this data is RF=3. I am running with RF=1 so it’s faster while I work out the kinks in the migration.
> 
> A few things have puzzled me after writing several tens of millions of records to my test cluster.
> 
> My main concern is that I have a few tens of thousands of dropped mutation messages. I’m overloading my cluster. I never have more than about 10% CPU utilization (even my I/O wait is negligible). A curious thing about that is that the driver hasn’t thrown any exceptions, even though mutations have been dropped. I’ve seen dropped mutation messages on my production cluster, but like this, I’ve never gotten errors back from the client. I had always assumed that one node dropped mutation messages, but the other two did not, and so quorum was satisfied. With RF=1, I don’t understand how mutation messages are being dropped and the client doesn’t tell me about it. Does this mean my cluster is missing data, and I have no idea?
> 
> Each node has a couple dozen all-time blocked FlushWriters. Is that bad?
> 
> I have around 100 dropped counter mutations, which is very weird because I don’t write any counters. I have counters in my schema for tracking view counts, but the migration code doesn’t write them. How could I get dropped counter mutation messages when I don’t modify them?
> 
> Any insights would be appreciated. Thanks in advance.
> 
> Robert
>