You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@cassandra.apache.org by Christopher Wirt <ch...@struq.com> on 2013/08/05 21:04:22 UTC

Counters and replication

Hello,

 

Question about counters, replication and the ReplicateOnWriteStage

 

I've recently turned on a new CF which uses a counter column. 

 

We have a three DC setup running Cassandra 1.2.4 with vNodes, hex core
processors, 32Gb memory.

DC 1 - 9 nodes with RF 3

DC 2 - 3 nodes with RF 2 

DC 3 - 3 nodes with RF 2

 

DC 1 one receives most of the updates to this counter column. ~3k per sec.

 

I've disabled any client reads while I sort out this issue.

Disk utilization is very low

Memory is aplenty (while not reading)

Schema:

CREATE TABLE cf1 (

  uid uuid,

  id1 int,

  id2 int,

  id3 int,

  ct counter,

  PRIMARY KEY (uid, id1, id2, id3)

) WITH .

 

Three of the machines in DC 1 are reporting very high CPU load.

Looking at tpstats there is a large number of pending ReplicateOnWriteStage
just on those machines.

 

Why would only three of the machines be reporting this? 

Assuming its distributed by uuid value there should be an even load across
the cluster, yea?

Am I missing something about how distributed counters work?

 

Is changing CL to ONE fine if I'm not too worried about 100% consistency?

 

 

Thanks,

Chris

RE: Counters and replication

Posted by Christopher Wirt <ch...@struq.com>.

Hi Richard,

Thanks for your reply.

 

The uid value is a generated guid and should distribute nicely.

I've just checked the data yesterday there are only 3 uids out of millions
for which there would have been more than 1000 increments.

We started with 256 num_tokens.

Client and server side I can see the writes being balanced.

 

 

Anyway, think I've got things under control now.

 

I appears I hadn't set an sstable size on the cf compaction strategy (LCS).
I guess this was defaulting to 10MB.

 

After setting this to 256MB one of the 'bad' nodes fixed itself. The other
two appeared to stall mid compaction, but after a quick restart both resumed
compaction with acceptable CPU utilization.

 

Any insight into how this caused the issue is welcome. 

 

 

 

Thanks

 

 

 

 

From: Richard Low [mailto:richard@wentnet.com] 
Sent: 05 August 2013 20:30
To: user@cassandra.apache.org
Subject: Re: Counters and replication

 

On 5 August 2013 20:04, Christopher Wirt <ch...@struq.com> wrote:

Hello,

 

Question about counters, replication and the ReplicateOnWriteStage

 

I've recently turned on a new CF which uses a counter column. 

 

We have a three DC setup running Cassandra 1.2.4 with vNodes, hex core
processors, 32Gb memory.

DC 1 - 9 nodes with RF 3

DC 2 - 3 nodes with RF 2 

DC 3 - 3 nodes with RF 2

 

DC 1 one receives most of the updates to this counter column. ~3k per sec.

 

I've disabled any client reads while I sort out this issue.

Disk utilization is very low

Memory is aplenty (while not reading)

Schema:

CREATE TABLE cf1 (

  uid uuid,

  id1 int,

  id2 int,

  id3 int,

  ct counter,

  PRIMARY KEY (uid, id1, id2, id3)

) WITH .

 

Three of the machines in DC 1 are reporting very high CPU load.

Looking at tpstats there is a large number of pending ReplicateOnWriteStage
just on those machines.

 

Why would only three of the machines be reporting this? 

Assuming its distributed by uuid value there should be an even load across
the cluster, yea?

Am I missing something about how distributed counters work?

 

If you have many different uid values and your cluster is balanced then you
should see even load.  Were your tokens chosen randomly?  Did you start out
with num_tokens set high or upgrade from num_tokens=1 or an earlier
Cassandra version?  Is it possible your workload is incrementing the counter
for one particular uid much more than the others?

 

The distribution of counters works the same as for non-counters in terms of
which nodes receive which values.  However, there is a read on the
coordinator (randomly chosen for each inc) to read the current value and
replicate it to the remaining replicas.  This makes counter increments much
more expensive than normal inserts, even if all your counters fit in cache.
This is done in the ReplicateOnWriteStage, which is why you are seeing that
queue build up.

 

Is changing CL to ONE fine if I'm not too worried about 100% consistency?

 

Yes, but to make the biggest difference you will need to turn off
replicate_on_write (alter table cf1 with replicate_on_write = false;) but
this *guarantees* your counts aren't replicated, even if all replicas are
up.  It avoids doing the read, so makes a huge difference to performance,
but means that if a node is unavailable later on, you *will* read
inconsistent counts.  (Or, worse, if a node fails, you will lose counts
forever.)  This is in contrast to CL.ONE inserts for normal values when
inserts are still attempted on all replicas, but only one is required to
succeed.

 

So you might be able to get a temporary performance boost by changing
replicate_on_write if your counter values aren't important.  But this won't
solve the root of the problem.

 

Richard.

Re: Counters and replication

Posted by Andrew Bialecki <an...@gmail.com>.

We've seen high CPU in tests on stress tests with counters. With our
workload, we had some hot counters (e.g. ones with 100s increments/sec)
with RF = 3, which caused the load to spike and replicate on write tasks to
back up on those three nodes. Richard already gave a good overview of why
this happens. As he said, changing the consistency level won't help you.
It'll decrease the write latency because the write will ack once it queues
the replicate on write task, but the node will still queue a task to
replicate the write to the other replicas. As mentioned, the only Cassandra
config fix is setting replicate_on_write to false, but that's definitely
not recommended unless you don't mind losing your counter values if a node
goes down.

Other options which will require work outside of Cassandra:

1. Partition hot counters and write to each randomly and then aggregate
them together at read-time. This is basically the same trick as writing to
a hot time series.
http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra
2. Absorb and aggregate increments and only increment in Cassandra every so
often. For instance, if a counter needs to incremented 100 times/sec,
increment an in-memory counter and then "flush" those increments at once by
issuing one increment/sec that has the sum of all the aggregates for that
time period. I believe Twitter does something like this (
http://www.slideshare.net/kevinweil/rainbird-realtime-analytics-at-twitter-strata-2011slide
26).
3. Faster disks.

All that said, you say you're seeing low disk utilization, which is
inconsistent with what what we saw. In our tests, we saw ~100% disk
utilization on the nodes for the hot counters, which made it easy to
determine what was going on. If disk isn't your bottleneck, then you
probably have a different issue.

On Mon, Aug 5, 2013 at 3:30 PM, Richard Low <ri...@wentnet.com> wrote:

> On 5 August 2013 20:04, Christopher Wirt <ch...@struq.com> wrote:
>
>> Hello,****
>>
>> ** **
>>
>> Question about counters, replication and the ReplicateOnWriteStage****
>>
>> ** **
>>
>> I’ve recently turned on a new CF which uses a counter column. ****
>>
>> ** **
>>
>> We have a three DC setup running Cassandra 1.2.4 with vNodes, hex core
>> processors, 32Gb memory.****
>>
>> DC 1 - 9 nodes with RF 3****
>>
>> DC 2 - 3 nodes with RF 2 ****
>>
>> DC 3 - 3 nodes with RF 2****
>>
>> ** **
>>
>> DC 1 one receives most of the updates to this counter column. ~3k per sec.
>> ****
>>
>> ** **
>>
>> I’ve disabled any client reads while I sort out this issue.****
>>
>> Disk utilization is very low****
>>
>> Memory is aplenty (while not reading)****
>>
>> Schema:****
>>
>> CREATE TABLE cf1 (****
>>
>>   uid uuid,****
>>
>>   id1 int,****
>>
>>   id2 int,****
>>
>>   id3 int,****
>>
>>   ct counter,****
>>
>>   PRIMARY KEY (uid, id1, id2, id3)****
>>
>> ) WITH …****
>>
>> ** **
>>
>> Three of the machines in DC 1 are reporting very high CPU load.****
>>
>> Looking at tpstats there is a large number of pending
>> ReplicateOnWriteStage just on those machines.****
>>
>> ** **
>>
>> Why would only three of the machines be reporting this? ****
>>
>> Assuming its distributed by uuid value there should be an even load
>> across the cluster, yea?****
>>
>> Am I missing something about how distributed counters work?
>>
>
> If you have many different uid values and your cluster is balanced then
> you should see even load.  Were your tokens chosen randomly?  Did you start
> out with num_tokens set high or upgrade from num_tokens=1 or an earlier
> Cassandra version?  Is it possible your workload is incrementing the
> counter for one particular uid much more than the others?
>
> The distribution of counters works the same as for non-counters in terms
> of which nodes receive which values.  However, there is a read on the
> coordinator (randomly chosen for each inc) to read the current value and
> replicate it to the remaining replicas.  This makes counter increments much
> more expensive than normal inserts, even if all your counters fit in cache.
>  This is done in the ReplicateOnWriteStage, which is why you are seeing
> that queue build up.
>
>
>> **
>>
>> Is changing CL to ONE fine if I’m not too worried about 100% consistency?
>>
>
> Yes, but to make the biggest difference you will need to turn off
> replicate_on_write (alter table cf1 with replicate_on_write = false;) but
> this *guarantees* your counts aren't replicated, even if all replicas are
> up.  It avoids doing the read, so makes a huge difference to performance,
> but means that if a node is unavailable later on, you *will* read
> inconsistent counts.  (Or, worse, if a node fails, you will lose counts
> forever.)  This is in contrast to CL.ONE inserts for normal values when
> inserts are still attempted on all replicas, but only one is required to
> succeed.
>
> So you might be able to get a temporary performance boost by changing
> replicate_on_write if your counter values aren't important.  But this won't
> solve the root of the problem.
>
> Richard.
>

Re: Counters and replication

Posted by Richard Low <ri...@wentnet.com>.

On 5 August 2013 20:04, Christopher Wirt <ch...@struq.com> wrote:

> Hello,****
>
> ** **
>
> Question about counters, replication and the ReplicateOnWriteStage****
>
> ** **
>
> I’ve recently turned on a new CF which uses a counter column. ****
>
> ** **
>
> We have a three DC setup running Cassandra 1.2.4 with vNodes, hex core
> processors, 32Gb memory.****
>
> DC 1 - 9 nodes with RF 3****
>
> DC 2 - 3 nodes with RF 2 ****
>
> DC 3 - 3 nodes with RF 2****
>
> ** **
>
> DC 1 one receives most of the updates to this counter column. ~3k per sec.
> ****
>
> ** **
>
> I’ve disabled any client reads while I sort out this issue.****
>
> Disk utilization is very low****
>
> Memory is aplenty (while not reading)****
>
> Schema:****
>
> CREATE TABLE cf1 (****
>
>   uid uuid,****
>
>   id1 int,****
>
>   id2 int,****
>
>   id3 int,****
>
>   ct counter,****
>
>   PRIMARY KEY (uid, id1, id2, id3)****
>
> ) WITH …****
>
> ** **
>
> Three of the machines in DC 1 are reporting very high CPU load.****
>
> Looking at tpstats there is a large number of pending
> ReplicateOnWriteStage just on those machines.****
>
> ** **
>
> Why would only three of the machines be reporting this? ****
>
> Assuming its distributed by uuid value there should be an even load across
> the cluster, yea?****
>
> Am I missing something about how distributed counters work?
>

If you have many different uid values and your cluster is balanced then you
should see even load.  Were your tokens chosen randomly?  Did you start out
with num_tokens set high or upgrade from num_tokens=1 or an earlier
Cassandra version?  Is it possible your workload is incrementing the
counter for one particular uid much more than the others?

The distribution of counters works the same as for non-counters in terms of
which nodes receive which values.  However, there is a read on the
coordinator (randomly chosen for each inc) to read the current value and
replicate it to the remaining replicas.  This makes counter increments much
more expensive than normal inserts, even if all your counters fit in cache.
 This is done in the ReplicateOnWriteStage, which is why you are seeing
that queue build up.

> **
>
> Is changing CL to ONE fine if I’m not too worried about 100% consistency?
>

Yes, but to make the biggest difference you will need to turn off
replicate_on_write (alter table cf1 with replicate_on_write = false;) but
this *guarantees* your counts aren't replicated, even if all replicas are
up.  It avoids doing the read, so makes a huge difference to performance,
but means that if a node is unavailable later on, you *will* read
inconsistent counts.  (Or, worse, if a node fails, you will lose counts
forever.)  This is in contrast to CL.ONE inserts for normal values when
inserts are still attempted on all replicas, but only one is required to
succeed.

So you might be able to get a temporary performance boost by changing
replicate_on_write if your counter values aren't important.  But this won't
solve the root of the problem.

Richard.