Posted to user@cassandra.apache.org by Jeremy Hanna <je...@gmail.com> on 2011/08/23 08:40:20 UTC

4/20 nodes get disproportionate amount of mutations

We've recently been having issues where, as soon as we start doing heavy writes (via Hadoop), it really hammers 4 of our 20 nodes.  We're using RandomPartitioner, and we've set the initial tokens for our 20 nodes according to the general spacing formula, except for a few token offsets where we've replaced dead nodes.
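
For reference, the "general spacing formula" mentioned above is just an even split of RandomPartitioner's 0..2**127 token range across the node count; a minimal sketch in Python (the node count is this cluster's, everything else is generic):

num_nodes = 20
ring_size = 2 ** 127  # RandomPartitioner token space
tokens = [i * ring_size // num_nodes for i in range(num_nodes)]
for i, token in enumerate(tokens):
    print("node %2d  initial_token: %d" % (i, token))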

When I say hammers: looking at nodetool tpstats, those 4 nodes have completed something like 70 million mutation stage events, whereas the rest of the cluster has completed 2-20 million.  On those 4 nodes, the logs show evidence of the mutation stage backing up and a lot of read repair message drops.  It also looks like quite a bit of flushing is going on, and consequently auto minor compactions.

We are running 0.7.8 and have about 34 column families (counting secondary indexes as column families), so we can't set the memtable throughput in MB very high for each of them.  We would like to upgrade to 0.8.4 (not least because of JAMM), but it seems that something else is going on with our cluster if we are using RP and balanced initial tokens and still have 4 hot nodes.
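
As a rough worked example of why the CF count matters (the per-CF throughput value below is assumed, not from the thread; in 0.7 this setting tracks serialized size, so actual heap use per memtable is typically a multiple of it, which is part of why JAMM in 0.8+ helps):

column_families = 34
memtable_throughput_mb = 64  # hypothetical per-CF setting
total_mb = column_families * memtable_throughput_mb
print("%d MB of serialized memtable space, before object overhead" % total_mb)  # 2176 MB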

Do these symptoms and this context sound familiar to anyone?  Does anyone have suggestions on how to address this kind of disproportionate write load?

Thanks,

Jeremy

Re: 4/20 nodes get disproportionate amount of mutations

Posted by Jeremy Hanna <je...@gmail.com>.
I just mean that when it tries to put a replica on another "rack" (which is part of what the replication strategy does, so that data survives a whole rack going down), it looks to the next token in the ring that is on another rack.  If you don't alternate racks (or in this case availability zones) in token order, that can lead to serious hotspots.  For more on this with EC2, see http://www.slideshare.net/mattdennis/cassandra-on-ec2/5 where Matt Dennis talks about alternating zones.
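
To make the ring walk concrete, here is a toy sketch (illustrative only, not Cassandra's actual replica-placement code; node names and the rack layout are made up):

def next_off_rack(ring, start, used_racks):
    """ring: list of (node, rack) pairs in token order; walk clockwise from
    start and return the first node whose rack has not been used yet."""
    for step in range(1, len(ring)):
        node, rack = ring[(start + step) % len(ring)]
        if rack not in used_racks:
            return node
    return None

# With racks grouped rather than alternated in token order, the "next token
# on another rack" lands on the same node for several consecutive ranges:
ring = [("n1", "b"), ("n2", "b"), ("n3", "a"), ("n4", "a"), ("n5", "c"), ("n6", "c")]
print([next_off_rack(ring, i, {ring[i][1]}) for i in range(len(ring))])
# -> ['n3', 'n3', 'n5', 'n5', 'n1', 'n1']  (n2, n4 and n6 are never chosen)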

On Aug 25, 2011, at 10:45 AM, mcasandra wrote:

> Thanks for the update
> 
> Jeremy Hanna wrote:
>> 
>> It appears though that when choosing the non-local replicas, it looks for
>> the next token in the ring of the same rack and the next token of a
>> different rack (depending on which it is looking for).  
> 
> Can you please explain this a little more?
> 
> 
> --
> View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/4-20-nodes-get-disproportionate-amount-of-mutations-tp6714958p6724943.html
> Sent from the cassandra-user@incubator.apache.org mailing list archive at Nabble.com.


Re: 4/20 nodes get disproportionate amount of mutations

Posted by mcasandra <mo...@gmail.com>.
Thanks for the update

Jeremy Hanna wrote:
> 
> It appears though that when choosing the non-local replicas, it looks for
> the next token in the ring of the same rack and the next token of a
> different rack (depending on which it is looking for).  

Can you please explain this a little more?


--
View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/4-20-nodes-get-disproportionate-amount-of-mutations-tp6714958p6724943.html
Sent from the cassandra-user@incubator.apache.org mailing list archive at Nabble.com.

Re: 4/20 nodes get disproportionate amount of mutations

Posted by Jeremy Hanna <je...@gmail.com>.
As somewhat of a conclusion to this thread, we have resolved the major issue behind the hotspots.  Our nodes were evenly balanced across availability zones a, b, and c in AWS EC2 us-east.  However, we didn't alternate racks in token order.  We are using the PropertyFileSnitch and had defined each node as being part of a single DC (for now) and a rack (a-c), but we should have assigned the tokens so that, in token order, the racks go from a to b to c.

So as an example of alternating:
<token1>: <node in rack b>
<token2>: <node in rack a>
<token3>: <node in rack c>
<token4>: <node in rack b>
…

So we took some time to shift things around, and load now appears to be much more evenly spread across the cluster.
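
For anyone following along with the PropertyFileSnitch mentioned above, the DC/rack assignments live in conf/cassandra-topology.properties; a sketch with placeholder addresses, listed so that racks alternate in token order:

# conf/cassandra-topology.properties -- placeholder IPs, single DC, racks alternating in token order
# format: <node IP>=<data center>:<rack>
10.0.0.1=DC1:b
10.0.0.2=DC1:a
10.0.0.3=DC1:c
10.0.0.4=DC1:b
# fallback for nodes not listed above
default=DC1:a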

As a side note on how we discovered the root of the problem: we kept poring over logs and stats, but what really helped was trialing DataStax's OpsCenter product to get a view of the cluster.  All of that info is available via JMX, and we could have mapped out node replication ourselves or with another monitoring product, but that's what helped us, along with Brandon and others from the community who helped us discover the reason.  Thanks again.

The reason the order matters: currently, when replicating, I believe NetworkTopologyStrategy uses a pattern of choosing the local node for the first replica, an in-rack node for the second replica, and an off-rack node for the next replica (depending on replication factor).  It appears though that when choosing the non-local replicas, it looks for the next token in the ring of the same rack and the next token of a different rack (depending on which it is looking for).  So that is why alternating by rack is important.  It would be nice if that were smarter in the future, so you wouldn't have to care and could just let Cassandra spread the replication around intelligently.
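
A toy simulation of that effect (this is not NetworkTopologyStrategy itself, just the "walk the ring, skip racks already used" behaviour described above, with made-up layouts): with RF=3, grouping racks in token order piles most replica ranges onto a third of the nodes, while alternating spreads them evenly.

def replica_counts(racks_in_token_order, rf=3):
    """Count how many token ranges each node replicates, using a simplified
    rack-aware placement: primary = the range owner, remaining replicas =
    walk the ring clockwise taking the first nodes on racks not yet used."""
    n = len(racks_in_token_order)
    counts = [0] * n
    for owner in range(n):  # one (equal-sized) range per node
        replicas = [owner]
        used_racks = {racks_in_token_order[owner]}
        step = 1
        while len(replicas) < rf and step < n:
            cand = (owner + step) % n
            if racks_in_token_order[cand] not in used_racks:
                replicas.append(cand)
                used_racks.add(racks_in_token_order[cand])
            step += 1
        for r in replicas:
            counts[r] += 1
    return counts

print(replica_counts(list("aabbcc")))  # grouped racks -> [5, 1, 5, 1, 5, 1]
print(replica_counts(list("abcabc")))  # alternating   -> [3, 3, 3, 3, 3, 3]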

On Aug 23, 2011, at 6:02 AM, Jeremy Hanna wrote:

> 
> On Aug 23, 2011, at 3:43 AM, aaron morton wrote:
> 
>> Dropped messages in ReadRepair are odd. Are you also dropping mutations?
>> 
>> There are two tasks performed on the ReadRepair stage: first the digests are compared, and second the repair itself happens on the same stage. Comparing digests is quick. Doing the repair can take a bit longer: all the CFs returned are collated, filtered, and deletes are removed.
>> 
>> We don't do background Read Repair on range scans; they do have foreground digest checking though.
>> 
>> What CL are you using ? 
> 
> CL.ONE for hadoop writes, CL.QUORUM for hadoop reads
> 
>> 
>> begin crazy theory:
>> 
>> 	Could there be a very big row that is out of sync? The increased RR would result in mutations being sent back to the replicas, which would give you a hot spot in mutations.
>> 	
>> 	Check max compacted row size on the hot nodes. 
>> 	
>> 	Turn the logging up to DEBUG on the hot machines for o.a.c.service.RowRepairResolver and look for the "resolve:…" message; it includes the time taken.
> 
> The max compacted size didn't seem unreasonable - about a MB.  I turned up logging to DEBUG for that class and I get plenty of dropped READ_REPAIR messages, but I don't see anything at DEBUG in the logs indicating the time taken.
> 
>> 
>> Cheers
>> 
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>> 
>> On 23/08/2011, at 7:52 PM, Jeremy Hanna wrote:
>> 
>>> 
>>> On Aug 23, 2011, at 2:25 AM, Peter Schuller wrote:
>>> 
>>>>> We've recently been having issues where, as soon as we start doing heavy writes (via Hadoop), it really hammers 4 of our 20 nodes.  We're using RandomPartitioner, and we've set the initial tokens for our 20 nodes according to the general spacing formula, except for a few token offsets where we've replaced dead nodes.
>>>> 
>>>> Is the hadoop job iterating over keys in the cluster in token order
>>>> perhaps, and you're generating writes to those keys? That would
>>>> explain a "moving hotspot" along the cluster.
>>> 
>>> Yes - we're iterating over all the keys of particular column families, doing joins using pig as we enrich and perform measure calculations.  When we write, we're usually writing out for a certain small subset of keys which shouldn't have hotspots with RandomPartitioner afaict.
>>> 
>>>> 
>>>> -- 
>>>> / Peter Schuller (@scode on twitter)
>>> 
>> 
> 


Re: 4/20 nodes get disproportionate amount of mutations

Posted by Jeremy Hanna <je...@gmail.com>.
On Aug 23, 2011, at 3:43 AM, aaron morton wrote:

> Dropped messages in ReadRepair are odd. Are you also dropping mutations?
> 
> There are two tasks performed on the ReadRepair stage: first the digests are compared, and second the repair itself happens on the same stage. Comparing digests is quick. Doing the repair can take a bit longer: all the CFs returned are collated, filtered, and deletes are removed.
> 
> We don't do background Read Repair on range scans; they do have foreground digest checking though.
> 
> What CL are you using ? 

CL.ONE for hadoop writes, CL.QUORUM for hadoop reads

> 
> begin crazy theory:
> 
> 	Could there be a very big row that is out of sync? The increased RR would result in mutations being sent back to the replicas, which would give you a hot spot in mutations.
> 	
> 	Check max compacted row size on the hot nodes. 
> 	
> 	Turn the logging up to DEBUG on the hot machines for o.a.c.service.RowRepairResolver and look for the "resolve:…" message; it includes the time taken.

The max compacted size didn't seem unreasonable - about a MB.  I turned up logging to DEBUG for that class and I get plenty of dropped READ_REPAIR messages, but I don't see anything at DEBUG in the logs indicating the time taken.

> 
> Cheers
> 
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
> 
> On 23/08/2011, at 7:52 PM, Jeremy Hanna wrote:
> 
>> 
>> On Aug 23, 2011, at 2:25 AM, Peter Schuller wrote:
>> 
>>>> We've recently been having issues where, as soon as we start doing heavy writes (via Hadoop), it really hammers 4 of our 20 nodes.  We're using RandomPartitioner, and we've set the initial tokens for our 20 nodes according to the general spacing formula, except for a few token offsets where we've replaced dead nodes.
>>> 
>>> Is the hadoop job iterating over keys in the cluster in token order
>>> perhaps, and you're generating writes to those keys? That would
>>> explain a "moving hotspot" along the cluster.
>> 
>> Yes - we're iterating over all the keys of particular column families, doing joins using pig as we enrich and perform measure calculations.  When we write, we're usually writing out for a certain small subset of keys which shouldn't have hotspots with RandomPartitioner afaict.
>> 
>>> 
>>> -- 
>>> / Peter Schuller (@scode on twitter)
>> 
> 


Re: 4/20 nodes get disproportionate amount of mutations

Posted by aaron morton <aa...@thelastpickle.com>.
Dropped messages in ReadRepair are odd. Are you also dropping mutations?

There are two tasks performed on the ReadRepair stage: first the digests are compared, and second the repair itself happens on the same stage. Comparing digests is quick. Doing the repair can take a bit longer: all the CFs returned are collated, filtered, and deletes are removed.
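
A toy sketch of what that second task amounts to (illustrative only, not the actual RowRepairResolver; tombstone handling is omitted): collate the columns returned by each replica, keep the newest version of each, and send the differences back to stale replicas as mutations. This is also why a large out-of-sync row can translate into a burst of writes on its replica nodes.

def resolve(replica_responses):
    """replica_responses: {replica: {column_name: (timestamp, value)}}"""
    # Collate: keep the newest version of each column across all replicas.
    merged = {}
    for columns in replica_responses.values():
        for name, (ts, value) in columns.items():
            if name not in merged or ts > merged[name][0]:
                merged[name] = (ts, value)
    # Diff: anything a replica is missing, or holds a stale version of,
    # would be sent back to that replica as a repair mutation.
    repairs = {}
    for replica, columns in replica_responses.items():
        diff = {name: tv for name, tv in merged.items() if columns.get(name) != tv}
        if diff:
            repairs[replica] = diff
    return merged, repairs

# resolve({"A": {"c1": (2, "x")}, "B": {"c1": (1, "w"), "c2": (3, "y")}})
# -> merged has the newest c1 and c2; A gets c2 back, B gets the newer c1.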

We don't do background Read Repair on range scans; they do have foreground digest checking though.

What CL are you using ? 

begin crazy theory:

	Could there be a very big row that is out of sync? The increased RR would result in mutations being sent back to the replicas, which would give you a hot spot in mutations.
	
	Check max compacted row size on the hot nodes. 
	
	Turn the logging up to DEBUG on the hot machines for o.a.c.service.RowRepairResolver and look for the "resolve:…" message; it includes the time taken.
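
One way to do that, assuming the stock conf/log4j-server.properties that ships with 0.7.x (depending on how log4j reloading is configured, a restart may be needed):

# add to conf/log4j-server.properties on the hot nodes
log4j.logger.org.apache.cassandra.service.RowRepairResolver=DEBUG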

Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 23/08/2011, at 7:52 PM, Jeremy Hanna wrote:

> 
> On Aug 23, 2011, at 2:25 AM, Peter Schuller wrote:
> 
>>> We've recently been having issues where, as soon as we start doing heavy writes (via Hadoop), it really hammers 4 of our 20 nodes.  We're using RandomPartitioner, and we've set the initial tokens for our 20 nodes according to the general spacing formula, except for a few token offsets where we've replaced dead nodes.
>> 
>> Is the hadoop job iterating over keys in the cluster in token order
>> perhaps, and you're generating writes to those keys? That would
>> explain a "moving hotspot" along the cluster.
> 
> Yes - we're iterating over all the keys of particular column families, doing joins using pig as we enrich and perform measure calculations.  When we write, we're usually writing out for a certain small subset of keys which shouldn't have hotspots with RandomPartitioner afaict.
> 
>> 
>> -- 
>> / Peter Schuller (@scode on twitter)
> 


Re: 4/20 nodes get disproportionate amount of mutations

Posted by Jeremy Hanna <je...@gmail.com>.
On Aug 23, 2011, at 2:25 AM, Peter Schuller wrote:

>> We've recently been having issues where, as soon as we start doing heavy writes (via Hadoop), it really hammers 4 of our 20 nodes.  We're using RandomPartitioner, and we've set the initial tokens for our 20 nodes according to the general spacing formula, except for a few token offsets where we've replaced dead nodes.
> 
> Is the hadoop job iterating over keys in the cluster in token order
> perhaps, and you're generating writes to those keys? That would
> explain a "moving hotspot" along the cluster.

Yes - we're iterating over all the keys of particular column families, doing joins using pig as we enrich and perform measure calculations.  When we write, we're usually writing out for a certain small subset of keys which shouldn't have hotspots with RandomPartitioner afaict.

> 
> -- 
> / Peter Schuller (@scode on twitter)


Re: 4/20 nodes get disproportionate amount of mutations

Posted by Peter Schuller <pe...@infidyne.com>.
> We've recently been having issues where, as soon as we start doing heavy writes (via Hadoop), it really hammers 4 of our 20 nodes.  We're using RandomPartitioner, and we've set the initial tokens for our 20 nodes according to the general spacing formula, except for a few token offsets where we've replaced dead nodes.

Is the hadoop job iterating over keys in the cluster in token order
perhaps, and you're generating writes to those keys? That would
explain a "moving hotspot" along the cluster.

-- 
/ Peter Schuller (@scode on twitter)