Posted to user@cassandra.apache.org by Michael Morris <mi...@gmail.com> on 2012/08/16 16:56:15 UTC

nodetool repair uses insane amount of disk space

Occasionally as I'm doing my regular anti-entropy repair I end up with a
node that uses an exceptional amount of disk space (node should have about
5-6 GB of data on it, but ends up with 25+GB, and consumes the limited
amount of disk space I have available)

How come a node would consume 5x its normal data size during the repair
process?

My setup is kind of strange in that it's only about 80-100GB of data on a
35 node cluster, with 2 data centers and 3 racks, however the rack
assignments are unbalanced.  One data center has 8 nodes, and the other
data center is split into 2 racks with one rack of 9 nodes, and the other
with 18 nodes.  However, within each rack, the tokens are distributed
equally. It's a long sad story about how we ended up this way, but it
basically boils down to having to utilize existing resources to resolve a
production issue.

Additionally, the repair process takes (what I feel is) an extremely long
time to complete (36+ hours), and it always seems that nodes are streaming
data to each other, even on back-to-back executions of the repair.

Any help on these issues is appreciated.

- Mike

Re: nodetool repair uses insane amount of disk space

Posted by Jim Cistaro <jc...@netflix.com>.
We see similar issues with some of the repairs at Netflix.

Regarding the growth in payload… we see similar symptoms where nodes can double or triple in size.  Part of this may be because the repair may deal in large chunks for comparisons.  This means that even if there is one byte of entropy, you copy over a large chunk.  Another reason for the large growth is that if node A is inconsistent with replicas on B and C, you will copy over multiple sets of large chunks (one from each of the replicas) - even more sets in a multi-datacenter environment.  (We are still investigating/analyzing the causes of such occurrences in our clusters; the above explanation is a possible cause.)

Are you only seeing growth on one node in the system?  You might want to check whether other nodes' logs show gossip issues with this node, and whether you are creating a lot of hints (check your hint settings to make sure you save and replay them).  That may be why you see streaming even on back-to-back executions.
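
For reference, a rough way to poke at gossip state and hint buildup from the command line (host names are placeholders, and the exact name of the hints column family varies by version, so treat this as a sketch):

    # How a neighbor sees the suspect node (status, schema agreement, etc.)
    nodetool -h some-other-node gossipinfo
    nodetool -h some-other-node ring

    # Hint buildup shows up as growth of the hints column family in the system keyspace
    nodetool -h suspect-node cfstats | grep -i -A 5 hints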

It is worth noting that we do major compactions (I am not suggesting you do this, just pointing it out for reference) and then see the payload shrink back down to normal.  So a lot of that payload increase appears to be redundant (most likely due to the chunking issue above).
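
(For anyone following along, a major compaction here is just the nodetool command below; as Jim notes, it has trade-offs in this era since it leaves one very large SSTable behind.  Host and keyspace names are placeholders.)

    # Major compaction of one keyspace on one node
    nodetool -h node1.example.com compact MyKeyspace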

Regarding processing time… Are you repairing each node serially? Are you repairing with the primary range option?
AFAIK, you most likely want to use -pr.  Otherwise, the further you get into the list of nodes, the more data has to go through the validation compaction (because you increased the size of some of your nodes).  Using -pr means you only repair a range once when repairing the cluster.  Without it, you repair the range on each node/replica.
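
As a sketch of what that looks like (host and keyspace names are placeholders), a serial, primary-range-only repair of the cluster is roughly:

    # Repair only each node's primary range, one node at a time
    for host in node1 node2 node3; do
        nodetool -h "$host" repair -pr MyKeyspace
    done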

Jim


From: aaron morton <aa...@thelastpickle.com>
Reply-To: <us...@cassandra.apache.org>
Date: Fri, 17 Aug 2012 20:40:54 +1200
To: <us...@cassandra.apache.org>
Subject: Re: nodetool repair uses insane amount of disk space

I would take a look at the replication: what's the RF per DC and what does nodetool ring say. It's hard (as in not recommended) to get NTS with rack allocation working correctly. Without knowing much more I would try to understand what the topology is and if it can be simplified.

Additionally, the repair process takes (what I feel is) an extremely long time to complete (36+ hours), and it always seems that nodes are streaming data to each other, even on back-to-back executions of the repair.
Run some metrics to clock the network IO during repair.
Also run an experiment to repair a single CF twice from the same node and look at the logs for the second run. This will give us an idea of how much data is being transferred.
Note that very wide rows can result in large repair transfers as the whole row is diff'd and transferred if needed.

Hope that helps.


-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 17/08/2012, at 11:14 AM, Michael Morris <mi...@gmail.com> wrote:

Upgraded to 1.1.3 from 1.0.8 about 2 weeks ago.

On Thu, Aug 16, 2012 at 5:57 PM, aaron morton <aa...@thelastpickle.com> wrote:
What version are you using? There were issues with repair using lots-o-space in 0.8.X; it's fixed in 1.X

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 17/08/2012, at 2:56 AM, Michael Morris <mi...@gmail.com> wrote:

Occasionally as I'm doing my regular anti-entropy repair I end up with a node that uses an exceptional amount of disk space (node should have about 5-6 GB of data on it, but ends up with 25+GB, and consumes the limited amount of disk space I have available)

How come a node would consume 5x its normal data size during the repair process?

My setup is kind of strange in that it's only about 80-100GB of data on a 35 node cluster, with 2 data centers and 3 racks, however the rack assignments are unbalanced.  One data center has 8 nodes, and the other data center is split into 2 racks with one rack of 9 nodes, and the other with 18 nodes.  However, within each rack, the tokens are distributed equally. It's a long sad story about how we ended up this way, but it basically boils down to having to utilize existing resources to resolve a production issue.

Additionally, the repair process takes (what I feel is) an extremely long time to complete (36+ hours), and it always seems that nodes are streaming data to each other, even on back-to-back executions of the repair.

Any help on these issues is appreciated.

- Mike





Re: nodetool repair uses insane amount of disk space

Posted by aaron morton <aa...@thelastpickle.com>.
I would take a look at the replication: what's the RF per DC and what does nodetool ring say. It's hard (as in not recommended) to get NTS with rack allocation working correctly. Without knowing much more I would try to understand what the topology is and if it can be simplified.
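
As a minimal sketch of that check (the keyspace name is a placeholder):

    # Token ownership and load per node
    nodetool -h localhost ring

    # Then, inside cassandra-cli, the strategy and per-DC RF for the keyspace:
    #   describe MyKeyspace;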

>> Additionally, the repair process takes (what I feel is) an extremely long time to complete (36+ hours), and it always seems that nodes are streaming data to each other, even on back-to-back executions of the repair.
Run some metrics to clock the network IO during repair. 
Also run an experiment to repair a single CF twice from the same node and look at the logs for the second run. This will give us an idea of how much data is being transferred. 
Note that very wide rows can result in large repair transfers as the whole row is diff'd and transferred if needed.
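
A rough version of the "repair a single CF twice" experiment (keyspace, column family, and log path are placeholders, and the exact log wording may differ between versions):

    # Repair one CF twice in a row from the same node
    nodetool -h node1 repair MyKeyspace MyColumnFamily
    nodetool -h node1 repair MyKeyspace MyColumnFamily

    # Watch streaming while the repairs run
    nodetool -h node1 netstats

    # On the second run, the log should say how far out of sync the replicas still were
    grep -i "out of sync" /var/log/cassandra/system.log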
 
Hope that helps. 


-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 17/08/2012, at 11:14 AM, Michael Morris <mi...@gmail.com> wrote:

> Upgraded to 1.1.3 from 1.0.8 about 2 weeks ago.
> 
> On Thu, Aug 16, 2012 at 5:57 PM, aaron morton <aa...@thelastpickle.com> wrote:
> What version are you using? There were issues with repair using lots-o-space in 0.8.X; it's fixed in 1.X
> 
> Cheers
> 
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
> 
> On 17/08/2012, at 2:56 AM, Michael Morris <mi...@gmail.com> wrote:
> 
>> Occasionally as I'm doing my regular anti-entropy repair I end up with a node that uses an exceptional amount of disk space (node should have about 5-6 GB of data on it, but ends up with 25+GB, and consumes the limited amount of disk space I have available)
>> 
>> How come a node would consume 5x its normal data size during the repair process?
>> 
>> My setup is kind of strange in that it's only about 80-100GB of data on a 35 node cluster, with 2 data centers and 3 racks, however the rack assignments are unbalanced.  One data center has 8 nodes, and the other data center is split into 2 racks with one rack of 9 nodes, and the other with 18 nodes.  However, within each rack, the tokens are distributed equally. It's a long sad story about how we ended up this way, but it basically boils down to having to utilize existing resources to resolve a production issue.
>> 
>> Additionally, the repair process takes (what I feel is) an extremely long time to complete (36+ hours), and it always seems that nodes are streaming data to each other, even on back-to-back executions of the repair.
>> 
>> Any help on these issues is appreciated.
>> 
>> - Mike
>> 
> 
> 


Re: nodetool repair uses insane amount of disk space

Posted by Michael Morris <mi...@gmail.com>.
Upgraded to 1.1.3 from 1.0.8 about 2 weeks ago.

On Thu, Aug 16, 2012 at 5:57 PM, aaron morton <aa...@thelastpickle.com> wrote:

> What version are you using? There were issues with repair using lots-o-space
> in 0.8.X; it's fixed in 1.X
>
> Cheers
>
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 17/08/2012, at 2:56 AM, Michael Morris <mi...@gmail.com>
> wrote:
>
> Occasionally as I'm doing my regular anti-entropy repair I end up with a
> node that uses an exceptional amount of disk space (node should have about
> 5-6 GB of data on it, but ends up with 25+GB, and consumes the limited
> amount of disk space I have available)
>
> How come a node would consume 5x its normal data size during the repair
> process?
>
> My setup is kind of strange in that it's only about 80-100GB of data on a
> 35 node cluster, with 2 data centers and 3 racks, however the rack
> assignments are unbalanced.  One data center has 8 nodes, and the other
> data center is split into 2 racks with one rack of 9 nodes, and the other
> with 18 nodes.  However, within each rack, the tokens are distributed
> equally. It's a long sad story about how we ended up this way, but it
> basically boils down to having to utilize existing resources to resolve a
> production issue.
>
> Additionally, the repair process takes (what I feel is) an extremely long
> time to complete (36+ hours), and it always seems that nodes are streaming
> data to each other, even on back-to-back executions of the repair.
>
> Any help on these issues is appreciated.
>
> - Mike
>
>
>

Re: nodetool repair uses insane amount of disk space

Posted by aaron morton <aa...@thelastpickle.com>.
What version are you using? There were issues with repair using lots-o-space in 0.8.X; it's fixed in 1.X

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 17/08/2012, at 2:56 AM, Michael Morris <mi...@gmail.com> wrote:

> Occasionally as I'm doing my regular anti-entropy repair I end up with a node that uses an exceptional amount of disk space (node should have about 5-6 GB of data on it, but ends up with 25+GB, and consumes the limited amount of disk space I have available)
> 
> How come a node would consume 5x its normal data size during the repair process?
> 
> My setup is kind of strange in that it's only about 80-100GB of data on a 35 node cluster, with 2 data centers and 3 racks, however the rack assignments are unbalanced.  One data center has 8 nodes, and the other data center is split into 2 racks with one rack of 9 nodes, and the other with 18 nodes.  However, within each rack, the tokens are distributed equally. It's a long sad story about how we ended up this way, but it basically boils down to having to utilize existing resources to resolve a production issue.
> 
> Additionally, the repair process takes (what I feel is) an extremely long time to complete (36+ hours), and it always seems that nodes are streaming data to each other, even on back-to-back executions of the repair.
> 
> Any help on these issues is appreciated.
> 
> - Mike
> 


Re: nodetool repair uses insane amount of disk space

Posted by Michael Morris <mi...@gmail.com>.
Thanks, everyone, for the pointers.  I've found an opportunity to simplify
the setup: still a 2 DC / 3 rack setup (RF = 1 for the DC with 1 rack, and
RF = 2 for the DC with 2 racks), but now each rack contains 9 nodes with
even token distribution.
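
(For reference, a keyspace with that kind of per-DC replication would be defined roughly like this in cassandra-cli; the keyspace and data center names here are made up, so substitute your own.)

    create keyspace MyKeyspace
      with placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy'
      and strategy_options = {DC1:1, DC2:2};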

Once I got the new topology in place, I ran multiple repairs (serially) on
a single node to see if I could get the Merkle trees to sync up with the
other nodes in that range.  I knew the 1st run, and even expected the 2nd
run, would be a bit out of sync.  What surprised me was that on the 3rd
repair run there were still over 600 ranges out of sync for one CF, and
over 1000 ranges out of sync for another.  To me, this isn't a big deal
(unless someone more knowledgeable about these things thinks it is), and
at least the repair process isn't using nearly as much space while it's
doing its work.
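
A quick way to tally those reports from the node's log, assuming log lines of the form "... have N range(s) out of sync for <CF> ..." (the log path is whatever your install uses):

    # Count the out-of-sync reports from the repair sessions
    grep -c "out of sync" /var/log/cassandra/system.log

    # Or eyeball which column families they were for
    grep "out of sync" /var/log/cassandra/system.log | tail -20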

Re: nodetool repair uses insane amount of disk space

Posted by Peter Schuller <pe...@infidyne.com>.
> How come a node would consume 5x its normal data size during the repair
> process?

https://issues.apache.org/jira/browse/CASSANDRA-2699

It's likely a variation based on how out of synch you happen to be,
and whether you have a neighbor that's also been repaired and bloated
up already.

> My setup is kind of strange in that it's only about 80-100GB of data on a 35
> node cluster, with 2 data centers and 3 racks, however the rack assignments
> are unbalanced.  One data center has 8 nodes, and the other data center is
> split into 2 racks with one rack of 9 nodes, and the other with 18 nodes.
> However, within each rack, the tokens are distributed equally. It's a long
> sad story about how we ended up this way, but it basically boils down to
> having to utilize existing resources to resolve a production issue.

https://issues.apache.org/jira/browse/CASSANDRA-3810

In terms of DCs, different DCs are effectively independent of each
other in terms of replica placement. So there is no need or desire for
two DCs to be symmetrical.

The racks are important though if you are trying to take advantage of
racks being somewhat independent failure domains (for reasons outlined
in 3810 above).

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)