Posted to user@cassandra.apache.org by Wei Zhu <wz...@yahoo.com> on 2013/01/31 19:50:40 UTC

General question regarding bootstrap and nodetool repair

Hi,
After messing around with my Cassandra cluster recently, I think I need a basic understanding of how data streaming works behind the scenes.
Let's say we have a three-node cluster with RF = 3. If node 3 dies for some reason and I want to replace it with a new node with the same (maybe minus one) range, how is the data streamed during the bootstrap?
From what I observed, node 3 has replicas of its primary range on nodes 4 and 5, so it streams the data from them and starts to compact it. Node 3 also holds replicas of the primary range of node 2, so it streams data from node 2 and node 4. Similarly, it holds replicas for node 1, so data is streamed from node 1 and node 2. So during bootstrapping it basically gets the data from all the replicas (2 copies each), which means it will require double the disk space to hold the data? Over time, those SSTables will be compacted and the redundant data removed? Is that true?
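
To make the replica layout described above concrete, here is a minimal sketch. It assumes a five-node ring (the message mentions nodes 1 through 5) and a simple placement rule where each primary range is copied to its owner plus the next RF-1 nodes clockwise. The function names are hypothetical and not part of Cassandra.

```python
# Hypothetical ring: node i owns primary range i (1-based numbering).
# Assumption: each range lives on its owner and the next RF-1 nodes clockwise.

def replicas_for_range(owner, num_nodes, rf):
    """Return the nodes that hold the primary range of `owner`."""
    return [((owner - 1 + i) % num_nodes) + 1 for i in range(rf)]

def ranges_held_by(node, num_nodes, rf):
    """Return the owners whose primary ranges `node` stores a copy of."""
    return [owner for owner in range(1, num_nodes + 1)
            if node in replicas_for_range(owner, num_nodes, rf)]

NUM_NODES, RF = 5, 3
print(replicas_for_range(3, NUM_NODES, RF))  # [3, 4, 5]
print(ranges_held_by(3, NUM_NODES, RF))      # [1, 2, 3]
```

Under these assumptions node 3 stores ranges 1, 2 and 3, and the other copies of those ranges sit on nodes 1, 2, 4 and 5, which matches the streaming sources listed above.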

If we issue nodetool repair -pr on node 3, apart from data streaming from nodes 4 and 5 to node 3, we also see data streamed between nodes 4 and 5, since they hold the replicas. But I don't see any log entries regarding "merkle tree calculation" on nodes 4 and 5. How do they know what data to stream in order to repair nodes 4 and 5?

Thanks.
-Wei

Re: General question regarding bootstrap and nodetool repair

Posted by Wei Zhu <wz...@yahoo.com>.
Thanks Rob. I think you are right on it.

Here is what I found:

https://github.com/apache/cassandra/blob/cassandra-1.1.0/src/java/org/apache/cassandra/dht/RangeStreamer.java#L140


It sorts the endpoints by proximity, and in

https://github.com/apache/cassandra/blob/cassandra-1.1.0/src/java/org/apache/cassandra/dht/RangeStreamer.java#L171


it fetches the data from only one source.

That answers my question. So we will have to run repair after the bootstrap to ensure consistency.
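
The behavior described above (sort the candidate endpoints by proximity, then fetch each range from a single source) can be sketched roughly as follows. This is only an illustration, not the RangeStreamer code: the node names, range labels and proximity scores are made up, and `pick_sources` is a hypothetical helper.

```python
def pick_sources(range_replicas, proximity):
    """For each range, sort candidate replicas by proximity and keep only the closest."""
    chosen = {}
    for token_range, candidates in range_replicas.items():
        ordered = sorted(candidates, key=lambda node: proximity[node])
        chosen[token_range] = ordered[0]  # a single source per range
    return chosen

# Hypothetical input: surviving replicas for each range the new node 3 must hold.
range_replicas = {
    "range-1": ["node1", "node2"],
    "range-2": ["node2", "node4"],
    "range-3": ["node4", "node5"],
}
# Hypothetical proximity scores; lower means closer.
proximity = {"node1": 3, "node2": 1, "node4": 2, "node5": 5}

sources = pick_sources(range_replicas, proximity)
print(sources)
```

Because each range comes from one replica only, any write that the chosen replica missed is also missing on the new node, which is why a repair is still needed afterwards.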

Thanks.
-Wei



________________________________
 From: Rob Coli <rc...@palominodb.com>
To: user@cassandra.apache.org 
Sent: Thursday, January 31, 2013 1:50 PM
Subject: Re: General question regarding bootstrap and nodetool repair
 
On Thu, Jan 31, 2013 at 12:19 PM, Wei Zhu <wz...@yahoo.com> wrote:
> But I am still not sure about my first question regarding the
> bootstrap. Anyone?

As I understand it, bootstrap occurs from a single replica. Which
replica is chosen is based on some internal estimation of which is
closest/least loaded/etc. But only from a single replica, so in RF=3,
in order to be consistent with both you still have to run a repair.

=Rob

-- 
=Robert Coli
AIM&GTALK - rcoli@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb

Re: General question regarding bootstrap and nodetool repair

Posted by Wei Zhu <wz...@yahoo.com>.
I decided to dig into the source code. It looks like, in the case of nodetool repair, if the current node sees a difference between the remote nodes based on the Merkle tree calculation, it will start a streamrepair session to ask the remote nodes to stream data between each other.
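
A rough sketch of that idea: the node running the repair compares a hash summary from every replica pairwise, and any pair that differs has to exchange data, including pairs that do not involve the initiating node. This is a simplified stand-in (flat per-bucket hashes instead of a real Merkle tree); the node names, rows and helper functions are hypothetical.

```python
import hashlib
from itertools import combinations

def range_hashes(rows, num_buckets=4):
    """Hash each bucket of keys; a flat stand-in for the leaves of a Merkle tree."""
    buckets = [hashlib.sha256() for _ in range(num_buckets)]
    for key in sorted(rows):
        # Deterministic bucket assignment derived from the key itself.
        idx = int(hashlib.sha256(key.encode()).hexdigest(), 16) % num_buckets
        buckets[idx].update(f"{key}={rows[key]};".encode())
    return [b.hexdigest() for b in buckets]

def differing_pairs(replica_rows):
    """Compare every pair of replicas and return the pairs whose hashes differ."""
    hashes = {node: range_hashes(rows) for node, rows in replica_rows.items()}
    return [(a, b) for a, b in combinations(sorted(hashes), 2)
            if hashes[a] != hashes[b]]

# Hypothetical state: node4 has one row that the other two replicas lack.
replica_rows = {
    "node3": {"k1": "v1", "k2": "v2"},
    "node4": {"k1": "v1", "k2": "v2", "k3": "v3"},
    "node5": {"k1": "v1", "k2": "v2"},
}
print(differing_pairs(replica_rows))
```

In this example the pairs (node3, node4) and (node4, node5) differ, so node4 and node5 would stream to each other even though the repair was started on node3.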

But I am still not sure about my first question regarding the bootstrap. Anyone?

Thanks.
-Wei
