Posted to user@cassandra.apache.org by Yang <te...@gmail.com> on 2015/07/28 10:31:22 UTC

Re: question about bootstrapping sequence

I'm wondering how the Cassandra protocol brings a newly bootstrapped node
"up to speed".

For ease of illustration, let's say we have just one key, K, whose value
is continually updated: 1, 2, 3, 4, ...

Originally we have one node, A. Now node B joins and needs to bootstrap and
get its newly assigned range (just "K") from A.

Now let's say A has seen updates 1, 2, 3 up to this point. According to
StreamingRequestVerbHandler, A flushes its memtable, then streams out the
new sstables.


But what if, while the newly flushed sstable is being streamed out from A
and before B has fully received it, A gets more updates: 4, 5, 6, ...?

Now B gets the streamed range, happily declares itself ready, and joins
the ring. But it's actually not "up to speed" with the "old members",
because A now has K=6 while B has K=3.
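
The worry above can be written down as a toy model. This is only a sketch of
the scenario as described in this message, not Cassandra code: the `Node`
class and `flush_snapshot` are hypothetical names, and the model assumes
that B receives nothing except the flushed snapshot.

```python
# Toy model of the feared scenario; NOT Cassandra code.
# Assumption: B only ever receives the snapshot that A flushed.

class Node:
    def __init__(self, name):
        self.name = name
        self.data = {}          # key -> latest value seen by this node

    def write(self, key, value):
        self.data[key] = value

    def flush_snapshot(self):
        # Stand-in for "flush the memtable, then stream the sstables".
        return dict(self.data)


a = Node("A")
for v in (1, 2, 3):
    a.write("K", v)

snapshot = a.flush_snapshot()   # A flushes; streaming to B begins

for v in (4, 5, 6):             # updates arriving while streaming is in flight
    a.write("K", v)

b = Node("B")
b.data.update(snapshot)         # B finishes receiving the streamed data

print(a.data["K"], b.data["K"])  # prints "6 3"
```

Under that assumption B ends up at K=3 while A is at K=6, which is exactly
the gap being asked about.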


Of course, when clients query now, A's and B's results are reconciled, so
the client gets the latest result. But would B stay "not up to speed"
forever? How can we bring it up to speed? Although the following is a very
hypothetical scenario, it would lead to lost writes: say B is still in the
"not up to date" state, then another node is removed and a new node is
inserted; after more such cycles, all the "up to date" nodes are gone, and
we essentially lose the latest writes.
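
The read-time reconciliation mentioned above can be illustrated with a
last-write-wins merge on write timestamps. This is a minimal sketch: `Cell`
and `reconcile` are hypothetical names, and the timestamps are made up for
the example.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    value: int
    timestamp: int   # write timestamp; higher means newer


def reconcile(*cells):
    # Last-write-wins: keep the cell with the highest timestamp.
    return max(cells, key=lambda c: c.timestamp)


from_a = Cell(value=6, timestamp=6)   # A has seen every update
from_b = Cell(value=3, timestamp=3)   # B only has the streamed state
latest = reconcile(from_a, from_b)
print(latest.value)  # prints 6
```

The client sees 6 only as long as some queried replica still holds the newer
cell, which is why the scenario above worries about those replicas going
away.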

Re: question about bootstrapping sequence

Posted by Yang <te...@gmail.com>.

Thanks. Hmm, somehow I had the impression that until B's streamingIn
finished, it does not advertise itself to other servers for receiving fresh
replications. Looks like I'm wrong here, let me check the code...
On Jul 28, 2015 2:07 PM, "Robert Coli" <rc...@eventbrite.com> wrote:

> On Tue, Jul 28, 2015 at 1:01 PM, Yang <te...@gmail.com> wrote:
>
>> Thanks, but I don't think having more nodes in the example changes the
>> issue I outlined.
>>
>> Say you have just key "X", rf = 3, and nodes A, B, D are responsible
>> for "X".
>>
>> In stable mode, the updates X=1, 2, 3 go to all 3 servers.
>>
>> Then at this time, node C joins, bootstraps, and gets the sstables from
>> B. But on B, ***right after memtableswitch()***, updates X=4,5,6 arrive
>> and update the new memtable (the same updates also go to A and D). Then
>> B continues to stream to C, and C gets its state to X=3.
>>
>
> You appear to be missing the point in my original mail : the memtable
> switch is irrelevant, because C is receiving the same writes into memtables
> that B is.
>
> They're not counted for the purposes of consistency, but they are
> otherwise received just as if C were an actual replica.
>
> Bootstrapping is two parts :
>
> 1) streaming of sstables
> 2) "extra" replication
>
> Your mental model appears to ignore 2), which is why you care about what
> was flushed. Perhaps I am still misunderstanding the scenario you are
> describing?
>
> =Rob
>
>

Re: question about bootstrapping sequence

Posted by Robert Coli <rc...@eventbrite.com>.

On Tue, Jul 28, 2015 at 1:01 PM, Yang <te...@gmail.com> wrote:

> Thanks, but I don't think having more nodes in the example changes the
> issue I outlined.
>
> Say you have just key "X", rf = 3, and nodes A, B, D are responsible
> for "X".
>
> In stable mode, the updates X=1, 2, 3 go to all 3 servers.
>
> Then at this time, node C joins, bootstraps, and gets the sstables from B.
> But on B, ***right after memtableswitch()***, updates X=4,5,6 arrive and
> update the new memtable (the same updates also go to A and D). Then B
> continues to stream to C, and C gets its state to X=3.
>

You appear to be missing the point in my original mail : the memtable
switch is irrelevant, because C is receiving the same writes into memtables
that B is.

They're not counted for the purposes of consistency, but they are otherwise
received just as if C were an actual replica.

Bootstrapping is two parts :

1) streaming of sstables
2) "extra" replication

Your mental model appears to ignore 2), which is why you care about what was
flushed. Perhaps I am still misunderstanding the scenario you are describing?

=Rob
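
The two parts described above can be put together in a toy model. This is a
sketch only, not Cassandra's implementation: the `Replica` class, its
methods, and the integer timestamps are hypothetical, and newest-timestamp-wins
is assumed when a node merges its memtable with its sstables.

```python
# Toy model of "streaming of sstables" plus "extra" replication.
# NOT Cassandra code; all names here are illustrative assumptions.

class Replica:
    def __init__(self, name):
        self.name = name
        self.memtable = {}   # key -> (timestamp, value)
        self.sstables = []   # flushed tables with the same shape

    def write(self, key, ts, value):
        current = self.memtable.get(key)
        if current is None or ts > current[0]:
            self.memtable[key] = (ts, value)

    def flush(self):
        # Freeze the memtable into an sstable and start a fresh memtable.
        sstable = dict(self.memtable)
        self.sstables.append(sstable)
        self.memtable = {}
        return sstable

    def read(self, key):
        # Newest timestamp across memtable and sstables wins.
        tables = [self.memtable, *self.sstables]
        candidates = [t[key] for t in tables if key in t]
        return max(candidates)[1] if candidates else None


b, c = Replica("B"), Replica("C")

for ts in (1, 2, 3):
    b.write("X", ts, ts)          # before C joins, C sees nothing

streamed = b.flush()              # part 1 begins: B flushes and streams

for ts in (4, 5, 6):              # part 2: writes go to B *and* C meanwhile
    b.write("X", ts, ts)
    c.write("X", ts, ts)

c.sstables.append(streamed)       # part 1 ends: C holds the streamed sstable

print(b.read("X"), c.read("X"))   # prints "6 6"
```

In this model the streamed sstable only carries X=3, but C's own memtable
already holds X=6, so the timing of the flush on B does not matter.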

Re: question about bootstrapping sequence

Posted by Yang <te...@gmail.com>.

Thanks, but I don't think having more nodes in the example changes the
issue I outlined.

Say you have just key "X", rf = 3, and nodes A, B, D are responsible
for "X".

In stable mode, the updates X=1, 2, 3 go to all 3 servers.

Then at this time, node C joins, bootstraps, and gets the sstables from B.
But on B, ***right after memtableswitch()***, updates X=4,5,6 arrive and
update the new memtable (the same updates also go to A and D). Then B
continues to stream to C, and C gets its state to X=3.

Now node C declares itself ready, and D gives up ownership of key "X". But
now the state of C differs from that of A and B.

On Tue, Jul 28, 2015 at 12:40 PM, Robert Coli <rc...@eventbrite.com> wrote:

> On Tue, Jul 28, 2015 at 1:31 AM, Yang <te...@gmail.com> wrote:
>
>> I'm wondering how the Cassandra protocol brings a newly bootstrapped node
>> "up to speed".
>>
>
> Bootstrapping nodes get "extra" replicated copies of data for the range
> they are joining.
>
> So if before the bootstrap the nodes responsible for Key "X" are :
>
> A B D
>
> and you add node C "between" B and D which takes over a sub-set of their
> replicas, writes go to the set A,B,C,D for the duration.
>
> =Rob
>
>

Re: question about bootstrapping sequence

Posted by Robert Coli <rc...@eventbrite.com>.

On Tue, Jul 28, 2015 at 1:31 AM, Yang <te...@gmail.com> wrote:

> I'm wondering how the Cassandra protocol brings a newly bootstrapped node
> "up to speed".
>

Bootstrapping nodes get "extra" replicated copies of data for the range
they are joining.

So if before the bootstrap the nodes responsible for Key "X" are :

A B D

and you add node C "between" B and D which takes over a sub-set of their
replicas, writes go to the set A,B,C,D for the duration.

=Rob
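
As a rough illustration of the write set described above, here is a minimal
sketch. The function name `write_targets` and its arguments are hypothetical
and do not come from Cassandra's code; the sketch only models "current
replicas plus any node bootstrapping into the range".

```python
def write_targets(natural_replicas, pending_nodes):
    # Writes go to the current replicas plus any node that is still
    # bootstrapping into the range; duplicates are dropped.
    return sorted(set(natural_replicas) | set(pending_nodes))


before = write_targets(["A", "B", "D"], [])
during = write_targets(["A", "B", "D"], ["C"])

print(before)  # prints ['A', 'B', 'D']
print(during)  # prints ['A', 'B', 'C', 'D']
```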