New token allocation and adding a new DC

Posted to user@cassandra.apache.org by Oleksandr Shulgin <ol...@zalando.de> on 2018/01/16 14:51:07 UTC

Hello,

We want to add a new rack to an existing cluster (a new Availability Zone
on AWS).

Currently we have 12 nodes in 2 racks with ~4 TB of data per node.  We also
want to move to a bigger number of smaller nodes.  In order to minimize
streaming, we want to add a new DC which will span 3 racks and then
decommission the old DC.

Following the documented procedure we are going to create all nodes in the
new DC with auto_bootstrap=false and a distinct dc_suffix.  Then we are
going to run `nodetool rebuild OLD_DC` on every node.
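
For reference, a sketch of the per-node settings we have in mind (the
dc_suffix value here is a placeholder; with the EC2 snitches it lives in
cassandra-rackdc.properties):

    # cassandra.yaml, new-DC nodes only
    auto_bootstrap: false

    # cassandra-rackdc.properties
    dc_suffix: _new

    # once all new-DC nodes are up, on each of them:
    nodetool rebuild OLD_DC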

Since we are observing some uneven load distribution in the old DC, we
wanted to make use of the new token allocation algorithm of Cassandra 3.0+ when
building the new DC.

To our understanding, this is currently not supported, because the new
algorithm can only be used during a proper node bootstrap?

In theory it should still be possible to allocate tokens in the new DC by
telling Cassandra which keyspace to optimize for and from which remote DC
the data will ultimately be streamed, or am I missing something?
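
The keyspace part already exists as the allocate_tokens_for_keyspace option
in cassandra.yaml (the keyspace name below is ours, just as an example):

    allocate_tokens_for_keyspace: data_ks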

Reading through the original implementation ticket I didn't find any
reference to interaction with rebuild:
https://issues.apache.org/jira/browse/CASSANDRA-7032
Nor did I find any open tickets discussing the topic.

Is it reasonable to open an issue for that or is there some obvious blocker?

Thanks,
-- 
Alex

Re: New token allocation and adding a new DC

Posted by Alexander Dejanovski <al...@thelastpickle.com>.
Well, that's a shame...

That part of the code has been changed in trunk, and it now uses
BootStrapper.getBootstrapTokens() instead of getRandomToken() when auto
bootstrap is disabled:
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/StorageService.java#L938

I was hoping this would already be the case in 3.0.x/3.11.x :(
Maybe that change should be backported to 3.11.x?

It doesn't seem like a big change actually (I could be wrong though,
Cassandra is a complex beast...) and your use case doesn't seem to be that
exotic.
One would expect that a new DC can be created with balanced ownership,
which is obviously not the case.


On Wed, Jan 17, 2018 at 6:27 PM Oleksandr Shulgin <
oleksandr.shulgin@zalando.de> wrote:

> On Wed, Jan 17, 2018 at 4:21 AM, kurt greaves <ku...@instaclustr.com>
> wrote:
>
>> I believe you are able to get away with just altering the keyspace to
>> include both DCs even before the DC exists, and then adding your nodes to
>> that new DC using the algorithm. Note you'll probably want to take the
>> opportunity to reduce the number of vnodes to something reasonable. Based
>> on memory from previous testing, you can get a good token balance with 16
>> vnodes if you have at least 6 nodes per rack (with RF=3 and 3 racks).
>>
>
> Alexander, Kurt,
>
> Thank you for the suggestions.
>
> Neither of them worked in the end, unfortunately:
>
> 1. Using auto_bootstrap=false always results in random token allocation,
> ignoring the allocate_tokens_for_keyspace option.
>
> The token allocation option is only considered if shouldBootstrap()
> returns true:
>
> https://github.com/apache/cassandra/blob/cassandra-3.0.15/src/java/org/apache/cassandra/service/StorageService.java#L790
> if (shouldBootstrap()) {
>
> https://github.com/apache/cassandra/blob/cassandra-3.0.15/src/java/org/apache/cassandra/service/StorageService.java#L842
>   BootStrapper.getBootstrapTokens()  (the only place in code using the
> token allocation option)
>
> https://github.com/apache/cassandra/blob/cassandra-3.0.15/src/java/org/apache/cassandra/service/StorageService.java#L901
> else { ...
>
> 2. Using auto_bootstrap=true and allocate_tokens_for_keyspace=data_ks
> gives us balanced range ownership on the new empty DC.  The problem, though,
> is that rebuilding of an already bootstrapped node doesn't work: the node
> believes that it already has all the data.
>
> We are going to proceed by manually assigning a small number of tokens to
> the nodes in the new DC with auto_bootstrap=false and only use the automatic
> token allocation when we need to scale it out.  This seems to be the only
> supported way to use it anyway.
>
> Regards,
> --
> Alex
>
>

-- 
-----------------
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

Re: New token allocation and adding a new DC

Posted by Dikang Gu <di...@gmail.com>.
I fixed the new allocation algorithm in the non-bootstrap case
(https://issues.apache.org/jira/browse/CASSANDRA-13080?filter=-2); the fix
is in 3.12+, but not in 3.0.


On Wed, Jan 24, 2018 at 9:32 AM, Oleksandr Shulgin <
oleksandr.shulgin@zalando.de> wrote:

> On Thu, Jan 18, 2018 at 5:19 AM, kurt greaves <ku...@instaclustr.com>
> wrote:
>
>> Didn't know that about auto_bootstrap and the algorithm. We should
>> probably fix that. Can you create a JIRA for that issue?
>>
>
> Will do.
>
>
>> Workaround for #2 would be to truncate system.available_ranges after
>> "bootstrap".
>>
>
> Thanks, that seems to help.
>
> Initially we could not log in with cqlsh to run the truncate command on
> such a "bootstrapped" node.  But with the help of yet another workaround,
> namely pulling in the roles data by repairing the system_auth keyspace
> only, it seems to be possible.  At least we see that netstats reports the
> ongoing streaming operations this time.
>
> --
> Alex
>
>


-- 
Dikang

Re: New token allocation and adding a new DC

Posted by Oleksandr Shulgin <ol...@zalando.de>.
On Thu, Jan 18, 2018 at 5:19 AM, kurt greaves <ku...@instaclustr.com> wrote:

> Didn't know that about auto_bootstrap and the algorithm. We should
> probably fix that. Can you create a JIRA for that issue?
>

Will do.


> Workaround for #2 would be to truncate system.available_ranges after
> "bootstrap".
>

Thanks, that seems to help.

Initially we could not log in with cqlsh to run the truncate command on such
a "bootstrapped" node.  But with the help of yet another workaround, namely
pulling in the roles data by repairing the system_auth keyspace only, it
seems to be possible.  At least we see that netstats reports the ongoing
streaming operations this time.
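
For the record, the sequence that appears to work for us is roughly the
following (credentials are placeholders):

    # pull in the roles data so that cqlsh login succeeds on the new node
    nodetool repair system_auth

    # make the node forget it already "owns" its local ranges
    cqlsh -u cassandra -p <password> -e "TRUNCATE system.available_ranges;"

    # now streaming actually happens
    nodetool rebuild OLD_DC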

--
Alex

Re: New token allocation and adding a new DC

Posted by kurt greaves <ku...@instaclustr.com>.
Didn't know that about auto_bootstrap and the algorithm. We should probably
fix that. Can you create a JIRA for that issue? Workaround for #2 would be
to truncate system.available_ranges after "bootstrap".

On 17 January 2018 at 17:26, Oleksandr Shulgin <oleksandr.shulgin@zalando.de
> wrote:

> On Wed, Jan 17, 2018 at 4:21 AM, kurt greaves <ku...@instaclustr.com>
> wrote:
>
>> I believe you are able to get away with just altering the keyspace to
>> include both DCs even before the DC exists, and then adding your nodes to
>> that new DC using the algorithm. Note you'll probably want to take the
>> opportunity to reduce the number of vnodes to something reasonable. Based
>> on memory from previous testing, you can get a good token balance with 16
>> vnodes if you have at least 6 nodes per rack (with RF=3 and 3 racks).
>>
>
> Alexander, Kurt,
>
> Thank you for the suggestions.
>
> Neither of them worked in the end, unfortunately:
>
> 1. Using auto_bootstrap=false always results in random token allocation,
> ignoring the allocate_tokens_for_keyspace option.
>
> The token allocation option is only considered if shouldBootstrap()
> returns true:
> https://github.com/apache/cassandra/blob/cassandra-3.0.15/src/java/org/apache/cassandra/service/StorageService.java#L790
> if (shouldBootstrap()) {
> https://github.com/apache/cassandra/blob/cassandra-3.0.15/src/java/org/apache/cassandra/service/StorageService.java#L842
>   BootStrapper.getBootstrapTokens()  (the only place in code using the
> token allocation option)
> https://github.com/apache/cassandra/blob/cassandra-3.0.15/src/java/org/apache/cassandra/service/StorageService.java#L901
> else { ...
>
> 2. Using auto_bootstrap=true and allocate_tokens_for_keyspace=data_ks
> gives us balanced range ownership on the new empty DC.  The problem, though,
> is that rebuilding of an already bootstrapped node doesn't work: the node
> believes that it already has all the data.
>
> We are going to proceed by manually assigning a small number of tokens to
> the nodes in the new DC with auto_bootstrap=false and only use the automatic
> token allocation when we need to scale it out.  This seems to be the only
> supported way to use it anyway.
>
> Regards,
> --
> Alex
>
>

Re: New token allocation and adding a new DC

Posted by Oleksandr Shulgin <ol...@zalando.de>.
On Wed, Jan 17, 2018 at 4:21 AM, kurt greaves <ku...@instaclustr.com> wrote:

> I believe you are able to get away with just altering the keyspace to
> include both DCs even before the DC exists, and then adding your nodes to
> that new DC using the algorithm. Note you'll probably want to take the
> opportunity to reduce the number of vnodes to something reasonable. Based
> on memory from previous testing, you can get a good token balance with 16
> vnodes if you have at least 6 nodes per rack (with RF=3 and 3 racks).
>

Alexander, Kurt,

Thank you for the suggestions.

Neither of them worked in the end, unfortunately:

1. Using auto_bootstrap=false always results in random token allocation,
ignoring the allocate_tokens_for_keyspace option.

The token allocation option is only considered if shouldBootstrap() returns
true:
https://github.com/apache/cassandra/blob/cassandra-3.0.15/src/java/org/apache/cassandra/service/StorageService.java#L790
if (shouldBootstrap()) {
https://github.com/apache/cassandra/blob/cassandra-3.0.15/src/java/org/apache/cassandra/service/StorageService.java#L842
  BootStrapper.getBootstrapTokens()  (the only place in code using the
token allocation option)
https://github.com/apache/cassandra/blob/cassandra-3.0.15/src/java/org/apache/cassandra/service/StorageService.java#L901
else { ...

2. Using auto_bootstrap=true and allocate_tokens_for_keyspace=data_ks gives
us balanced range ownership on the new empty DC.  The problem, though, is
that rebuilding of an already bootstrapped node doesn't work: the node
believes that it already has all the data.

We are going to proceed by manually assigning a small number of tokens to
the nodes in the new DC with auto_bootstrap=false and only use the automatic
token allocation when we need to scale it out.  This seems to be the only
supported way to use it anyway.
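
For the initial nodes, evenly spaced tokens can be precomputed with a small
script, e.g. this Python sketch we put together (Murmur3Partitioner assumed;
it prints one initial_token line per node):

    # Evenly spaced Murmur3 tokens: num_nodes nodes, tokens_per_node each.
    # Tokens are interleaved so every node's tokens are spread over the ring.
    num_nodes, tokens_per_node = 3, 4
    total = num_nodes * tokens_per_node
    for n in range(num_nodes):
        tokens = [(k * 2**64) // total - 2**63
                  for k in range(n, total, num_nodes)]
        print('node%d: initial_token: %s' % (n + 1, ','.join(map(str, tokens))))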

Regards,
--
Alex

Re: New token allocation and adding a new DC

Posted by kurt greaves <ku...@instaclustr.com>.
I believe you are able to get away with just altering the keyspace to
include both DCs even before the DC exists, and then adding your nodes to
that new DC using the algorithm. Note you'll probably want to take the
opportunity to reduce the number of vnodes to something reasonable. Based
on memory from previous testing, you can get a good token balance with 16
vnodes if you have at least 6 nodes per rack (with RF=3 and 3 racks).
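
In cassandra.yaml terms that would be something like (keyspace name is an
example, adjust for your cluster):

    num_tokens: 16
    allocate_tokens_for_keyspace: data_ks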


On 16 January 2018 at 16:02, Oleksandr Shulgin <oleksandr.shulgin@zalando.de
> wrote:

> On Tue, Jan 16, 2018 at 4:16 PM, Alexander Dejanovski <
> alex@thelastpickle.com> wrote:
>
>> Hi Oleksandr,
>>
>> If bootstrap is disabled, it will only skip the streaming phase but will
>> still go through token allocation and thus should use the new algorithm.
>> The algorithm won't try to spread data based on size on disk but it will
>> try to spread token ownership as evenly as possible.
>>
>> The problem you'll run into is that ownership for a specific keyspace
>> will be null as long as the replication strategy isn't updated to create
>> replicas on the new DC.
>> Thinking quickly, I would do the following:
>>
>>    - Create enough nodes in the new DC to match the target replication
>>    factor
>>    - Alter the replication strategy to add the target number of replicas
>>    in the new DC (they will start getting writes, and hopefully you've already
>>    segregated reads)
>>    - Continue adding nodes in the new DC (with auto_bootstrap = false),
>>    specifying the right keyspace to optimize token allocations
>>    - Run rebuild on all nodes in the new DC
>>
>> I honestly never used it, but that's my understanding of how it should
>> work.
>>
>
> Oh, that's neat.  We will try this and see if it helps.
>
> Thank you!
> --
> Alex
>
>

Re: New token allocation and adding a new DC

Posted by Oleksandr Shulgin <ol...@zalando.de>.
On Tue, Jan 16, 2018 at 4:16 PM, Alexander Dejanovski <
alex@thelastpickle.com> wrote:

> Hi Oleksandr,
>
> If bootstrap is disabled, it will only skip the streaming phase but will
> still go through token allocation and thus should use the new algorithm.
> The algorithm won't try to spread data based on size on disk but it will
> try to spread token ownership as evenly as possible.
>
> The problem you'll run into is that ownership for a specific keyspace will
> be null as long as the replication strategy isn't updated to create
> replicas on the new DC.
> Thinking quickly, I would do the following:
>
>    - Create enough nodes in the new DC to match the target replication
>    factor
>    - Alter the replication strategy to add the target number of replicas
>    in the new DC (they will start getting writes, and hopefully you've already
>    segregated reads)
>    - Continue adding nodes in the new DC (with auto_bootstrap = false),
>    specifying the right keyspace to optimize token allocations
>    - Run rebuild on all nodes in the new DC
>
> I honestly never used it, but that's my understanding of how it should work.
>

Oh, that's neat.  We will try this and see if it helps.

Thank you!
--
Alex

Re: New token allocation and adding a new DC

Posted by Alexander Dejanovski <al...@thelastpickle.com>.
Hi Oleksandr,

If bootstrap is disabled, it will only skip the streaming phase but will
still go through token allocation and thus should use the new algorithm.
The algorithm won't try to spread data based on size on disk but it will
try to spread token ownership as evenly as possible.

The problem you'll run into is that ownership for a specific keyspace will
be null as long as the replication strategy isn't updated to create
replicas on the new DC.
Thinking quickly, I would do the following:

   - Create enough nodes in the new DC to match the target replication
   factor
   - Alter the replication strategy to add the target number of replicas in
   the new DC (they will start getting writes, and hopefully you've already
   segregated reads)
   - Continue adding nodes in the new DC (with auto_bootstrap = false),
   specifying the right keyspace to optimize token allocations
   - Run rebuild on all nodes in the new DC

I honestly never used it, but that's my understanding of how it should work.
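
For the "alter the replication strategy" step, something along these lines
(keyspace and DC names are placeholders; keep the old DC's replicas in place
until you are ready to switch over):

    ALTER KEYSPACE data_ks WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'OLD_DC': 3,
        'NEW_DC': 3
    };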

Cheers,


On Tue, Jan 16, 2018 at 3:51 PM Oleksandr Shulgin <
oleksandr.shulgin@zalando.de> wrote:

> Hello,
>
> We want to add a new rack to an existing cluster (a new Availability Zone
> on AWS).
>
> Currently we have 12 nodes in 2 racks with ~4 TB of data per node.  We also
> want to move to a bigger number of smaller nodes.  In order to minimize
> streaming, we want to add a new DC which will span 3 racks and then
> decommission the old DC.
>
> Following the documented procedure we are going to create all nodes in the
> new DC with auto_bootstrap=false and a distinct dc_suffix.  Then we are
> going to run `nodetool rebuild OLD_DC` on every node.
>
> Since we are observing some uneven load distribution in the old DC, we
> wanted to make use of the new token allocation algorithm of Cassandra 3.0+ when
> building the new DC.
>
> To our understanding, this is currently not supported, because the new
> algorithm can only be used during a proper node bootstrap?
>
> In theory it should still be possible to allocate tokens in the new DC by
> telling Cassandra which keyspace to optimize for and from which remote DC
> the data will ultimately be streamed, or am I missing something?
>
> Reading through the original implementation ticket I didn't find any
> reference to interaction with rebuild:
> https://issues.apache.org/jira/browse/CASSANDRA-7032
> Nor did I find any open tickets discussing the topic.
>
> Is it reasonable to open an issue for that or is there some obvious
> blocker?
>
> Thanks,
> --
> Alex
>
>

-- 
-----------------
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com