Posted to user@ratis.apache.org by Riguz Lee <dr...@riguz.com> on 2022/07/01 07:30:43 UTC

How to correctly scale the raft cluster?

Hi,




I'm testing scaling the raft cluster up and down, but Ratis is not working as expected in the new cluster. My steps are:

* Initialize a cluster with 5 nodes; the size and peers of the cluster are configured in a configuration file, let's say N=5. The cluster works perfectly and the raft logs are synchronized across the cluster.

* Start 6 new nodes with a new configuration N=11, while keeping the previous nodes running

* Recreate the previous nodes with N=11 one by one




According to the raft paper, raft should be able to handle configuration changes by design, but after the above steps, what I've found is:

- New nodes are not able to join the cluster

- Old nodes still report a group size of 5 (checked via client.getGroupManagementApi(peerId).info(groupId); a rough sketch of the check follows below)
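For reference, the check looks roughly like this sketch (the group id and peer addresses are placeholders, not my real setup):

import java.util.UUID;

import org.apache.ratis.client.RaftClient;
import org.apache.ratis.conf.RaftProperties;
import org.apache.ratis.protocol.GroupInfoReply;
import org.apache.ratis.protocol.RaftGroup;
import org.apache.ratis.protocol.RaftGroupId;
import org.apache.ratis.protocol.RaftPeer;
import org.apache.ratis.protocol.RaftPeerId;

public final class CheckGroupSize {
  public static void main(String[] args) throws Exception {
    // Placeholder group id and peers; use the values of the running cluster.
    final RaftGroupId groupId =
        RaftGroupId.valueOf(UUID.fromString("02511d47-d67c-49a3-9011-abb3109a44c1"));
    final RaftGroup group = RaftGroup.valueOf(groupId,
        RaftPeer.newBuilder().setId("n0").setAddress("host0:9872").build(),
        RaftPeer.newBuilder().setId("n1").setAddress("host1:9872").build());

    try (RaftClient client = RaftClient.newBuilder()
        .setProperties(new RaftProperties())
        .setRaftGroup(group)
        .build()) {
      // Ask server n0 which peers it currently has in its configuration.
      final GroupInfoReply info =
          client.getGroupManagementApi(RaftPeerId.valueOf("n0")).info(groupId);
      System.out.println("peers seen by n0: " + info.getGroup().getPeers());
    }
  }
}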




So how should I scale the cluster correctly? A few thoughts of mine:




* Definitely the old cluster should not be stopped while starting the new nodes, otherwise the new nodes might be able to elect a new leader (e.g. N=11 with 6 new nodes, where 6 is a majority) and the raft logs on the old nodes would be overridden.

* It's possible to update the old configuration first by using client.admin().setConfiguration(), let's say set N=11 first, then start the new nodes. However, since the 5 old nodes are fewer than a majority of 11 (which needs 6), the cluster won't be able to elect a leader until at least one new node joins.

* Or maybe we should limit the step size when scaling, going from N=5 -> N=7 -> N=9 -> N=11?




Thanks,

Riguz Lee

Re: How to correctly scale the raft cluster?

Posted by tison <wa...@gmail.com>.
Thanks for your explanation. With a closer look I can see that after the
new server installs the snapshot, it will call the state machine's
`reinitialize` method and recover the state.

Best,
tison.



Re: How to correctly scale the raft cluster?

Posted by Tsz Wo Sze <sz...@gmail.com>.
Hi tison,

> When a new node joins an existing cluster in this case, how exactly does
> it apply logs? Will it apply the raft logs from the beginning, one by one?

The leader takes a snapshot from time to time and then purges the old log
entries.  When a new server joins the group, the leader will first send a
snapshot, if there is any, and then send the remaining log entries.
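
For reference, the state machine side of this roughly follows the pattern in the Ratis examples (BaseStateMachine plus SimpleStateMachineStorage); below is a minimal sketch, where the counter is just a stand-in for real state and the details may differ across versions:

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;

import org.apache.ratis.protocol.RaftGroupId;
import org.apache.ratis.server.RaftServer;
import org.apache.ratis.server.protocol.TermIndex;
import org.apache.ratis.server.storage.RaftStorage;
import org.apache.ratis.statemachine.impl.BaseStateMachine;
import org.apache.ratis.statemachine.impl.SimpleStateMachineStorage;
import org.apache.ratis.statemachine.impl.SingleFileSnapshotInfo;

public class SnapshotAwareStateMachine extends BaseStateMachine {
  private final SimpleStateMachineStorage storage = new SimpleStateMachineStorage();
  private final AtomicLong counter = new AtomicLong();  // placeholder state

  @Override
  public void initialize(RaftServer server, RaftGroupId groupId, RaftStorage raftStorage)
      throws IOException {
    super.initialize(server, groupId, raftStorage);
    storage.init(raftStorage);
    loadLatestSnapshot();
  }

  // Called after this server has installed a snapshot received from the leader:
  // reload the state from that snapshot before applying the remaining log entries.
  @Override
  public void reinitialize() throws IOException {
    loadLatestSnapshot();
  }

  // Called on the leader from time to time; the file written here is what gets
  // sent to a newly added server that is behind the purged part of the log.
  @Override
  public long takeSnapshot() throws IOException {
    final TermIndex last = getLastAppliedTermIndex();
    final File snapshotFile = storage.getSnapshotFile(last.getTerm(), last.getIndex());
    try (DataOutputStream out = new DataOutputStream(new FileOutputStream(snapshotFile))) {
      out.writeLong(counter.get());
    }
    return last.getIndex();
  }

  private void loadLatestSnapshot() throws IOException {
    final SingleFileSnapshotInfo snapshot = storage.getLatestSnapshot();
    if (snapshot == null) {
      return;  // no snapshot yet, start from an empty state
    }
    final File snapshotFile = snapshot.getFile().getPath().toFile();
    try (DataInputStream in = new DataInputStream(new FileInputStream(snapshotFile))) {
      counter.set(in.readLong());
      // a real implementation would also restore the last applied term/index here
    }
  }
}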

Tsz-Wo



Re: How to correctly scale the raft cluster?

Posted by tison <wa...@gmail.com>.
Hi Tsz-Wo,

I have one question that may be a bit off topic:

When a new node joins an existing cluster in this case, how exactly does it
apply logs? Will it apply the raft logs from the beginning, one by one? In
this case, TiKV sends a snapshot of the state machine from the leader to the
new node to speed up the catch-up process, and I'd like to know how Ratis
implements catch-up.

Best,
tison.



Re: How to correctly scale the raft cluster?

Posted by Tsz Wo Sze <sz...@gmail.com>.
> * It's possible to update the old configuration first by using
> client.admin().setConfiguration(), let's say set N=11 first, then start
> the new nodes. However, since the 5 old nodes are fewer than a majority
> of 11 (which needs 6), the cluster won't be able to elect a leader until
> at least one new node joins.
Yes, you are right.  Also, even if one node has joined, the group has to
wait for it to catch up with the previous log entries in order to obtain a
majority for committing new entries.

> * Or maybe we should limit the step size when scaling, going from N=5 ->
> N=7 -> N=9 -> N=11?

We may start the 6 new nodes as listeners first.  Listeners receive log
entries but they are not voting members, so they are not counted towards
the majority.  When the listeners catch up, we may change them to normal
nodes so that they become voting members.
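
A rough sketch of that flow follows (peer ids, addresses and the group id are placeholders; it also assumes a Ratis version whose AdminApi offers a setConfiguration overload taking a separate listener list, so please check AdminApi in your version):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.UUID;

import org.apache.ratis.client.RaftClient;
import org.apache.ratis.conf.RaftProperties;
import org.apache.ratis.protocol.RaftGroup;
import org.apache.ratis.protocol.RaftGroupId;
import org.apache.ratis.protocol.RaftPeer;

public final class ScaleOutWithListeners {
  static RaftPeer peer(String id, String address) {
    return RaftPeer.newBuilder().setId(id).setAddress(address).build();
  }

  public static void main(String[] args) throws Exception {
    // Placeholder membership: the 5 existing voting members and the 6 new nodes.
    final List<RaftPeer> oldPeers = Arrays.asList(
        peer("n0", "host0:9872"), peer("n1", "host1:9872"), peer("n2", "host2:9872"),
        peer("n3", "host3:9872"), peer("n4", "host4:9872"));
    final List<RaftPeer> newPeers = Arrays.asList(
        peer("n5", "host5:9872"), peer("n6", "host6:9872"), peer("n7", "host7:9872"),
        peer("n8", "host8:9872"), peer("n9", "host9:9872"), peer("n10", "host10:9872"));

    final RaftGroupId groupId =
        RaftGroupId.valueOf(UUID.fromString("02511d47-d67c-49a3-9011-abb3109a44c1"));
    try (RaftClient client = RaftClient.newBuilder()
        .setProperties(new RaftProperties())
        .setRaftGroup(RaftGroup.valueOf(groupId, oldPeers))
        .build()) {
      // Step 1: add the 6 new nodes as listeners. They replicate the log but do
      // not vote, so the old 5-node majority keeps the group available.
      // (Assumes the two-list setConfiguration overload exists in this version.)
      client.admin().setConfiguration(oldPeers, newPeers);

      // ... wait for the listeners to catch up with the leader ...

      // Step 2: promote everyone to a voting member.
      final List<RaftPeer> allPeers = new ArrayList<>(oldPeers);
      allPeers.addAll(newPeers);
      client.admin().setConfiguration(allPeers);
    }
  }
}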

Tsz-Wo



Re: How to correctly scale the raft cluster?

Posted by Tsz Wo Sze <sz...@gmail.com>.
Hi Riguz,

> Start 6 new nodes with a new configuration N=11, while keeping the
> previous nodes running

This step probably won't work as expected since it creates a new group
instead of adding nodes to the original group.  We must use the
setConfiguration API to change the configuration (add/remove nodes); see
https://github.com/apache/ratis/blob/bd83e7d7fd41540c8bda6bd92a52ac99ccec2076/ratis-client/src/main/java/org/apache/ratis/client/api/AdminApi.java#L35
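
For illustration, a minimal sketch of calling it (group id, peer ids and addresses are placeholders; the new servers are assumed to be already running with the same group id):

import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

import org.apache.ratis.client.RaftClient;
import org.apache.ratis.conf.RaftProperties;
import org.apache.ratis.protocol.RaftClientReply;
import org.apache.ratis.protocol.RaftGroup;
import org.apache.ratis.protocol.RaftGroupId;
import org.apache.ratis.protocol.RaftPeer;

public final class AddPeers {
  public static void main(String[] args) throws Exception {
    // Placeholders: "n0".."n4" are the existing members, "n5".."n10" the new ones.
    final List<RaftPeer> oldConf = new ArrayList<>();
    final List<RaftPeer> newConf = new ArrayList<>();
    for (int i = 0; i < 11; i++) {
      final RaftPeer p = RaftPeer.newBuilder()
          .setId("n" + i).setAddress("host" + i + ":9872").build();
      if (i < 5) {
        oldConf.add(p);
      }
      newConf.add(p);
    }

    final RaftGroupId groupId =
        RaftGroupId.valueOf(UUID.fromString("02511d47-d67c-49a3-9011-abb3109a44c1"));
    // The client talks to the original group; the leader then replicates the
    // new configuration and the log to the newly added servers.
    try (RaftClient client = RaftClient.newBuilder()
        .setProperties(new RaftProperties())
        .setRaftGroup(RaftGroup.valueOf(groupId, oldConf))
        .build()) {
      final RaftClientReply reply = client.admin().setConfiguration(newConf);
      System.out.println("setConfiguration succeeded? " + reply.isSuccess());
    }
  }
}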

Hope it helps.  Thanks a lot for trying Ratis!

Tsz-Wo
