Posted to user@ratis.apache.org by 宋子阳 <sz...@163.com> on 2022/04/07 09:10:46 UTC

Live lock issues on installSnapshot & setConfiguration

Hi,
I'm currently integrating Ratis as the consensus backbone of Apache IoTDB, and I've run into a strange situation that drives the system into a livelock:

My original configuration contains a single group (1) with a single member (1), which is necessarily the leader. Now I want to add a new member (follower 2) to this group, and I implemented it as follows:

Client.getGroupManaApi(2).add(group(1));
Client.admin().setConfiguration([1,2]);

Then I observed the following event sequence, which causes the livelock:
1. addGroup succeeds and follower 2's lifecycle is STARTING.
2. Leader 1 sends the latest snapshot to follower 2, which contains the **old conf [1]**.
3. Follower 2 successfully installs the snapshot, discovers that it is excluded from the conf, and turns its lifecycle to CLOSE.
4. Leader 1 receives the installSnapshot reply, adds the new conf [1,2] to the log and applies this conf.
5. Since follower 2 is closed, leader 1 steps down to follower because of LOST_MAJORITY_HEARTBEATS, and this group can no longer serve requests.
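
Step 3 of the sequence above can be modelled with a small, self-contained Java sketch. It does not use Ratis: the class, enum and method names below are hypothetical stand-ins, and the only behaviour taken from the observations above is that a follower closes when the conf carried by the snapshot does not list it. The assumption that it otherwise moves to a running state is mine.

```java
import java.util.List;

public class SnapshotConfSketch {
    // Hypothetical lifecycle states for this sketch; not the Ratis LifeCycle class.
    enum LifeCycle { STARTING, RUNNING, CLOSED }

    /** Hypothetical stand-in for a follower; not a Ratis class. */
    static final class Follower {
        final int id;
        LifeCycle state = LifeCycle.STARTING;

        Follower(int id) {
            this.id = id;
        }

        /** Installs a snapshot and closes if the snapshot's conf excludes this peer. */
        void installSnapshot(List<Integer> confInSnapshot) {
            // Assumption: a peer that finds itself in the conf simply keeps running.
            state = confInSnapshot.contains(id) ? LifeCycle.RUNNING : LifeCycle.CLOSED;
        }
    }

    /** Returns the follower's lifecycle after installing a snapshot carrying the given conf. */
    static LifeCycle stateAfterSnapshot(int followerId, List<Integer> confInSnapshot) {
        Follower follower = new Follower(followerId);
        follower.installSnapshot(confInSnapshot);
        return follower.state;
    }

    public static void main(String[] args) {
        // Snapshot still carries the old conf [1]: follower 2 closes itself.
        System.out.println(stateAfterSnapshot(2, List.of(1)));    // prints CLOSED
        // Snapshot carries a conf that lists follower 2: it stays up.
        System.out.println(stateAfterSnapshot(2, List.of(1, 2))); // prints RUNNING
    }
}
```

The point of the sketch is only that the follower's decision depends on the conf it sees in the snapshot, not on the conf the leader commits afterwards.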

Am I using the groupManagementApi or adminApi incorrectly? How can I solve this problem?


William Song
Apache IoTDB

Re: Live lock issues on installSnapshot & setConfiguration

Posted by Tsz Wo Sze <sz...@gmail.com>.
Have you tried to start Follower 2 again so that it might be able to join
the group?

BTW, we should turn off the LOST_MAJORITY_HEARTBEATS feature for a single
server raft group.  Since there is only one server, it will "lose majority"
as soon as one or more servers have been added to the group.
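
To make the "lose majority" arithmetic concrete, here is a minimal sketch using the usual simple-majority rule (more than half of the group). It is not the actual Ratis heartbeat check; `MajoritySketch`, `majority` and `leaderHasMajority` are hypothetical names used only for illustration.

```java
public class MajoritySketch {
    /** Smallest number of peers that forms a simple majority of a group of the given size. */
    static int majority(int groupSize) {
        return groupSize / 2 + 1;
    }

    /** The leader counts itself plus every follower that still answers heartbeats. */
    static boolean leaderHasMajority(int groupSize, int respondingFollowers) {
        return 1 + respondingFollowers >= majority(groupSize);
    }

    public static void main(String[] args) {
        // Single-server group: the leader alone is a majority.
        System.out.println(leaderHasMajority(1, 0)); // prints true
        // Conf becomes [1,2] but the new follower is not responding: majority is lost.
        System.out.println(leaderHasMajority(2, 0)); // prints false
        // Conf is [1,2] and the follower responds: majority is kept.
        System.out.println(leaderHasMajority(2, 1)); // prints true
    }
}
```

With a conf of [1,2], the leader needs the one follower to respond, so a closed or not-yet-started follower is enough to make the leader step down.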

Please let me know if you have more questions.
Tsz-Wo



On Thu, Apr 7, 2022 at 5:24 PM Tsz Wo Sze <sz...@gmail.com> wrote:

> Hi William,
>
> Please include the new conf when adding the group, i.e.
> Client.getGroupManaApi(2).add(group1[1,2]).
>
> Hope it helps.
> Tsz-Wo

Re: Live lock issues on installSnapshot & setConfiguration

Posted by Tsz Wo Sze <sz...@gmail.com>.
Hi William,

Please include the new conf when adding the group, i.e.
Client.getGroupManaApi(2).add(group1[1,2]).

Hope it helps.
Tsz-Wo
