
Posted to issues@ratis.apache.org by "Xinyu Tan (Jira)" <ji...@apache.org> on 2023/05/13 03:00:00 UTC

[jira] [Updated] (RATIS-1841) Fixed bug where cluster restart failed to transfer snapshot

     [ https://issues.apache.org/jira/browse/RATIS-1841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xinyu Tan updated RATIS-1841:
-----------------------------
    Attachment: screenshot-1.png

> Fixed bug where cluster restart failed to transfer snapshot
> -----------------------------------------------------------
>
>                 Key: RATIS-1841
>                 URL: https://issues.apache.org/jira/browse/RATIS-1841
>             Project: Ratis
>          Issue Type: Bug
>    Affects Versions: 2.4.1, 2.5.0
>            Reporter: Xinyu Tan
>            Assignee: Xinyu Tan
>            Priority: Major
>         Attachments: image-2023-05-13-10-50-14-537.png, screenshot-1.png
>
>
> Hi, we have found that the [problem|https://issues.apache.org/jira/browse/RATIS-1838] we reported earlier is still recurring: a multi-replica cluster may fail to restart after transferring an outdated snapshot.
> Upon further investigation, we found that the issue originates from the resetClient function. Specifically, there is a flaw in its logic that causes it to incorrectly set the nextIndex of followers to 0, which leads to the error message shown in the attached screenshot.
> !image-2023-05-13-10-50-14-537.png!
> Reviewing the code history, we determined that the issue appeared only after the [PR|https://github.com/apache/ratis/pull/805] was merged; the code behaved correctly before that change.
> The fix is to remove the index check from resetClient, since the onError and request == null conditions alone are sufficient to cover the cases it needs to handle.
> PTAL~[~szetszwo]
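
To make the failure mode concrete, here is a toy Java model of the reset decision. It is a sketch under stated assumptions, not the actual Ratis code: every name in it (ResetClientSketch, Follower, candidateNextIndex, the UNKNOWN_INDEX = -1 sentinel for a not-yet-known matchIndex) is hypothetical, and the matchIndex == 0 test only stands in for "the index check" described above; the real condition may differ. It shows how an extra index condition can let an error-path reset with no pending request fall through and compute nextIndex = 0, while an onError && request == null guard leaves nextIndex unchanged.

```java
import java.util.OptionalLong;

/**
 * Toy model of the nextIndex reset decision discussed in RATIS-1841.
 * Hypothetical: NOT the Ratis GrpcLogAppender code; all names and the
 * UNKNOWN_INDEX sentinel are assumptions for illustration only.
 */
public class ResetClientSketch {
    // Hypothetical sentinel meaning "follower's matchIndex is not known yet" (e.g. right after a restart).
    static final long UNKNOWN_INDEX = -1;

    static final class Follower {
        long nextIndex;
        long matchIndex;

        Follower(long nextIndex, long matchIndex) {
            this.nextIndex = nextIndex;
            this.matchIndex = matchIndex;
        }
    }

    /** Resend point: one past the failed request's previous index, else one past matchIndex. */
    static long candidateNextIndex(OptionalLong requestPrevIndex, Follower f) {
        return 1 + requestPrevIndex.orElse(f.matchIndex);
    }

    /** Guard with an extra index condition (hypothetical stand-in for the pre-fix check). */
    static void resetWithIndexCheck(OptionalLong requestPrevIndex, boolean onError, Follower f) {
        if (onError && f.matchIndex == 0 && !requestPrevIndex.isPresent()) {
            return; // keep nextIndex unchanged and retry
        }
        // Falls through when matchIndex is the sentinel: candidate = 1 + (-1) = 0.
        f.nextIndex = Math.min(f.nextIndex, candidateNextIndex(requestPrevIndex, f));
    }

    /** Guard as described in the fix: onError && request == null is sufficient. */
    static void resetFixed(OptionalLong requestPrevIndex, boolean onError, Follower f) {
        if (onError && !requestPrevIndex.isPresent()) {
            return; // keep nextIndex unchanged and retry
        }
        f.nextIndex = Math.min(f.nextIndex, candidateNextIndex(requestPrevIndex, f));
    }

    public static void main(String[] args) {
        // After a restart: nextIndex is known, but matchIndex is still the sentinel.
        Follower a = new Follower(100, UNKNOWN_INDEX);
        resetWithIndexCheck(OptionalLong.empty(), true, a);
        System.out.println("with index check: nextIndex = " + a.nextIndex);

        Follower b = new Follower(100, UNKNOWN_INDEX);
        resetFixed(OptionalLong.empty(), true, b);
        System.out.println("fixed guard:      nextIndex = " + b.nextIndex);
    }
}
```

Running main prints nextIndex = 0 for the guard with the extra index condition and nextIndex = 100 for the fixed guard, mirroring the symptom in the screenshot and the proposed remedy in this simplified model.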



--
This message was sent by Atlassian Jira
(v8.20.10#820010)