Posted to issues@ratis.apache.org by "Xinyu Tan (Jira)" <ji...@apache.org> on 2023/05/13 02:58:00 UTC

[jira] [Created] (RATIS-1841) Fixed bug where cluster restart failed to transfer snapshot

Xinyu Tan created RATIS-1841:
--------------------------------

             Summary: Fixed bug where cluster restart failed to transfer snapshot
                 Key: RATIS-1841
                 URL: https://issues.apache.org/jira/browse/RATIS-1841
             Project: Ratis
          Issue Type: Bug
    Affects Versions: 2.5.0, 2.4.1
            Reporter: Xinyu Tan
            Assignee: Xinyu Tan
         Attachments: image-2023-05-13-10-50-14-537.png

Hi, we have discovered that the [problem|https://issues.apache.org/jira/browse/RATIS-1838] we reported earlier is still recurring: a multi-replica cluster may fail to restart after transferring an outdated snapshot.

Upon further investigation, we found that the issue originates from the resetClient function. Specifically, there is a flaw in its logic that causes it to incorrectly set the nextIndex of followers to 0, which leads to the error message shown in the attached screenshot.

!image-2023-05-13-10-50-14-537.png!

Upon reviewing the code, we determined that the issue was introduced by merging the [PR|https://github.com/apache/ratis/pull/805]. The code was correct before that merge.

After investigating further, we determined that the solution is to remove the index check, because the conditions onError and request == null are sufficient to cover the required cases.
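
To make the intended guard concrete, here is a minimal, self-contained sketch of the idea. It is not the actual Ratis code: the class ResetClientSketch, the Follower type, the nullable previousLogIndex parameter (standing in for request == null), and the fallback to matchIndex + 1 are all assumptions made for illustration. It only shows that skipping the reset when onError is true and there is no request leaves the follower's nextIndex untouched, with no additional index condition involved.

```java
// Hypothetical sketch, not the Ratis implementation.
class ResetClientSketch {
    /** Hypothetical stand-in for the leader's view of one follower. */
    static final class Follower {
        long matchIndex;
        long nextIndex;

        Follower(long matchIndex, long nextIndex) {
            this.matchIndex = matchIndex;
            this.nextIndex = nextIndex;
        }
    }

    /**
     * Resets the follower's nextIndex after the client connection is reset.
     *
     * @param previousLogIndex index preceding the failed request's entries,
     *                         or null when there is no request (request == null)
     * @param onError          true when the reset was triggered by an error
     * @return true if nextIndex was recomputed, false if the reset was skipped
     */
    static boolean resetClient(Follower f, Long previousLogIndex, boolean onError) {
        // The guard uses only the two conditions named above:
        // an error path with no request to derive an index from.
        if (onError && previousLogIndex == null) {
            return false; // keep nextIndex as it is
        }
        // Assumed fallback: resume right after the last known matching index.
        long base = (previousLogIndex != null) ? previousLogIndex : f.matchIndex;
        // Only ever move nextIndex backwards, never forwards.
        f.nextIndex = Math.min(f.nextIndex, base + 1);
        return true;
    }

    public static void main(String[] args) {
        Follower f = new Follower(10, 100);
        System.out.println(resetClient(f, null, true) + " nextIndex=" + f.nextIndex);
        System.out.println(resetClient(f, 41L, true) + " nextIndex=" + f.nextIndex);
    }
}
```

In this sketch the first call in main is skipped by the guard, while the second call has a request to work from and moves nextIndex back to the entry after previousLogIndex.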

PTAL~[~szetszwo]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)