Posted to issues@ratis.apache.org by "Ethan Rose (Jira)" <ji...@apache.org> on 2021/02/03 20:32:00 UTC

[jira] [Created] (RATIS-1305) Leader stuck in infinite install snapshot cycle when logs have been purged

Ethan Rose created RATIS-1305:
---------------------------------

             Summary: Leader stuck in infinite install snapshot cycle when logs have been purged
                 Key: RATIS-1305
                 URL: https://issues.apache.org/jira/browse/RATIS-1305
             Project: Ratis
          Issue Type: Bug
          Components: server
            Reporter: Ethan Rose
            Assignee: Ethan Rose
             Fix For: 1.1.0


After logs have been purged from the leader and followers, the leader repeatedly attempts to send snapshots to the followers, which reject them because there have been no new transactions to apply. The leader nevertheless keeps sending them indefinitely, and the cluster becomes unresponsive.

Here is an example of the log messages, where om1 is the leader and om2 and om3 are followers.

On the leader om1:

{code}

om1_1 | 2021-02-02 17:17:23,261 [om1@group-D66704EFC61C->om2-GrpcLogAppender-LogAppenderDaemon] INFO server.GrpcLogAppender: om1@group-D66704EFC61C->om2-GrpcLogAppender: followerNextIndex = 337 but logStartIndex = -1, notify follower to install snapshot-(t:1, i:337)
om1_1 | 2021-02-02 17:17:23,272 [om1@group-D66704EFC61C->om3-GrpcLogAppender-LogAppenderDaemon] INFO server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender: followerNextIndex = 337 but logStartIndex = -1, notify follower to install snapshot-(t:1, i:337)
om1_1 | 2021-02-02 17:17:23,286 [om1@group-D66704EFC61C->om3-GrpcLogAppender-LogAppenderDaemon] INFO server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender: send om1->om3#0-t1,notify:(t:1, i:337)
om1_1 | 2021-02-02 17:17:23,286 [om1@group-D66704EFC61C->om2-GrpcLogAppender-LogAppenderDaemon] INFO server.GrpcLogAppender: om1@group-D66704EFC61C->om2-GrpcLogAppender: send om1->om2#0-t1,notify:(t:1, i:337)
om1_1 | 2021-02-02 17:17:23,522 [grpc-default-executor-1] INFO server.GrpcLogAppender: om1@group-D66704EFC61C->om3-InstallSnapshotResponseHandler: received a reply om1<-om3#0:FAIL-t1,ALREADY_INSTALLED,snapshotIndex=336
om1_1 | 2021-02-02 17:17:23,522 [grpc-default-executor-1] INFO server.GrpcLogAppender: om1@group-D66704EFC61C->om3-InstallSnapshotResponseHandler: Already Installed Snapshot Index 336.
om1_1 | 2021-02-02 17:17:23,522 [grpc-default-executor-1] INFO leader.FollowerInfo: om1@group-D66704EFC61C->om3: snapshotIndex: setUnconditionally 0 -> 336
om1_1 | 2021-02-02 17:17:23,522 [grpc-default-executor-1] INFO leader.FollowerInfo: om1@group-D66704EFC61C->om3: matchIndex: setUnconditionally 336 -> 336
om1_1 | 2021-02-02 17:17:23,523 [grpc-default-executor-1] INFO leader.FollowerInfo: om1@group-D66704EFC61C->om3: nextIndex: setUnconditionally 337 -> 337
om1_1 | 2021-02-02 17:17:23,523 [grpc-default-executor-1] INFO leader.FollowerInfo: om1@group-D66704EFC61C->om3: nextIndex: updateToMax old=337, new=337, updated? false
om1_1 | 2021-02-02 17:17:23,570 [grpc-default-executor-1] INFO server.GrpcLogAppender: om1@group-D66704EFC61C->om2-InstallSnapshotResponseHandler: received a reply om1<-om2#0:FAIL-t1,ALREADY_INSTALLED,snapshotIndex=336
om1_1 | 2021-02-02 17:17:23,570 [grpc-default-executor-1] INFO server.GrpcLogAppender: om1@group-D66704EFC61C->om2-InstallSnapshotResponseHandler: Already Installed Snapshot Index 336.

{code}

 

On follower om2:

{code}

om2_1 | 2021-02-02 17:17:23,306 [grpc-default-executor-0] INFO server.RaftServer$Division: om2@group-D66704EFC61C: receive installSnapshot: om1->om2#0-t1,notify:(t:1, i:337)
om2_1 | 2021-02-02 17:17:23,312 [grpc-default-executor-0] INFO server.RaftServer$Division: om2@group-D66704EFC61C: StateMachine snapshotIndex is 336
om2_1 | 2021-02-02 17:17:23,560 [grpc-default-executor-0] INFO server.RaftServer$Division: om2@group-D66704EFC61C: set new configuration configurationEntry {
om2_1 | peers {
om2_1 | id: "om1"
om2_1 | address: "om1:9872"
om2_1 | }
om2_1 | peers {
om2_1 | id: "om3"
om2_1 | address: "om3:9872"
om2_1 | }
om2_1 | peers {
om2_1 | id: "om2"
om2_1 | address: "om2:9872"
om2_1 | }
om2_1 | }
om2_1 | from snapshot
om2_1 | 2021-02-02 17:17:23,561 [grpc-default-executor-0] INFO server.RaftServer$Division: om2@group-D66704EFC61C: set configuration 0: [om1|rpc:om1:9872|dataStream:|priority:0, om3|rpc:om3:9872|dataStream:|priority:0, om2|rpc:om2:9872|dataStream:|priority:0], old=null
om2_1 | 2021-02-02 17:17:23,567 [grpc-default-executor-0] INFO server.RaftServer$Division: om2@group-D66704EFC61C: reply installSnapshot: om1<-om2#0:FAIL-t1,ALREADY_INSTALLED,snapshotIndex=336
om2_1 | 2021-02-02 17:17:23,570 [grpc-default-executor-0] INFO server.GrpcServerProtocolService: om2: Completed INSTALL_SNAPSHOT, lastRequest: om1->om2#0-t1,notify:(t:1, i:337)

{code}

 

These log messages repeat until the cluster is terminated. The term and index numbers never change between repetitions.
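
To make the cycle easier to reason about, here is a small toy model of it. This is not Ratis code: the class and method names are hypothetical, and the leader-side condition is only an assumption inferred from the log message "followerNextIndex = 337 but logStartIndex = -1". It shows that when the leader's log is empty and the follower answers ALREADY_INSTALLED, nextIndex is reset to snapshotIndex + 1, which is the value it already had, so nothing ever changes and the leader notifies again.

```java
// Toy model of the observed cycle. All names are hypothetical; this is not the Ratis implementation.
final class SnapshotCycleSketch {
    // Value of logStartIndex seen in the leader log once the log has been purged.
    static final long INVALID_LOG_INDEX = -1;

    // Assumed leader-side check, inferred from the log message only:
    // notify when the leader has no log entries, or the follower needs an entry before the log start.
    static boolean shouldNotifyInstallSnapshot(long followerNextIndex, long logStartIndex) {
        return logStartIndex == INVALID_LOG_INDEX || followerNextIndex < logStartIndex;
    }

    // After an ALREADY_INSTALLED reply the leader log shows nextIndex set to snapshotIndex + 1.
    static long nextIndexAfterAlreadyInstalled(long followerSnapshotIndex) {
        return followerSnapshotIndex + 1;
    }

    // Counts how many notifications the leader sends before it stops or maxRounds is reached.
    static int countNotifications(long followerNextIndex, long logStartIndex,
                                  long followerSnapshotIndex, int maxRounds) {
        long nextIndex = followerNextIndex;
        int sent = 0;
        while (sent < maxRounds && shouldNotifyInstallSnapshot(nextIndex, logStartIndex)) {
            sent++; // leader sends "notify follower to install snapshot"
            // follower replies ALREADY_INSTALLED; leader updates nextIndex, which does not move
            nextIndex = nextIndexAfterAlreadyInstalled(followerSnapshotIndex);
        }
        return sent;
    }
}
```

With the values from the logs above (nextIndex 337, logStartIndex -1, snapshotIndex 336), countNotifications stops only when it reaches maxRounds, which matches the repetition seen in the cluster. With a non-empty log whose start index is at or below 337, the same model sends no notifications.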

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)