You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ratis.apache.org by "Tsz-wo Sze (Jira)" <ji...@apache.org> on 2020/02/04 23:27:00 UTC

[jira] [Commented] (RATIS-794) Ratils leader should retry append requests based on follower commit info in case of intermittent append failures

    [ https://issues.apache.org/jira/browse/RATIS-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17030207#comment-17030207 ] 

Tsz-wo Sze commented on RATIS-794:
----------------------------------

[~shashikant], any comments on the new patch?

> Ratils leader should retry append requests based on follower commit info in case of intermittent append failures
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: RATIS-794
>                 URL: https://issues.apache.org/jira/browse/RATIS-794
>             Project: Ratis
>          Issue Type: Bug
>          Components: server
>            Reporter: Shashikant Banerjee
>            Assignee: Tsz-wo Sze
>            Priority: Major
>             Fix For: 0.5.0
>
>         Attachments: r794_20200122.patch, r794_20200124.patch
>
>
> During Ozone testing, it was observed that a leader election happens in between the test , where a follower has caught to a certain index 313. The new leader starts sends an append request to the follower which fails with grpc Exception. This leads to leader reset the connection and start from the beginning (index 1). 
>  
>  
> {code:java}
> 2020-01-13 14:56:32,995 INFO org.apache.ratis.server.impl.RaftServerImpl: 0.0.0.0:9858@group-4F125BF42C14: changes role from CANDIDATE to LEADER at term 7 for changeToLeader
> 2020-01-13 14:56:32,995 INFO org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis: Leader change notification received for group: group-4F125BF42C14 with new leaderId: ed90869c-317e-4303-8922-9fa83a3983cb
> 2020-01-13 14:56:33,042 WARN org.apache.ratis.grpc.server.GrpcLogAppender: 0.0.0.0:9858@group-4F125BF42C14->10.120.139.111:9858-AppendLogResponseHandler: Failed appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
> 2020-01-13 14:56:33,043 DEBUG org.apache.ratis.util.PeerProxyMap: ed90869c-317e-4303-8922-9fa83a3983cb: reset proxy for b65b0b6c-b0bb-429f-a23d-467c72d4b85c
> 2020-01-13 14:56:33,044 DEBUG org.apache.ratis.util.LifeCycle: b65b0b6c-b0bb-429f-a23d-467c72d4b85c:10.120.139.111:9858: RUNNING -> CLOSING
> 2020-01-13 14:56:33,044 DEBUG org.apache.ratis.util.LifeCycle: b65b0b6c-b0bb-429f-a23d-467c72d4b85c:10.120.139.111:9858: CLOSING -> CLOSED
> 2020-01-13 14:56:33,044 DEBUG org.apache.ratis.util.LifeCycle: b65b0b6c-b0bb-429f-a23d-467c72d4b85c:10.120.139.111:9858: NEW
> 2020-01-13 14:56:33,044 DEBUG org.apache.ratis.util.TimeoutScheduler: new ScheduledThreadPoolExecutor
> 2020-01-13 14:56:33,044 DEBUG org.apache.ratis.util.PeerProxyMap: ed90869c-317e-4303-8922-9fa83a3983cb: Closing proxy for peer b65b0b6c-b0bb-429f-a23d-467c72d4b85c:10.120.139.111:9858
> 2020-01-13 14:56:33,045 DEBUG org.apache.ratis.util.TimeoutScheduler: schedule a task: timeout 6000ms, sid 1 
> 2020-01-13 14:56:33,047 INFO org.apache.ratis.server.impl.FollowerInfo: 0.0.0.0:9858@group-4F125BF42C14->10.120.139.111:9858: nextIndex: updateUnconditionally 314 -> 1 ---------------------> set the next index for the follower back to 1 and  starts from 1)
> 2020-01-13 14:56:35,840 DEBUG org.apache.ratis.grpc.server.GrpcLogAppender: 0.0.0.0:9858@group-4F125BF42C14->10.120.139.111:9858-AppendLogResponseHandler: received the first reply ed90869c-317e-4303-8922-9fa83a3983cb<-b65b0b6c-b0bb-429f-a23d-467c72d4b85c#2:OK,SUCCESS,nextIndex:314,term:5,followerCommit:313, request=AppendEntriesRequest:cid=2,entriesCount=0,lastEntry=null .  -------------------> (Receives the response from follower indficating follower is at 312)
> Although the follower is at 313, the leader keeps on sending the appendRequests from index 1. 
> 2020-01-13 14:56:35,841 DEBUG org.apache.ratis.server.impl.FollowerInfo: 0.0.0.0:9858@group-4F125BF42C14->10.120.139.111:9858: nextIndex: updateIncreasingly 1 -> 2
> 2020-01-13 14:56:35,841 DEBUG org.apache.ratis.util.TimeoutScheduler: schedule a task: timeout 6000ms, sid 7
> 2020-01-13 14:56:35,843 DEBUG org.apache.ratis.server.impl.FollowerInfo: 0.0.0.0:9858@group-4F125BF42C14->10.120.139.111:9858: nextIndex: updateIncreasingly 2 -> 3
> 2020-01-13 14:56:35,843 DEBUG org.apache.ratis.util.TimeoutScheduler: schedule a task: timeout 6000ms, sid 8
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)