You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@ratis.apache.org by GitBox <gi...@apache.org> on 2020/12/29 03:32:36 UTC

[GitHub] [incubator-ratis] runzhiwang opened a new pull request #383: RATIS-1273. Fix split brain by leader lease

runzhiwang opened a new pull request #383:
URL: https://github.com/apache/incubator-ratis/pull/383


   ## What changes were proposed in this pull request?
   
   **What's the problem ?**
   
   For example, there are 3 servers: s1, s2, s3, and s1 is leader. When split-brain happens, s2 was elected as new leader, but s1 still think it's leader, when client read from s1, if s2 has processed write request, client will read old data from s1.
   
   **How to fix ?**
   Assign the leader with a lease, the leader would use the
   normal heartbeat mechanism to maintain a lease. Once the leader’s heartbeats were acknowledged by a majority of the cluster, it would extends its lease to `start+ election timeout`, since the followers shouldn’t time out before then, so we can make sure there will no new leader was elected(need pre-vote feature and need to consider transferLeadership feature) , so before start + election timeout, there will not split-brain happens.
   ![image](https://user-images.githubusercontent.com/51938049/103255180-4d581280-49c3-11eb-855e-ccd972755a4b.png).
   
   Why need pre-vote feature ?
   
   As the image shows, s1 is leader, but server1 can not connect with s2, even though server1 extend its lease to start+ election timeout when s1 receive acknowledgement from s3, but before start+ election timeout, s1 isolated from all servers, and s2 maybe timeout and start election and change to leader immediately with vote from s3, so both s1 and s2 think itself as leader before start+ election timeout. But if with pre-vote feature, when s2 request vote, s3 check s1's leadership is still valid, s3 will reject vote to s2, only one leader exists.
   
   ![image](https://user-images.githubusercontent.com/51938049/103255778-9c9f4280-49c5-11eb-8ae6-4a6fa035b2ad.png)
   
   How to address transferLeadership ?
   
   For example, s1 is leader and extend its lease to `start+ election timeout` when s1 receive acknowledgement from s2 and s3. But before `start+ election timeout`, admin maybe call transferLeadership(s2), after s1 send StartLeaderElectionRequest to s2, s1 isolated from all servers, then s2 start election and change to leader immediately with vote from s3, so both s1 and s2 think itself as leader before start+ election timeout.
   
   So s1 should step down as a follower when s1 send StartLeaderElectionRequest to s2.
   
   @szetszwo Could you help review this proposal ?
   
   ## What is the link to the Apache JIRA
   
   https://issues.apache.org/jira/browse/RATIS-1273
   
   ## How was this patch tested?
   
   TODO
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] runzhiwang closed pull request #383: RATIS-1273. Fix split brain by leader lease

Posted by GitBox <gi...@apache.org>.
runzhiwang closed pull request #383:
URL: https://github.com/apache/incubator-ratis/pull/383


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] szetszwo commented on pull request #383: RATIS-1273. Fix split brain by leader lease

Posted by GitBox <gi...@apache.org>.
szetszwo commented on pull request #383:
URL: https://github.com/apache/incubator-ratis/pull/383#issuecomment-757980043


   @runzhiwang , in general, I agree that PreVote should help for all the applications.  However, PreVote needs an additional phase before the real election.  It could potentially slows down some applications.  Individual applications like Ozone may want to benchmark it.  If it is not configurable, it is impossible to benchmark.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] runzhiwang commented on pull request #383: RATIS-1273. Fix split brain by leader lease

Posted by GitBox <gi...@apache.org>.
runzhiwang commented on pull request #383:
URL: https://github.com/apache/incubator-ratis/pull/383#issuecomment-755035694


   @szetszwo Hi, because this PR depends on Pre-Vote PR: https://github.com/apache/incubator-ratis/pull/161. So I want to modify Pre-Vote PR first, after Pre-Vote PR was merged, then I continue to work on this PR. what do you think ?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] runzhiwang commented on pull request #383: RATIS-1273. Fix split brain by leader lease

Posted by GitBox <gi...@apache.org>.
runzhiwang commented on pull request #383:
URL: https://github.com/apache/incubator-ratis/pull/383#issuecomment-754469628


   @szetszwo Could you help review this proposal ?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] runzhiwang edited a comment on pull request #383: RATIS-1273. Fix split brain by leader lease

Posted by GitBox <gi...@apache.org>.
runzhiwang edited a comment on pull request #383:
URL: https://github.com/apache/incubator-ratis/pull/383#issuecomment-763400887


   this pr depends on: https://github.com/apache/incubator-ratis/pull/398


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] runzhiwang edited a comment on pull request #383: RATIS-1273. Fix split brain by leader lease

Posted by GitBox <gi...@apache.org>.
runzhiwang edited a comment on pull request #383:
URL: https://github.com/apache/incubator-ratis/pull/383#issuecomment-761709658






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] runzhiwang commented on pull request #383: RATIS-1273. Fix split brain by leader lease

Posted by GitBox <gi...@apache.org>.
runzhiwang commented on pull request #383:
URL: https://github.com/apache/incubator-ratis/pull/383#issuecomment-761709658


   @szetszwo Hi, with leader lease, the CI becomes flaky, there are 2 reasons:
   1. The CI environment' machine has a low performance, some times one rpc call cost more then 20ms from send to receive.
   2. In current ratis implementation,  leader sends log or heartbeat to follower every 75ms, if there is log, leader will not send heartbeat again. For example, at 0ms there is log and leader send log to follower, at 75ms there is log and leader send log to follower again, ..., at 750ms there is log and leader send log to follower again. So you can find from 0ms to 750ms, log always exist, leader always send log, never heartbeat. But if each log need more than 1000ms to: WriteDisk, applyTransaction, then leader will not receive any reply from 0-1000ms, then leader lease becomes invalid frequently.
   
   I think we have following options:
   1. Increase rpc.timeout.min from 150ms to 1500ms
   2. Default disable leader lease, then CI need not to consider leader lease
   
   What do you think ?
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] runzhiwang commented on a change in pull request #383: RATIS-1273. Fix split brain by leader lease

Posted by GitBox <gi...@apache.org>.
runzhiwang commented on a change in pull request #383:
URL: https://github.com/apache/incubator-ratis/pull/383#discussion_r551927347



##########
File path: ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java
##########
@@ -104,6 +104,20 @@ public RaftPeerRole getCurrentRole() {
       return getRole().getCurrentRole();
     }
 
+    @Override
+    public boolean isLeader() {

Review comment:
       @szetszwo Oh, sorry. By "user", I mean OM or SCM in Ozone, because they need to check isLeader to decide whether do some thing, for example only leader SCM send command to datanode, so SCM need to know whether it's leader. Then the problem becomes SCM call which method: isLeader() or isLeaderReady(). From the name of method, maybe SCM is likely to use isLeader().




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] runzhiwang commented on a change in pull request #383: RATIS-1273. Fix split brain by leader lease

Posted by GitBox <gi...@apache.org>.
runzhiwang commented on a change in pull request #383:
URL: https://github.com/apache/incubator-ratis/pull/383#discussion_r551906792



##########
File path: ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java
##########
@@ -104,6 +104,20 @@ public RaftPeerRole getCurrentRole() {
       return getRole().getCurrentRole();
     }
 
+    @Override
+    public boolean isLeader() {

Review comment:
       > Assume that the leader won't step down if it has lost the leader lease.
   
   @szetszwo Yes, leader won't step down if it has lost the leader lease.
   Because I'm afraid if step down, leader election will happen more frequent. If not step down, leader lost the lease maybe can get the lease again in future. If leader actully need to step down, we do achieve it by checkLeaderShip, so there is a buffer time between lost lease and step down for leader to get lease again.  Step down or not, I think it's still under discussion.
   
   > If we change isLeader(), the client will see NotLeaderException and retry with a different peer.
   
   Yes, but user maybe prefer to call isLeader, maybe retry with a different peer does not matter ? not sure.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] runzhiwang commented on pull request #383: RATIS-1273. Fix split brain by leader lease

Posted by GitBox <gi...@apache.org>.
runzhiwang commented on pull request #383:
URL: https://github.com/apache/incubator-ratis/pull/383#issuecomment-762545226


   > Let's also make it configurable? It seems that both ways have its own benefit.
   
   @szetszwo Thanks, I agree.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] szetszwo commented on pull request #383: RATIS-1273. Fix split brain by leader lease

Posted by GitBox <gi...@apache.org>.
szetszwo commented on pull request #383:
URL: https://github.com/apache/incubator-ratis/pull/383#issuecomment-756887260


   > We may not need the LEADER_LEASE_TIMEOUT_RATIO_KEY at the very beginning. ...
   
   I agree with @GlenGeng that we may not need LEADER_LEASE_TIMEOUT_RATIO_KEY.  If we take rpc-send-time right before sending out appendEntries, it seems pretty safe to use min-rpc-timeout as the leader-lease-timeout.  If split brain happens, it has to take at least (min-rpc-timeout + leader election time) to elect a new leader.  Then, the old leader lease must be expired by that time.
   
   @runzhiwang , what do you think?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] runzhiwang closed pull request #383: RATIS-1273. Fix split brain by leader lease

Posted by GitBox <gi...@apache.org>.
runzhiwang closed pull request #383:
URL: https://github.com/apache/incubator-ratis/pull/383


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] runzhiwang commented on pull request #383: RATIS-1273. Fix split brain by leader lease

Posted by GitBox <gi...@apache.org>.
runzhiwang commented on pull request #383:
URL: https://github.com/apache/incubator-ratis/pull/383#issuecomment-763400887


   depends on: https://github.com/apache/incubator-ratis/pull/398


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] szetszwo commented on a change in pull request #383: RATIS-1273. Fix split brain by leader lease

Posted by GitBox <gi...@apache.org>.
szetszwo commented on a change in pull request #383:
URL: https://github.com/apache/incubator-ratis/pull/383#discussion_r552301235



##########
File path: ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java
##########
@@ -104,6 +104,20 @@ public RaftPeerRole getCurrentRole() {
       return getRole().getCurrentRole();
     }
 
+    @Override
+    public boolean isLeader() {

Review comment:
       Sure, thanks.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] runzhiwang commented on a change in pull request #383: RATIS-1273. Fix split brain by leader lease

Posted by GitBox <gi...@apache.org>.
runzhiwang commented on a change in pull request #383:
URL: https://github.com/apache/incubator-ratis/pull/383#discussion_r552335190



##########
File path: ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java
##########
@@ -104,6 +104,20 @@ public RaftPeerRole getCurrentRole() {
       return getRole().getCurrentRole();
     }
 
+    @Override
+    public boolean isLeader() {

Review comment:
       > To maintain the lease, the follower should withhold its vote if its has replied the AppendEntriesRequest, even it meets a RequestVoteRequest with higher term and newer raft log. Seems the RequestVote logic is not touched in this PR ?
   
   @GlenGeng This will be done in Pre-vote PR: https://github.com/apache/incubator-ratis/pull/161




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] szetszwo commented on pull request #383: RATIS-1273. Fix split brain by leader lease

Posted by GitBox <gi...@apache.org>.
szetszwo commented on pull request #383:
URL: https://github.com/apache/incubator-ratis/pull/383#issuecomment-754477166


   The design looks good. Will look at the code changes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] szetszwo commented on pull request #383: RATIS-1273. Fix split brain by leader lease

Posted by GitBox <gi...@apache.org>.
szetszwo commented on pull request #383:
URL: https://github.com/apache/incubator-ratis/pull/383#issuecomment-761949585


   > 2. Default disable leader lease, then CI need not to consider leader lease
   
   Let's disable leader lease as default.  When the feature becomes stable, we can change the default to enable.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] runzhiwang edited a comment on pull request #383: RATIS-1273. Fix split brain by leader lease

Posted by GitBox <gi...@apache.org>.
runzhiwang edited a comment on pull request #383:
URL: https://github.com/apache/incubator-ratis/pull/383#issuecomment-761709658


   @szetszwo Hi, with leader lease, the CI becomes flaky, there are 2 reasons:
   1. The CI environment' machine has a low performance, some times one rpc call cost more then 20ms from send to receive.
   2. In current ratis implementation,  leader sends log or heartbeat to follower every 75ms, if there is log, leader will not send heartbeat again. For example, at 0ms there is log and leader send log to follower, at 75ms there is log and leader send log to follower again, ..., at 750ms there is log and leader send log to follower again. So you can find from 0ms to 750ms, log always exist, leader always send log, never heartbeat. But if each log need more than 1000ms to: WriteDisk, applyTransaction, then leader will not receive any reply from 0-1000ms, then leader lease becomes invalid frequently.
   
   I think we have following options:
   1. Increase rpc.timeout.min from 150ms to 1500ms in CI 
   2. Default disable leader lease, then CI need not to consider leader lease
   
   
   What do you think ?
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] runzhiwang commented on pull request #383: RATIS-1273. Fix split brain by leader lease

Posted by GitBox <gi...@apache.org>.
runzhiwang commented on pull request #383:
URL: https://github.com/apache/incubator-ratis/pull/383#issuecomment-757918564


   @szetszwo @bharatviswa504 Thanks, got it. I will add config in next PR.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] szetszwo commented on pull request #383: RATIS-1273. Fix split brain by leader lease

Posted by GitBox <gi...@apache.org>.
szetszwo commented on pull request #383:
URL: https://github.com/apache/incubator-ratis/pull/383#issuecomment-754474016


   Sure, will review this.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] runzhiwang commented on pull request #383: RATIS-1273. Fix split brain by leader lease

Posted by GitBox <gi...@apache.org>.
runzhiwang commented on pull request #383:
URL: https://github.com/apache/incubator-ratis/pull/383#issuecomment-757410153


   > I agree with @GlenGeng that we may not need LEADER_LEASE_TIMEOUT_RATIO_KEY. If we take rpc-send-time right before sending out appendEntries, it seems pretty safe to use min-rpc-timeout as the leader-lease-timeout. If split brain happens, it has to take at least (min-rpc-timeout + leader election time) to elect a new leader. Then, the old leader lease must be expired by that time.
   
   @szetszwo I agree. There are some failed ut related to this PR, let me fix them.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] szetszwo commented on a change in pull request #383: RATIS-1273. Fix split brain by leader lease

Posted by GitBox <gi...@apache.org>.
szetszwo commented on a change in pull request #383:
URL: https://github.com/apache/incubator-ratis/pull/383#discussion_r551924473



##########
File path: ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java
##########
@@ -104,6 +104,20 @@ public RaftPeerRole getCurrentRole() {
       return getRole().getCurrentRole();
     }
 
+    @Override
+    public boolean isLeader() {

Review comment:
       Sorry that I was not clear.  By "user", I actually mean RaftClient.  RaftServerImpl.checkLeaderState is an internal private method.  It is called by all the leader methods such as submitClientRequestAsync(..). RaftClient will get a NotLeaderException when isLeader() returns false, or LeaderNotReadyException when isLeaderReady() returns false.
   
   If the leader is not going to step down, it is better to change isLeaderReady().  The leader may regain the leader lease so that RaftClient should retry with the same leader.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] szetszwo commented on pull request #383: RATIS-1273. Fix split brain by leader lease

Posted by GitBox <gi...@apache.org>.
szetszwo commented on pull request #383:
URL: https://github.com/apache/incubator-ratis/pull/383#issuecomment-757740636


   @runzhiwang , thanks a lot.
   
   BTW, we should add confs to enable/disable PreVote and LeaderLease.  Some applications may not require these features.  This is suggested by @bshashikant .


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] GlenGeng commented on a change in pull request #383: RATIS-1273. Fix split brain by leader lease

Posted by GitBox <gi...@apache.org>.
GlenGeng commented on a change in pull request #383:
URL: https://github.com/apache/incubator-ratis/pull/383#discussion_r552334420



##########
File path: ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java
##########
@@ -104,6 +104,20 @@ public RaftPeerRole getCurrentRole() {
       return getRole().getCurrentRole();
     }
 
+    @Override
+    public boolean isLeader() {

Review comment:
       @runzhiwang  Thanks for the tag. I have following suggestions:
   
   1. I agree that leader should sort the `sendTsWithAck` increasingly, and iterate from large to small to pick the `sendTsWithAck` that majority of the followers(including himself) has replied the leader as the `startOfLease` .
   2. We can use `LeaderStateImpl#checkLeadership` to calculate and maintain the lease, and step down when lease is expired. `isLeader()` might be called hundreds or thousands times per second, which would be a potential waste to resource. `LeaderStateImpl#checkLeadership` can make leader step down if lease is expired, which is more suitable for the semantics of lease.
   3. We may not need the `LEADER_LEASE_TIMEOUT_RATIO_KEY ` at the very beginning. What we worry about is the lease is too short to make leader unstable, and we can revisit the `clock drift bound` feature later.
   4. To maintain the lease, the follower should withhold its vote if its has replied the AppendEntriesRequest, even it meets a RequestVoteRequest with higher term and newer raft log. Seems the RequestVote logic is not touched in this PR ?
   5. I suggest that leader should step down when the maintained lease is expired. From the view of App layer, `isLeaderReady()` means raft server has become leader, yet the `appliedLogIndex` of StateMachine has not reach the `lastLogIndex`, As soon as `isLeaderReady()` return true, the App layer would expect leader not return false during the whole term.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] runzhiwang commented on a change in pull request #383: RATIS-1273. Fix split brain by leader lease

Posted by GitBox <gi...@apache.org>.
runzhiwang commented on a change in pull request #383:
URL: https://github.com/apache/incubator-ratis/pull/383#discussion_r551927347



##########
File path: ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java
##########
@@ -104,6 +104,20 @@ public RaftPeerRole getCurrentRole() {
       return getRole().getCurrentRole();
     }
 
+    @Override
+    public boolean isLeader() {

Review comment:
       @szetszwo Oh, sorry. By "user", I mean OM or SCM, because they need to check isLeader to decide whether do some thing, for example only leader SCM send command to datanode, so SCM need to know whether it's leader. Then the problem becomes SCM call which method: isLeader() or isLeaderReady(). From the name of method, maybe SCM is likely to use isLeader().




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] szetszwo commented on pull request #383: RATIS-1273. Fix split brain by leader lease

Posted by GitBox <gi...@apache.org>.
szetszwo commented on pull request #383:
URL: https://github.com/apache/incubator-ratis/pull/383#issuecomment-762544924


   > ..., I find it's almost impossible to enable leader lease in CI, because sometimes it cost 300ms from leader send heart to follower receive heartbeat. So CI will become very unstable, unless we increase rpc.timeout.min.
   
   Yes, we may increase rpc.timeout.min if necessary.
   
   > ... leader step down when lease become invalid ? ...
   
   Let's also make it configurable?  It seems that both ways have its own benefit.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] runzhiwang commented on a change in pull request #383: RATIS-1273. Fix split brain by leader lease

Posted by GitBox <gi...@apache.org>.
runzhiwang commented on a change in pull request #383:
URL: https://github.com/apache/incubator-ratis/pull/383#discussion_r551919913



##########
File path: ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java
##########
@@ -104,6 +104,20 @@ public RaftPeerRole getCurrentRole() {
       return getRole().getCurrentRole();
     }
 
+    @Override
+    public boolean isLeader() {

Review comment:
       > Not the user code. It is the code in RaftServerImpl.checkLeaderState.
   
   @szetszwo Yes.  But I'm not sure whether it's interface friendly, maybe user is likely to check isLeader, not isLeaderReady. Just a discussion, I'm not sure.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] runzhiwang commented on a change in pull request #383: RATIS-1273. Fix split brain by leader lease

Posted by GitBox <gi...@apache.org>.
runzhiwang commented on a change in pull request #383:
URL: https://github.com/apache/incubator-ratis/pull/383#discussion_r552294256



##########
File path: ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java
##########
@@ -104,6 +104,20 @@ public RaftPeerRole getCurrentRole() {
       return getRole().getCurrentRole();
     }
 
+    @Override
+    public boolean isLeader() {

Review comment:
       @szetszwo After thinking twice. I agree with you. We should change isLeaderReady().  Thanks the explanation.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] runzhiwang commented on pull request #383: RATIS-1273. Fix split brain by leader lease

Posted by GitBox <gi...@apache.org>.
runzhiwang commented on pull request #383:
URL: https://github.com/apache/incubator-ratis/pull/383#issuecomment-762519775


   > When the feature becomes stable, we can change the default to enable.
   
   @szetszwo Hi, I find it's almost impossible to enable leader lease in CI, because sometimes it cost 300ms from leader send heart to follower receive heartbeat. So CI will become very unstable, unless we increase rpc.timeout.min.
   
   Besides, what do you think of @GlenGeng 's suggestion: leader step down when lease become invalid ? In SCM HA, we do not check isLeaderReady, we depends on StateMachine#notifyLeaderChanged to change leadership.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] szetszwo commented on a change in pull request #383: RATIS-1273. Fix split brain by leader lease

Posted by GitBox <gi...@apache.org>.
szetszwo commented on a change in pull request #383:
URL: https://github.com/apache/incubator-ratis/pull/383#discussion_r551806316



##########
File path: ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java
##########
@@ -104,6 +104,20 @@ public RaftPeerRole getCurrentRole() {
       return getRole().getCurrentRole();
     }
 
+    @Override
+    public boolean isLeader() {

Review comment:
       Assume that the leader won't step down if it has lost the leader lease.
   
   If the assumption is correct, then it is better to change isLeaderRead() instead of isLeader().  The client will see LeaderNotReadyException and retry with the same leader.
   
   If we change isLeader(),  the client will see NotLeaderException and retry with a different peer.

##########
File path: ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderStateImpl.java
##########
@@ -905,6 +906,47 @@ private void yieldLeaderToHigherPriorityPeer() {
     }
   }
 
+  boolean isLeaderLeaseValid() {
+    if (checkLeaderLease()) {
+      return true;
+    }
+
+    updateLeaderLease();
+    return checkLeaderLease();
+  }
+
+  private boolean checkLeaderLease() {
+    return lastLeaderTimestamp.get().elapsedTimeMs() < server.properties().leaderLeaseTimeoutMs();
+  }
+
+  private boolean updateLeaderLease() {
+    Timestamp startLease = Timestamp.currentTime();
+
+    List<RaftPeerId> activePeers = new ArrayList<>();
+    List<FollowerInfo> followerInfos = getFollowerInfos();
+    for (final FollowerInfo info : followerInfos) {
+      if (info.getLastRpcSendTimeWithResponse().elapsedTimeMs() <= server.properties().leaderLeaseTimeoutMs()) {
+        if (startLease.compareTo(info.getLastRpcSendTimeWithResponse()) > 0) {
+          startLease = info.getLastRpcSendTimeWithResponse();
+        }
+        activePeers.add(info.getPeer().getId());
+      }
+    }

Review comment:
       The for-loop will choose all info with condition lastRpcSendTimeWithResponse < leaderLeaseTimeoutMs.  Suppose there are 5 servers (1 leader + 4 followers) and all 4 followers satisfy condition.  The lastRpcSendTimeWithResponse of the followers are 100, 200, 300, 400.  The for-loop will update startLease to the min, i.e. 100.  However, it should update to 300 since 300, 400 and self have the majority.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] runzhiwang commented on a change in pull request #383: RATIS-1273. Fix split brain by leader lease

Posted by GitBox <gi...@apache.org>.
runzhiwang commented on a change in pull request #383:
URL: https://github.com/apache/incubator-ratis/pull/383#discussion_r551877531



##########
File path: ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderStateImpl.java
##########
@@ -905,6 +906,47 @@ private void yieldLeaderToHigherPriorityPeer() {
     }
   }
 
+  boolean isLeaderLeaseValid() {
+    if (checkLeaderLease()) {
+      return true;
+    }
+
+    updateLeaderLease();
+    return checkLeaderLease();
+  }
+
+  private boolean checkLeaderLease() {
+    return lastLeaderTimestamp.get().elapsedTimeMs() < server.properties().leaderLeaseTimeoutMs();
+  }
+
+  private boolean updateLeaderLease() {
+    Timestamp startLease = Timestamp.currentTime();
+
+    List<RaftPeerId> activePeers = new ArrayList<>();
+    List<FollowerInfo> followerInfos = getFollowerInfos();
+    for (final FollowerInfo info : followerInfos) {
+      if (info.getLastRpcSendTimeWithResponse().elapsedTimeMs() <= server.properties().leaderLeaseTimeoutMs()) {
+        if (startLease.compareTo(info.getLastRpcSendTimeWithResponse()) > 0) {
+          startLease = info.getLastRpcSendTimeWithResponse();
+        }
+        activePeers.add(info.getPeer().getId());
+      }
+    }

Review comment:
       @szetszwo I agree with you. @GlenGeng what do you think ?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] runzhiwang commented on pull request #383: RATIS-1273. Fix split brain by leader lease

Posted by GitBox <gi...@apache.org>.
runzhiwang commented on pull request #383:
URL: https://github.com/apache/incubator-ratis/pull/383#issuecomment-757923211


   @szetszwo @bharatviswa504 Sorry, I have a question, I understand LeaderLease maybe not needed in some applications. But 
   in my thinking PreVote is needed in any application, it can make the leader more steady, what do you think ?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] szetszwo commented on a change in pull request #383: RATIS-1273. Fix split brain by leader lease

Posted by GitBox <gi...@apache.org>.
szetszwo commented on a change in pull request #383:
URL: https://github.com/apache/incubator-ratis/pull/383#discussion_r551914909



##########
File path: ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java
##########
@@ -104,6 +104,20 @@ public RaftPeerRole getCurrentRole() {
       return getRole().getCurrentRole();
     }
 
+    @Override
+    public boolean isLeader() {

Review comment:
       > If leader actully need to step down, we do achieve it by checkLeaderShip, ...
   
   Yes.  I agree.
   
   > ... user maybe prefer to call isLeader, ...
   
   Not the user code.  It is the code in RaftServerImpl.checkLeaderState.
   ```
   //RaftServerImpl.checkLeaderState
       if (!getInfo().isLeader()) {
         NotLeaderException exception = generateNotLeaderException();
         final RaftClientReply reply = newExceptionReply(request, exception);
         return RetryCacheImpl.failWithReply(reply, entry);
       }
       if (!getInfo().isLeaderReady()) {
         final CacheEntry cacheEntry = retryCache.getIfPresent(ClientInvocationId.valueOf(request));
         if (cacheEntry != null && cacheEntry.isCompletedNormally()) {
           return cacheEntry.getReplyFuture();
         }
         final LeaderNotReadyException lnre = new LeaderNotReadyException(getMemberId());
         final RaftClientReply reply = newExceptionReply(request, lnre);
         return RetryCacheImpl.failWithReply(reply, entry);
       }
   ```
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] szetszwo commented on pull request #383: RATIS-1273. Fix split brain by leader lease

Posted by GitBox <gi...@apache.org>.
szetszwo commented on pull request #383:
URL: https://github.com/apache/incubator-ratis/pull/383#issuecomment-758358279


   @runzhiwang , No problem.  Please take you time.  Thanks a lot for working hard on this.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] szetszwo commented on pull request #383: RATIS-1273. Fix split brain by leader lease

Posted by GitBox <gi...@apache.org>.
szetszwo commented on pull request #383:
URL: https://github.com/apache/incubator-ratis/pull/383#issuecomment-754509804


   Question: when the leader has lost the leader lease, should it step down?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] szetszwo commented on pull request #383: RATIS-1273. Fix split brain by leader lease

Posted by GitBox <gi...@apache.org>.
szetszwo commented on pull request #383:
URL: https://github.com/apache/incubator-ratis/pull/383#issuecomment-755036702


   Sure, the plan sounds great.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] runzhiwang closed pull request #383: RATIS-1273. Fix split brain by leader lease

Posted by GitBox <gi...@apache.org>.
runzhiwang closed pull request #383:
URL: https://github.com/apache/incubator-ratis/pull/383


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] GlenGeng commented on pull request #383: RATIS-1273. Fix split brain by leader lease

Posted by GitBox <gi...@apache.org>.
GlenGeng commented on pull request #383:
URL: https://github.com/apache/incubator-ratis/pull/383#issuecomment-760770669


   For now, me and @runzhiwang is developing SCM HA.
   
   In SCM HA, SCM will cache `isLeader` and `term`, updating them when underlying RaftServer steps down or becomes leader, by implementing `StateMachine#notifyNotLeader` and `StateMachine#notifyLeaderChanged`.
   
   SCM HA does not invoke `DivisionInfo#isLeader()` but query its cached `isLeader` to decide whether underlying RaftServer is leader or not, thus it expects RaftServer to step down when lease is expired.
   
   I suggest to implement the leader with lease solution in `LeaderStateImpl#checkLeadership`, and add a switch to decide to step down whether when  lease is expired or when it can not hear from majority.
   
   What do you think @szetszwo  @runzhiwang ?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] runzhiwang commented on a change in pull request #383: RATIS-1273. Fix split brain by leader lease

Posted by GitBox <gi...@apache.org>.
runzhiwang commented on a change in pull request #383:
URL: https://github.com/apache/incubator-ratis/pull/383#discussion_r551927347



##########
File path: ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java
##########
@@ -104,6 +104,20 @@ public RaftPeerRole getCurrentRole() {
       return getRole().getCurrentRole();
     }
 
+    @Override
+    public boolean isLeader() {

Review comment:
       @szetszwo Oh, sorry. By "user", I mean OM or SCM in Ozone, because in OM HA or SCM HA,  they need to check isLeader to decide whether do something, for example only leader SCM send command to datanode, so SCM need to know whether it's leader. Then the problem becomes SCM call which method: isLeader() or isLeaderReady(). From the name of method, maybe SCM is likely to use isLeader().

##########
File path: ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java
##########
@@ -104,6 +104,20 @@ public RaftPeerRole getCurrentRole() {
       return getRole().getCurrentRole();
     }
 
+    @Override
+    public boolean isLeader() {

Review comment:
       @szetszwo Oh, sorry. By "user", I mean OM or SCM in Ozone, because in OM HA or SCM HA,  they need to check isLeader to decide whether do something, for example only leader SCM can send command to datanode, so SCM need to know whether it's leader. Then the problem becomes SCM call which method: isLeader() or isLeaderReady(). From the name of method, maybe SCM is likely to use isLeader().




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] runzhiwang commented on a change in pull request #383: RATIS-1273. Fix split brain by leader lease

Posted by GitBox <gi...@apache.org>.
runzhiwang commented on a change in pull request #383:
URL: https://github.com/apache/incubator-ratis/pull/383#discussion_r551919913



##########
File path: ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java
##########
@@ -104,6 +104,20 @@ public RaftPeerRole getCurrentRole() {
       return getRole().getCurrentRole();
     }
 
+    @Override
+    public boolean isLeader() {

Review comment:
       > Not the user code. It is the code in RaftServerImpl.checkLeaderState.
   
   @szetszwo Yes.  But I'm not sure whether it's interface friendly, maybe user is likely to check isLeader, not isLeaderReady. Just a discussion, I'm not sure.
   
   If we do not need to consider this, I'm also prefer to change isLeaderRead() instead of isLeader()




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] szetszwo commented on pull request #383: RATIS-1273. Fix split brain by leader lease

Posted by GitBox <gi...@apache.org>.
szetszwo commented on pull request #383:
URL: https://github.com/apache/incubator-ratis/pull/383#issuecomment-761949146


   @runzhiwang ,  thanks again for working on this.
   
   > 2. In current ratis implementation, leader sends log or heartbeat to follower every 75ms, if there is log, leader will not send heartbeat again. For example, at 0ms there is log and leader send log to follower, at 75ms there is log and leader send log to follower again, ..., at 750ms there is log and leader send log to follower again. So you can find from 0ms to 750ms, log always exist, leader always send log, never heartbeat. But if each log need more than 1000ms to: WriteDisk, applyTransaction, then leader will not receive any reply from 0-1000ms, then leader lease becomes invalid frequently.
   
   With the leader lease feature, the leader probably should send heartbeat separately since followers may take a long time to process log entires.  (The followers do not count the log processing time when counting heartbeat timeout.  However, it is impossible for the leader to do the same discount.)
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] runzhiwang commented on a change in pull request #383: RATIS-1273. Fix split brain by leader lease

Posted by GitBox <gi...@apache.org>.
runzhiwang commented on a change in pull request #383:
URL: https://github.com/apache/incubator-ratis/pull/383#discussion_r551927347



##########
File path: ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java
##########
@@ -104,6 +104,20 @@ public RaftPeerRole getCurrentRole() {
       return getRole().getCurrentRole();
     }
 
+    @Override
+    public boolean isLeader() {

Review comment:
       @szetszwo Oh, sorry. By "user", I mean OM or SCM in Ozone, because in OM HA or SCM HA,  they need to check isLeader to decide whether do some thing, for example only leader SCM send command to datanode, so SCM need to know whether it's leader. Then the problem becomes SCM call which method: isLeader() or isLeaderReady(). From the name of method, maybe SCM is likely to use isLeader().




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] runzhiwang commented on pull request #383: RATIS-1273. Fix split brain by leader lease

Posted by GitBox <gi...@apache.org>.
runzhiwang commented on pull request #383:
URL: https://github.com/apache/incubator-ratis/pull/383#issuecomment-758353867


   @szetszwo Thanks, got it. With leader lease, some ut become flaky, I need some time to fix them.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] szetszwo commented on pull request #383: RATIS-1273. Fix split brain by leader lease

Posted by GitBox <gi...@apache.org>.
szetszwo commented on pull request #383:
URL: https://github.com/apache/incubator-ratis/pull/383#issuecomment-756887260


   > We may not need the LEADER_LEASE_TIMEOUT_RATIO_KEY at the very beginning. ...
   
   I agree with @GlenGeng that we may not need LEADER_LEASE_TIMEOUT_RATIO_KEY.  If we take rpc-send-time right before sending out appendEntries, it seems pretty safe to use min-rpc-timeout as the leader-lease-timeout.  If split brain happens, it has to take at least (min-rpc-timeout + leader election time) to elect a new leader.  Then, the old leader lease must be expired by that time.
   
   @runzhiwang , what do you think?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] runzhiwang closed pull request #383: RATIS-1273. Fix split brain by leader lease

Posted by GitBox <gi...@apache.org>.
runzhiwang closed pull request #383:
URL: https://github.com/apache/incubator-ratis/pull/383






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org