You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@ratis.apache.org by GitBox <gi...@apache.org> on 2020/06/19 11:58:58 UTC

[GitHub] [incubator-ratis] runzhiwang opened a new pull request #134: RATIS-982. Fix RaftServerImpl failed to change from RUNNING to RUNNING

runzhiwang opened a new pull request #134:
URL: https://github.com/apache/incubator-ratis/pull/134


   ## What changes were proposed in this pull request?
   
   What's the problem ?
   This happens in test, but it maybe also happen in production.
   
   For example, leader is s3 and follower is s4.
   1. kill s4, and restart s4.
   
   ```
   2020-06-19 07:03:18,095 [Thread-6194] INFO  ratis.MiniRaftCluster (MiniRaftCluster.java:killServer(458)) - killServer s4
   2020-06-19 07:03:18,095 [Thread-6194] INFO  ratis.MiniRaftCluster (MiniRaftCluster.java:newRaftServer(330)) - newRaftServer: s4, group-5BD7E8A01610:[s3:0.0.0.0:43375, s4:0.0.0.0:33719, s0:0.0.0.0:34867, s1:0.0.0.0:33783, s2:0.0.0.0:40473], format? false
   ```
   
   
   2. s4 start and set configuration from storage at [setRaftConf(raftConf.getLogEntryIndex(), raftConf) ](https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/ServerState.java#L170) and s4 will change to RUNNING at [lifeCycle.transition(RUNNING)](https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L213)
   
   
   
   ```
   2020-06-19 07:03:18,127 [pool-16-thread-1] INFO  impl.RaftServerImpl (ServerState.java:setRaftConf(356)) - s4@group-5BD7E8A01610: set configuration 0: [s3:0.0.0.0:43375, s4:0.0.0.0:33719, s0:0.0.0.0:34867, s1:0.0.0.0:33783, s2:0.0.0.0:40473], old=null at 0
   2020-06-19 07:03:18,153 [Thread-6194] INFO  impl.RaftServerImpl (RaftServerImpl.java:start(185)) - s4@group-5BD7E8A01610: start as a follower, conf=0: [s3:0.0.0.0:43375, s4:0.0.0.0:33719, s0:0.0.0.0:34867, s1:0.0.0.0:33783, s2:0.0.0.0:40473], old=null
   2020-06-19 07:03:18,153 [Thread-6194] INFO  impl.RaftServerImpl (RaftServerImpl.java:setRole(174)) - s4@group-5BD7E8A01610: changes role from      null to FOLLOWER at term 1 for startAsFollower
   ```
   
   
   
   3. s3 send append entry request to s4, and s4 change to RUNNING at [lifeCycle.compareAndTransition(STARTING, RUNNING)](https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L1003)
   
   `2020-06-19 07:03:18,162 [nioEventLoopGroup-59-1] DEBUG impl.RaftServerImpl (RaftServerImpl.java:logAppendEntries(918)) - s4@group-5BD7E8A01610: receive appendEntries(s3, 1, (t:1, i:0), 0, false, commits[s3:c0, s4:c0, s0:c0, s1:c0, s2:c0], entries: (t:1, i:1), STATEMACHINELOGENTRY, client-9414EC4E73DA, cid=3000`
   
   4. If change to RUNNING in step3 happens before step2, then step2 will throw exception.
   
   ```
   2020-06-19 07:03:18,169 [Thread-6194] INFO  impl.RoleInfo (RoleInfo.java:updateAndGet(143)) - s4: start FollowerState
   2020-06-19 07:03:18,174 [Thread-6194] ERROR netty.TestRaftWithNetty (ExitUtils.java:terminate(133)) - Terminating with exit status -1: Failed to kill/restart server: s4
   2020-06-19T07:03:18.1918474Z java.lang.IllegalStateException: ILLEGAL TRANSITION: In s4, RUNNING -> RUNNING
   2020-06-19T07:03:18.1918899Z 	at org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:63)
   2020-06-19T07:03:18.1919240Z 	at org.apache.ratis.util.LifeCycle$State.validate(LifeCycle.java:115)
   2020-06-19T07:03:18.1919558Z 	at org.apache.ratis.util.LifeCycle.transition(LifeCycle.java:155)
   2020-06-19T07:03:18.1919878Z 	at org.apache.ratis.server.impl.RaftServerImpl.startAsFollower(RaftServerImpl.java:214)
   2020-06-19T07:03:18.1920206Z 	at org.apache.ratis.server.impl.RaftServerImpl.start(RaftServerImpl.java:186)
   2020-06-19T07:03:18.1920520Z 	at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
   2020-06-19T07:03:18.1920839Z 	at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
   2020-06-19T07:03:18.1921330Z 	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
   2020-06-19T07:03:18.1921639Z 	at java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:290)
   2020-06-19T07:03:18.1921951Z 	at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731)
   2020-06-19T07:03:18.1922261Z 	at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
   2020-06-19T07:03:18.1922575Z 	at java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:401)
   2020-06-19T07:03:18.1922885Z 	at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:734)
   2020-06-19T07:03:18.1925464Z 	at java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:159)
   2020-06-19T07:03:18.1940816Z 	at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:173)
   2020-06-19T07:03:18.1953283Z 	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
   2020-06-19T07:03:18.1967610Z 	at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485)
   2020-06-19T07:03:18.1980549Z 	at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:650)
   2020-06-19T07:03:18.1991620Z 	at org.apache.ratis.server.impl.RaftServerProxy.start(RaftServerProxy.java:301)
   2020-06-19T07:03:18.1991958Z 	at org.apache.ratis.MiniRaftCluster.restartServer(MiniRaftCluster.java:312)
   2020-06-19T07:03:18.1992275Z 	at org.apache.ratis.MiniRaftCluster.restartServer(MiniRaftCluster.java:304)
   2020-06-19T07:03:18.1992609Z 	at org.apache.ratis.RaftBasicTests.lambda$killAndRestartServer$2(RaftBasicTests.java:100)
   2020-06-19T07:03:18.1992920Z 	at java.lang.Thread.run(Thread.java:748)
   ```
   
   
   
   
   ## What is the link to the Apache JIRA
   
   https://issues.apache.org/jira/browse/RATIS-982
   
   ## How was this patch tested?
   
   Existed tests.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] runzhiwang commented on a change in pull request #134: RATIS-982. Fix RaftServerImpl illegal transition from RUNNING to RUNNING

Posted by GitBox <gi...@apache.org>.
runzhiwang commented on a change in pull request #134:
URL: https://github.com/apache/incubator-ratis/pull/134#discussion_r443311277



##########
File path: ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java
##########
@@ -118,6 +121,8 @@
     this.jmxAdapter = new RaftServerJmxAdapter();
     this.leaderElectionMetrics = LeaderElectionMetrics.getLeaderElectionMetrics(this);
     this.raftServerMetrics = RaftServerMetrics.getRaftServerMetrics(this);
+
+    this.finishStart = new AtomicBoolean(false);

Review comment:
       @bshashikant Thanks for review. I have updated the patch.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] bshashikant commented on a change in pull request #134: RATIS-982. Fix RaftServerImpl illegal transition from RUNNING to RUNNING

Posted by GitBox <gi...@apache.org>.
bshashikant commented on a change in pull request #134:
URL: https://github.com/apache/incubator-ratis/pull/134#discussion_r443309101



##########
File path: ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java
##########
@@ -118,6 +121,8 @@
     this.jmxAdapter = new RaftServerJmxAdapter();
     this.leaderElectionMetrics = LeaderElectionMetrics.getLeaderElectionMetrics(this);
     this.raftServerMetrics = RaftServerMetrics.getRaftServerMetrics(this);
+
+    this.finishStart = new AtomicBoolean(false);

Review comment:
       Thanks @runzhiwang for discovering and providing this fix. This  is one of main reasons of several unit test failures.
   
   The patch looks good to me.
   Can we rename "finishStart" to something like startComplete or startDone? "finishStart" looks confusing tome. Can you also add a comment in the code why its important?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] runzhiwang commented on pull request #134: RATIS-982. Fix RaftServerImpl illegal transition from RUNNING to RUNNING

Posted by GitBox <gi...@apache.org>.
runzhiwang commented on pull request #134:
URL: https://github.com/apache/incubator-ratis/pull/134#issuecomment-646597353


   @bshashikant Could you help review this patch ? Thank you very much.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] bshashikant merged pull request #134: RATIS-982. Fix RaftServerImpl illegal transition from RUNNING to RUNNING

Posted by GitBox <gi...@apache.org>.
bshashikant merged pull request #134:
URL: https://github.com/apache/incubator-ratis/pull/134


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org