You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@pulsar.apache.org by GitBox <gi...@apache.org> on 2021/02/03 07:01:25 UTC

[GitHub] [pulsar] MarvinCai opened a new pull request #9443: Try to fix flaky LeaderElection test.

MarvinCai opened a new pull request #9443:
URL: https://github.com/apache/pulsar/pull/9443


   Fixes #9408
   
   ### Modifications
   Seems the test is flaky because when it kills current leader and wait for new leader, it only check single pulsar to verify leader has changed, but cache for other pulsars may not [refreshed](https://github.com/apache/pulsar/blob/fa41d02bebfd841767846240f3ae574047f118f0/pulsar-metadata/src/main/java/org/apache/pulsar/metadata/cache/impl/MetadataCacheImpl.java#L247) by the ZK watch yet, hence we got the Optional.Empty which is leader path doesn't exist and they try to become leader.
   
   So update to wait for leader changed for all pulsar, then verify they're all see same leader.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] MarvinCai commented on a change in pull request #9443: Try to fix flaky LeaderElection test.

Posted by GitBox <gi...@apache.org>.
MarvinCai commented on a change in pull request #9443:
URL: https://github.com/apache/pulsar/pull/9443#discussion_r569653105



##########
File path: pulsar-broker/src/test/java/org/apache/pulsar/broker/loadbalance/LoadBalancerTest.java
##########
@@ -176,23 +176,30 @@ void shutdown() throws Exception {
         bkEnsemble.stop();
     }
 
-    private LeaderBroker loopUntilLeaderChanges(LeaderElectionService les, LeaderBroker oldLeader,
-            LeaderBroker newLeader) throws InterruptedException {
+    private void loopUntilLeaderChangesForAllBroker(List<PulsarService> activePulsars, LeaderBroker oldLeader)
+            throws InterruptedException {
         int loopCount = 0;
+        boolean settled;
 
         while (loopCount < MAX_RETRIES) {
             Thread.sleep(1000);

Review comment:
       @lhotari PTAL




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] MarvinCai commented on pull request #9443: Try to fix flaky LeaderElection test.

Posted by GitBox <gi...@apache.org>.
MarvinCai commented on pull request #9443:
URL: https://github.com/apache/pulsar/pull/9443#issuecomment-774361183


   /pulsarbot run-failure-checks


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #9443: Try to fix flaky LeaderElection test.

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #9443:
URL: https://github.com/apache/pulsar/pull/9443#issuecomment-773084212


   > I'll rebase my PR if that one get merged first.
   
   @MarvinCai I meant to say earlier that it's fine that you don't rebase, it's better to get your changes merged asap and follow up any remaining issues later. The reason I'm saying this is that the build queue usually gets up to several hours and with all the flakiness, it's again a lot of retrying until these changes get merged...
   You can immediately locally test the changes by cherry-picking #9460 changes to a temporary local branch and checking if the flakiness gets fixed together with the changes in this PR.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] MarvinCai commented on pull request #9443: Try to fix flaky LeaderElection test.

Posted by GitBox <gi...@apache.org>.
MarvinCai commented on pull request #9443:
URL: https://github.com/apache/pulsar/pull/9443#issuecomment-773068016






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] MarvinCai commented on a change in pull request #9443: Try to fix flaky LeaderElection test.

Posted by GitBox <gi...@apache.org>.
MarvinCai commented on a change in pull request #9443:
URL: https://github.com/apache/pulsar/pull/9443#discussion_r569628833



##########
File path: pulsar-broker/src/test/java/org/apache/pulsar/broker/loadbalance/LoadBalancerTest.java
##########
@@ -176,23 +176,30 @@ void shutdown() throws Exception {
         bkEnsemble.stop();
     }
 
-    private LeaderBroker loopUntilLeaderChanges(LeaderElectionService les, LeaderBroker oldLeader,
-            LeaderBroker newLeader) throws InterruptedException {
+    private void loopUntilLeaderChangesForAllBroker(List<PulsarService> activePulsars, LeaderBroker oldLeader)
+            throws InterruptedException {
         int loopCount = 0;
+        boolean settled;
 
         while (loopCount < MAX_RETRIES) {
             Thread.sleep(1000);

Review comment:
       Yeah, I'll do that.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] merlimat commented on pull request #9443: Try to fix flaky LeaderElection test.

Posted by GitBox <gi...@apache.org>.
merlimat commented on pull request #9443:
URL: https://github.com/apache/pulsar/pull/9443#issuecomment-772711482


   I'll be looking at the tight loop in the async calls there


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] merlimat merged pull request #9443: Try to fix flaky LeaderElection test.

Posted by GitBox <gi...@apache.org>.
merlimat merged pull request #9443:
URL: https://github.com/apache/pulsar/pull/9443


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] MarvinCai commented on pull request #9443: Try to fix flaky LeaderElection test.

Posted by GitBox <gi...@apache.org>.
MarvinCai commented on pull request #9443:
URL: https://github.com/apache/pulsar/pull/9443#issuecomment-773080049


   @lhotari Oh, I didn't see that change, but was trying to do exactly the same thing, manually invalidating the cache entry after an election, cause I found the test will still fail quite often due to the out dated cache entry.
   I'll rebase my PR if that one get merged first.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #9443: Try to fix flaky LeaderElection test.

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #9443:
URL: https://github.com/apache/pulsar/pull/9443#issuecomment-773074297


   > Please don't merge for now, found some more issue we need to fix.
   
   @MarvinCai You are probably already aware of the fix #9460 that is also needed. If that's the case, I think these 2 changes could be merged separately and having a new PR to improve further. WDYT?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] sijie commented on pull request #9443: Try to fix flaky LeaderElection test.

Posted by GitBox <gi...@apache.org>.
sijie commented on pull request #9443:
URL: https://github.com/apache/pulsar/pull/9443#issuecomment-773868301


   /pulsarbot run-failure-checks


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #9443: Try to fix flaky LeaderElection test.

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #9443:
URL: https://github.com/apache/pulsar/pull/9443#issuecomment-773061318


   /pulsarbot run-failure-checks


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] 315157973 commented on a change in pull request #9443: Try to fix flaky LeaderElection test.

Posted by GitBox <gi...@apache.org>.
315157973 commented on a change in pull request #9443:
URL: https://github.com/apache/pulsar/pull/9443#discussion_r569561747



##########
File path: pulsar-broker/src/test/java/org/apache/pulsar/broker/loadbalance/LoadBalancerTest.java
##########
@@ -176,23 +176,30 @@ void shutdown() throws Exception {
         bkEnsemble.stop();
     }
 
-    private LeaderBroker loopUntilLeaderChanges(LeaderElectionService les, LeaderBroker oldLeader,
-            LeaderBroker newLeader) throws InterruptedException {
+    private void loopUntilLeaderChangesForAllBroker(List<PulsarService> activePulsars, LeaderBroker oldLeader)
+            throws InterruptedException {
         int loopCount = 0;
+        boolean settled;
 
         while (loopCount < MAX_RETRIES) {
             Thread.sleep(1000);
-            // Check if the new leader is elected. If yes, break without incrementing the loopCount
-            newLeader = les.getCurrentLeader().get();
-            if (newLeader.equals(oldLeader) == false) {
+            settled = true;
+            // Check if the all active pulsar see a new leader
+            for (PulsarService pulsar : activePulsars) {
+                Optional<LeaderBroker> leader = pulsar.getLeaderElectionService().readCurrentLeader().join();
+                if (leader.isPresent() && leader.get().equals(oldLeader)) {

Review comment:
       Good job ,I have a question:
   Since there will be a single follower who cannot see the new leader until timeout, why traversing all the followers can solve this problem?
   The current logic is: as long as there is no old leader in all followers, it will be successful, but there will still be cases where empty is returned.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] MarvinCai commented on pull request #9443: Try to fix flaky LeaderElection test.

Posted by GitBox <gi...@apache.org>.
MarvinCai commented on pull request #9443:
URL: https://github.com/apache/pulsar/pull/9443#issuecomment-774180857


   /pulsarbot run-failure-checks


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] sijie commented on pull request #9443: Try to fix flaky LeaderElection test.

Posted by GitBox <gi...@apache.org>.
sijie commented on pull request #9443:
URL: https://github.com/apache/pulsar/pull/9443#issuecomment-773868301


   /pulsarbot run-failure-checks


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] MarvinCai commented on pull request #9443: Try to fix flaky LeaderElection test.

Posted by GitBox <gi...@apache.org>.
MarvinCai commented on pull request #9443:
URL: https://github.com/apache/pulsar/pull/9443#issuecomment-773528787


   /pulsarbot run-failure-checks


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] MarvinCai edited a comment on pull request #9443: Try to fix flaky LeaderElection test.

Posted by GitBox <gi...@apache.org>.
MarvinCai edited a comment on pull request #9443:
URL: https://github.com/apache/pulsar/pull/9443#issuecomment-773068016


   Please don't merge for now, found some more issue we need to fix.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] MarvinCai edited a comment on pull request #9443: Try to fix flaky LeaderElection test.

Posted by GitBox <gi...@apache.org>.
MarvinCai edited a comment on pull request #9443:
URL: https://github.com/apache/pulsar/pull/9443#issuecomment-773068016


   Please don't merge for now, found some more issue we need to fix.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] MarvinCai commented on pull request #9443: Try to fix flaky LeaderElection test.

Posted by GitBox <gi...@apache.org>.
MarvinCai commented on pull request #9443:
URL: https://github.com/apache/pulsar/pull/9443#issuecomment-773068016


   Please don't merge yet, found some more issue we need to fix.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] MarvinCai commented on pull request #9443: Try to fix flaky LeaderElection test.

Posted by GitBox <gi...@apache.org>.
MarvinCai commented on pull request #9443:
URL: https://github.com/apache/pulsar/pull/9443#issuecomment-772706265


   > @MarvinCai good work! I added a suggestion of using Awaitility for the waiting loop.
   > 
   > I have some concerns that the production code has some issues. I wrote about my observations in [#9408 (comment)](https://github.com/apache/pulsar/issues/9408#issuecomment-771552894) . There was a NPE stack trace with maximum 1024 stack frames.
   > 
   > [The NPE is easy to fix](https://github.com/apache/pulsar/commit/e8361f3656247a43c8928353d7e0a59784ddd25e)), but the depth of the stack is the concern. I did a quick check in the code and it looks like backoff with some delay is missing in some cases and the leader election can get into a very tight loop. I was hoping that @merlimat could check the comments in [#9408 (comment)](https://github.com/apache/pulsar/issues/9408#issuecomment-771561839) before #9408 is closed. I guess the other option is that I file a separate bug report about the observation.
   
   We can probably wait for couple days and if we don't get reply from him we can probably open another issue for that NPE. I did also see this exception couple of times, but not sure if it's causing any real problem.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] 315157973 commented on a change in pull request #9443: Try to fix flaky LeaderElection test.

Posted by GitBox <gi...@apache.org>.
315157973 commented on a change in pull request #9443:
URL: https://github.com/apache/pulsar/pull/9443#discussion_r569908123



##########
File path: pulsar-broker/src/test/java/org/apache/pulsar/broker/loadbalance/LoadBalancerTest.java
##########
@@ -176,23 +176,30 @@ void shutdown() throws Exception {
         bkEnsemble.stop();
     }
 
-    private LeaderBroker loopUntilLeaderChanges(LeaderElectionService les, LeaderBroker oldLeader,
-            LeaderBroker newLeader) throws InterruptedException {
+    private void loopUntilLeaderChangesForAllBroker(List<PulsarService> activePulsars, LeaderBroker oldLeader)
+            throws InterruptedException {
         int loopCount = 0;
+        boolean settled;
 
         while (loopCount < MAX_RETRIES) {
             Thread.sleep(1000);
-            // Check if the new leader is elected. If yes, break without incrementing the loopCount
-            newLeader = les.getCurrentLeader().get();
-            if (newLeader.equals(oldLeader) == false) {
+            settled = true;
+            // Check if the all active pulsar see a new leader
+            for (PulsarService pulsar : activePulsars) {
+                Optional<LeaderBroker> leader = pulsar.getLeaderElectionService().readCurrentLeader().join();
+                if (leader.isPresent() && leader.get().equals(oldLeader)) {

Review comment:
       Seems to be the reason. Therefore, the unit test should cover the scenario where the leader of the follower is always empty.
   https://github.com/apache/pulsar/pull/9460
   
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] 315157973 commented on pull request #9443: Try to fix flaky LeaderElection test.

Posted by GitBox <gi...@apache.org>.
315157973 commented on pull request #9443:
URL: https://github.com/apache/pulsar/pull/9443#issuecomment-773805606


   @MarvinCai  This PR is merged https://github.com/apache/pulsar/pull/9460, please go on


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] 315157973 commented on pull request #9443: Try to fix flaky LeaderElection test.

Posted by GitBox <gi...@apache.org>.
315157973 commented on pull request #9443:
URL: https://github.com/apache/pulsar/pull/9443#issuecomment-773805606


   @MarvinCai  This PR is merged https://github.com/apache/pulsar/pull/9460, please go on


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #9443: Try to fix flaky LeaderElection test.

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #9443:
URL: https://github.com/apache/pulsar/pull/9443#issuecomment-773061318






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] MarvinCai commented on a change in pull request #9443: Try to fix flaky LeaderElection test.

Posted by GitBox <gi...@apache.org>.
MarvinCai commented on a change in pull request #9443:
URL: https://github.com/apache/pulsar/pull/9443#discussion_r569627190



##########
File path: pulsar-broker/src/test/java/org/apache/pulsar/broker/loadbalance/LoadBalancerTest.java
##########
@@ -176,23 +176,30 @@ void shutdown() throws Exception {
         bkEnsemble.stop();
     }
 
-    private LeaderBroker loopUntilLeaderChanges(LeaderElectionService les, LeaderBroker oldLeader,
-            LeaderBroker newLeader) throws InterruptedException {
+    private void loopUntilLeaderChangesForAllBroker(List<PulsarService> activePulsars, LeaderBroker oldLeader)
+            throws InterruptedException {
         int loopCount = 0;
+        boolean settled;
 
         while (loopCount < MAX_RETRIES) {
             Thread.sleep(1000);
-            // Check if the new leader is elected. If yes, break without incrementing the loopCount
-            newLeader = les.getCurrentLeader().get();
-            if (newLeader.equals(oldLeader) == false) {
+            settled = true;
+            // Check if the all active pulsar see a new leader
+            for (PulsarService pulsar : activePulsars) {
+                Optional<LeaderBroker> leader = pulsar.getLeaderElectionService().readCurrentLeader().join();
+                if (leader.isPresent() && leader.get().equals(oldLeader)) {

Review comment:
       @315157973 I think the old logic is it pick the last follower it seen and check if it sees the new leader, which is in this chunk of code ([ref1](https://github.com/apache/pulsar/blob/fd7da5210b59fe9fd7b2619534e8122ba7b2701a/pulsar-broker/src/test/java/org/apache/pulsar/broker/loadbalance/LoadBalancerTest.java#L730), [ref2](https://github.com/apache/pulsar/blob/fd7da5210b59fe9fd7b2619534e8122ba7b2701a/pulsar-broker/src/test/java/org/apache/pulsar/broker/loadbalance/LoadBalancerTest.java#L186-L188)).
   And then it just do the check which can't guarantee all follower already saw a new leader since for some follower which try to become leader, it'll first see empty path "/loadbalance/leader", read it in cache as [Optional.Empty] then try to create the znode, but after it fail(other node already create that znode and become leader) and before it's cache get updated by zk watch, there're might be some delay so old test can still see that [Optinal.Empty]. So loop through all followers and making sure all of them already a new leader, then check all of them see the same leader can solve the problem.
   Does it make sense?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on a change in pull request #9443: Try to fix flaky LeaderElection test.

Posted by GitBox <gi...@apache.org>.
lhotari commented on a change in pull request #9443:
URL: https://github.com/apache/pulsar/pull/9443#discussion_r569229529



##########
File path: pulsar-broker/src/test/java/org/apache/pulsar/broker/loadbalance/LoadBalancerTest.java
##########
@@ -176,23 +176,30 @@ void shutdown() throws Exception {
         bkEnsemble.stop();
     }
 
-    private LeaderBroker loopUntilLeaderChanges(LeaderElectionService les, LeaderBroker oldLeader,
-            LeaderBroker newLeader) throws InterruptedException {
+    private void loopUntilLeaderChangesForAllBroker(List<PulsarService> activePulsars, LeaderBroker oldLeader)
+            throws InterruptedException {
         int loopCount = 0;
+        boolean settled;
 
         while (loopCount < MAX_RETRIES) {
             Thread.sleep(1000);

Review comment:
       It would be better to replace this loop with Awaitility. An example of using Awaitility for this type of case is here: https://github.com/lhotari/pulsar/commit/dc1696030c608fab288b0f8f9921682a702ed1a7#diff-ff0def4f209a344e2ebc9f9d3a6ab32bf67ed12b9c7c6a968a110a6e69c7ed07R192-R196
   
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] 315157973 commented on a change in pull request #9443: Try to fix flaky LeaderElection test.

Posted by GitBox <gi...@apache.org>.
315157973 commented on a change in pull request #9443:
URL: https://github.com/apache/pulsar/pull/9443#discussion_r569880642



##########
File path: pulsar-broker/src/test/java/org/apache/pulsar/broker/loadbalance/LoadBalancerTest.java
##########
@@ -176,23 +176,30 @@ void shutdown() throws Exception {
         bkEnsemble.stop();
     }
 
-    private LeaderBroker loopUntilLeaderChanges(LeaderElectionService les, LeaderBroker oldLeader,
-            LeaderBroker newLeader) throws InterruptedException {
+    private void loopUntilLeaderChangesForAllBroker(List<PulsarService> activePulsars, LeaderBroker oldLeader)
+            throws InterruptedException {
         int loopCount = 0;
+        boolean settled;
 
         while (loopCount < MAX_RETRIES) {
             Thread.sleep(1000);
-            // Check if the new leader is elected. If yes, break without incrementing the loopCount
-            newLeader = les.getCurrentLeader().get();
-            if (newLeader.equals(oldLeader) == false) {
+            settled = true;
+            // Check if the all active pulsar see a new leader
+            for (PulsarService pulsar : activePulsars) {
+                Optional<LeaderBroker> leader = pulsar.getLeaderElectionService().readCurrentLeader().join();
+                if (leader.isPresent() && leader.get().equals(oldLeader)) {

Review comment:
       I tried to extend the waiting time, and some followers still could not see the new leader. That is: when a leader switch occurs, some followers can never see the new leader, and all they read are empty




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] MarvinCai commented on pull request #9443: Try to fix flaky LeaderElection test.

Posted by GitBox <gi...@apache.org>.
MarvinCai commented on pull request #9443:
URL: https://github.com/apache/pulsar/pull/9443#issuecomment-773140931


   /pulsarbot run-failure-checks


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #9443: Try to fix flaky LeaderElection test.

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #9443:
URL: https://github.com/apache/pulsar/pull/9443#issuecomment-772344154


   @MarvinCai good work! I added a suggestion of using Awaitility for the waiting loop. 
   
   I have some concerns that the production code has some issues. I wrote about my observations in https://github.com/apache/pulsar/issues/9408#issuecomment-771552894 . There was a NPE stack trace with maximum 1024 stack frames. 
   
   [The NPE is easy to fix](https://github.com/apache/pulsar/commit/e8361f3656247a43c8928353d7e0a59784ddd25e)), but the depth of the stack is the concern. I did a quick check in the code and it looks like backoff with some delay is missing in some cases and the leader election can get into a very tight loop. I was hoping that @merlimat could check the comments in https://github.com/apache/pulsar/issues/9408#issuecomment-771561839 before #9408 is closed. I guess the other option is that I file a separate bug report about the observation.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] 315157973 commented on a change in pull request #9443: Try to fix flaky LeaderElection test.

Posted by GitBox <gi...@apache.org>.
315157973 commented on a change in pull request #9443:
URL: https://github.com/apache/pulsar/pull/9443#discussion_r569908123



##########
File path: pulsar-broker/src/test/java/org/apache/pulsar/broker/loadbalance/LoadBalancerTest.java
##########
@@ -176,23 +176,30 @@ void shutdown() throws Exception {
         bkEnsemble.stop();
     }
 
-    private LeaderBroker loopUntilLeaderChanges(LeaderElectionService les, LeaderBroker oldLeader,
-            LeaderBroker newLeader) throws InterruptedException {
+    private void loopUntilLeaderChangesForAllBroker(List<PulsarService> activePulsars, LeaderBroker oldLeader)
+            throws InterruptedException {
         int loopCount = 0;
+        boolean settled;
 
         while (loopCount < MAX_RETRIES) {
             Thread.sleep(1000);
-            // Check if the new leader is elected. If yes, break without incrementing the loopCount
-            newLeader = les.getCurrentLeader().get();
-            if (newLeader.equals(oldLeader) == false) {
+            settled = true;
+            // Check if the all active pulsar see a new leader
+            for (PulsarService pulsar : activePulsars) {
+                Optional<LeaderBroker> leader = pulsar.getLeaderElectionService().readCurrentLeader().join();
+                if (leader.isPresent() && leader.get().equals(oldLeader)) {

Review comment:
       Seems to be the reason
   https://github.com/apache/pulsar/pull/9460
   
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org