You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@bookkeeper.apache.org by GitBox <gi...@apache.org> on 2022/07/23 08:24:13 UTC

[GitHub] [bookkeeper] wenbingshen opened a new pull request, #3418: fix Flaky-test: BookieZKExpireTest.testBookieServerZKSessionExpireBehaviour

wenbingshen opened a new pull request, #3418:
URL: https://github.com/apache/bookkeeper/pull/3418

   ### Motivation
   
   fix Flaky-test: `BookieZKExpireTest.testBookieServerZKSessionExpireBehaviour` by disable retry session expired.
   
   issue #3206  was fixed by PR #3415, but it still introduced another flaky test: `BookieZKExpireTest.testBookieServerZKSessionExpireBehaviour`
   
   According to the investigation, the reason is: 
   `ZookeeperClient` still creates a new `zookeeper` instance when the old zookeeper client session time out.
   
   Due to the asynchronous execution of two threads executing bookie temporary node re-registration and zk instance re-creation, the test program sometimes succeeds and sometimes fails.
   
   1. When the temporary node re-registration is performed before the zk re-instantiation, the temporary node creation will use the old zk instance, which will cause a session timeout error, the bookie service will be shutdown, and the test will be successful;
    
   2. When the zk re-instantiation precedes the re-registration of the temporary node, the temporary node creation will use the new re-instantiated zk instance, then the temporary node will be successfully created, the bookie service is running normally, and the test fails.
   
   ```java
           try {
               connectExecutor.submit(clientCreator);
           } catch (RejectedExecutionException ree) {
               if (!closed.get()) {
                   logger.error("ZooKeeper reconnect task is rejected : ", ree);
               }
           } catch (Exception t) {
               logger.error("Failed to submit zookeeper reconnect task due to runtime exception : ", t);
           }
   ```
   
   ### Changes
   Add a `retryExpired` flag to indicate whether to run the zk instance and retry to create a new instance after the session times out.
   Set this flag to false for `ZKMetadataBookieDriver`;
   Other ZookeeperClient and normal ZookeeperClient applications will generate the default value true or set to true, which is consistent with the original behavior.
   
   ### Test the behavior of this PR:
   **Before this PR:**
   Executed the test 10 times, all failed.
   <img width="1111" alt="image" src="https://user-images.githubusercontent.com/35599757/180596967-9f3f4300-6ba6-4989-b35e-a275a955d139.png">
   
   **After this PR:**
   Executed the test 10 times, all successful.
   <img width="894" alt="image" src="https://user-images.githubusercontent.com/35599757/180597032-01318ed6-e498-4ba4-8bd7-fb081ed8245a.png">
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [bookkeeper] StevenLuMT commented on a diff in pull request #3418: fix Flaky-test: BookieZKExpireTest.testBookieServerZKSessionExpireBehaviour

Posted by GitBox <gi...@apache.org>.
StevenLuMT commented on code in PR #3418:
URL: https://github.com/apache/bookkeeper/pull/3418#discussion_r928182181


##########
bookkeeper-server/src/main/java/org/apache/bookkeeper/meta/zk/ZKMetadataDriverBase.java:
##########
@@ -165,7 +165,8 @@ public String getScheme() {
     protected void initialize(AbstractConfiguration<?> conf,
                               StatsLogger statsLogger,
                               RetryPolicy zkRetryPolicy,
-                              Optional<Object> optionalCtx) throws MetadataException {
+                              Optional<Object> optionalCtx,
+                              boolean zkRetryExpired) throws MetadataException {

Review Comment:
   this check(zkRetryExpired) is difficult to understand, what scene is set to true and what scene is set to false



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [bookkeeper] StevenLuMT commented on pull request #3418: fix Flaky-test: BookieZKExpireTest.testBookieServerZKSessionExpireBehaviour

Posted by GitBox <gi...@apache.org>.
StevenLuMT commented on PR #3418:
URL: https://github.com/apache/bookkeeper/pull/3418#issuecomment-1225339208

   fix old workflow,please see #3455 for detail


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] fix Flaky-test: BookieZKExpireTest.testBookieServerZKSessionExpireBehaviour [bookkeeper]

Posted by "hangc0276 (via GitHub)" <gi...@apache.org>.
hangc0276 commented on PR #3418:
URL: https://github.com/apache/bookkeeper/pull/3418#issuecomment-1841940934

   > @dlg99
   > 
   > > can you predictably repro the problem? Maybe my fix was not good enough :( but I only could repro it under docker with limited number of CPUs and after the fix it did not fail after running in the loop 150 times.
   > > BookieStateManager's registration listener will shutdown the bookie if it cannot reconnect immediately so I guess there is still some timing issue between it and the test killing zk session / letting it recover.
   > 
   > When I run the test in my local, it is very flaky
   
   @poorbarcode There is a race condition describe in the motivation.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [bookkeeper] HQebupt commented on pull request #3418: fix Flaky-test: BookieZKExpireTest.testBookieServerZKSessionExpireBehaviour

Posted by GitBox <gi...@apache.org>.
HQebupt commented on PR #3418:
URL: https://github.com/apache/bookkeeper/pull/3418#issuecomment-1193110999

   @StevenLuMT PTAL


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [bookkeeper] dlg99 commented on pull request #3418: fix Flaky-test: BookieZKExpireTest.testBookieServerZKSessionExpireBehaviour

Posted by GitBox <gi...@apache.org>.
dlg99 commented on PR #3418:
URL: https://github.com/apache/bookkeeper/pull/3418#issuecomment-1194812868

   can you predictably repro the problem? Maybe my fix was not good enough :( but I only could repro it under docker with limited number of CPUs and after the fix it did not fail after running in the loop 150 times.
   BookieStateManager's registration listener will shutdown the bookie if it cannot reconnect immediately so I guess there is still some timing issue between it and the test killing zk session / letting it recover. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [bookkeeper] wenbingshen commented on pull request #3418: fix Flaky-test: BookieZKExpireTest.testBookieServerZKSessionExpireBehaviour

Posted by GitBox <gi...@apache.org>.
wenbingshen commented on PR #3418:
URL: https://github.com/apache/bookkeeper/pull/3418#issuecomment-1193087365

   ping @dlg99 @hangc0276 @Shoothzj @eolivelli PTAL.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [bookkeeper] wenbingshen commented on a diff in pull request #3418: fix Flaky-test: BookieZKExpireTest.testBookieServerZKSessionExpireBehaviour

Posted by GitBox <gi...@apache.org>.
wenbingshen commented on code in PR #3418:
URL: https://github.com/apache/bookkeeper/pull/3418#discussion_r928096822


##########
bookkeeper-server/src/test/java/org/apache/bookkeeper/test/BookieZKExpireTest.java:
##########
@@ -177,13 +179,15 @@ conf, new TestBookieImpl(conf),
             log.info("Resuming threads");
             sendthread.resume();
 
-            // allow watcher thread to run
-            Thread.sleep(3000);
+            // allow client.waitForConnection() timeout
+            Thread.sleep(10000);
             assertFalse("Bookie should shutdown on losing zk session", server.isBookieRunning());
             assertFalse("Bookie Server should shutdown on losing zk session", server.isRunning());

Review Comment:
   The increase from 3 seconds to 10 seconds is because creating a bookie temporary node needs to wait for the old zk client to establish a connection with the server. There is a timeout period of conf.getZkTimeout()=6 seconds. Once the timeout period expires, the bookie service will Closed due to session timeout, the extra 4 seconds is to consolidate the stability of the test and give the bookie service a little extra time to perform the shutdown.



##########
bookkeeper-server/src/main/java/org/apache/bookkeeper/zookeeper/ZooKeeperClient.java:
##########
@@ -356,6 +366,13 @@ private void onExpired() {
 
         logger.info("ZooKeeper session {} is expired from {}.",
                 Long.toHexString(getSessionId()), connectString);
+
+        if (!retryExpired) {
+            logger.warn("ZooKeeper session expired retries have been disabled");
+            return;
+        }
+

Review Comment:
   Recreating new zookeeper instances is prohibited here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] fix Flaky-test: BookieZKExpireTest.testBookieServerZKSessionExpireBehaviour [bookkeeper]

Posted by "poorbarcode (via GitHub)" <gi...@apache.org>.
poorbarcode commented on PR #3418:
URL: https://github.com/apache/bookkeeper/pull/3418#issuecomment-1840379493

   @dlg99 
   
   > can you predictably repro the problem? Maybe my fix was not good enough :( but I only could repro it under docker with limited number of CPUs and after the fix it did not fail after running in the loop 150 times.
   BookieStateManager's registration listener will shutdown the bookie if it cannot reconnect immediately so I guess there is still some timing issue between it and the test killing zk session / letting it recover.
   
   When I run the test in my local, it is very flaky


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org