Posted to issues@hbase.apache.org by GitBox <gi...@apache.org> on 2020/05/13 23:55:46 UTC

[GitHub] [hbase] ndimiduk commented on a change in pull request #1690: HBASE-24327 : Flaky connection in TestMasterShutdown#testMasterShutdo…

ndimiduk commented on a change in pull request #1690:
URL: https://github.com/apache/hbase/pull/1690#discussion_r424789318



##########
File path: hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestMasterShutdown.java
##########
@@ -151,19 +152,46 @@ public void testMasterShutdownBeforeStartingAnyRegionServer() throws Exception {
       hbaseCluster = new LocalHBaseCluster(htu.getConfiguration(), options.getNumMasters(),
         options.getNumRegionServers(), options.getMasterClass(), options.getRsClass());
       final MasterThread masterThread = hbaseCluster.getMasters().get(0);
+
       masterThread.start();
-      // Switching to master registry exacerbated a race in the master bootstrap that can result
-      // in a lost shutdown command (HBASE-8422, HBASE-23836). The race is essentially because
-      // the server manager in HMaster is not initialized by the time shutdown() RPC (below) is
-      // made to the master. The suspected reason as to why it was uncommon before HBASE-18095
-      // is because the connection creation with ZK registry is so slow that by then the server
-      // manager is usually init'ed in time for the RPC to be made. For now, adding an explicit
-      // wait() in the test, waiting for the server manager to become available.
-      final long timeout = TimeUnit.MINUTES.toMillis(10);
-      assertNotEquals("Timeout waiting for server manager to become available.",
-        -1, Waiter.waitFor(htu.getConfiguration(), timeout,
-          () -> masterThread.getMaster().getServerManager() != null));
-      htu.getConnection().getAdmin().shutdown();
+      final CompletableFuture<Void> shutdownFuture = CompletableFuture.runAsync(() -> {

Review comment:
       Ah, and that one smells a bit like HBASE-23836.

##########
File path: hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestMasterShutdown.java
##########
@@ -151,19 +152,46 @@ public void testMasterShutdownBeforeStartingAnyRegionServer() throws Exception {
       hbaseCluster = new LocalHBaseCluster(htu.getConfiguration(), options.getNumMasters(),
         options.getNumRegionServers(), options.getMasterClass(), options.getRsClass());
       final MasterThread masterThread = hbaseCluster.getMasters().get(0);
+
       masterThread.start();
-      // Switching to master registry exacerbated a race in the master bootstrap that can result
-      // in a lost shutdown command (HBASE-8422, HBASE-23836). The race is essentially because
-      // the server manager in HMaster is not initialized by the time shutdown() RPC (below) is
-      // made to the master. The suspected reason as to why it was uncommon before HBASE-18095
-      // is because the connection creation with ZK registry is so slow that by then the server
-      // manager is usually init'ed in time for the RPC to be made. For now, adding an explicit
-      // wait() in the test, waiting for the server manager to become available.
-      final long timeout = TimeUnit.MINUTES.toMillis(10);
-      assertNotEquals("Timeout waiting for server manager to become available.",
-        -1, Waiter.waitFor(htu.getConfiguration(), timeout,
-          () -> masterThread.getMaster().getServerManager() != null));
-      htu.getConnection().getAdmin().shutdown();
+      final CompletableFuture<Void> shutdownFuture = CompletableFuture.runAsync(() -> {
+        // Switching to master registry exacerbated a race in the master bootstrap that can result
+        // in a lost shutdown command (HBASE-8422, HBASE-23836). The race is essentially because
+        // the server manager in HMaster is not initialized by the time shutdown() RPC (below) is
+        // made to the master. The suspected reason as to why it was uncommon before HBASE-18095
+        // is because the connection creation with ZK registry is so slow that by then the server
+        // manager is usually init'ed in time for the RPC to be made. For now, adding an explicit
+        // wait() in the test, waiting for the server manager to become available.
+        final long timeout = TimeUnit.MINUTES.toMillis(10);
+        assertNotEquals("timeout waiting for server manager to become available.", -1,
+          htu.waitFor(timeout, () -> masterThread.getMaster().getServerManager() != null));
+
+        // Master has come up far enough that we can terminate it without creating a zombie.
+        LOG.debug("Attempting to establish connection.");
+        try {
+          // HBASE-24327 (resolve flaky connection issues):
+          // the shutdown() RPC can hit flaky ZK connection issues, e.g.
+          // ERROR [RpcServer.priority.RWQ.Fifo.read.handler=1,queue=1,port=53033]
+          // master.HMaster(2878): ZooKeeper exception trying to set cluster as down in ZK
+          // org.apache.zookeeper.KeeperException$SystemErrorException:
+          // KeeperErrorCode = SystemError
+          //
+          // Even when such a flake happens, the shutdown still completes on the master despite
+          // the RPC reporting a failure, so a subsequent retry can never succeed because the
+          // HMaster is already shut down, and the test would fail. To handle this, we issue the
+          // shutdown() call once and ignore any IOException it throws.
+          htu.getConnection().getAdmin().shutdown();
+          LOG.info("Shutdown RPC sent.");
+        } catch (IOException | CompletionException e) {
+          LOG.warn("Failed to establish connection.", e);

Review comment:
       Thanks for the links to the previous commits. GH makes it difficult for me to follow the previous conversation.
   
    Okay, so if you move this code out of the `CompletableFuture`, the `CompletionException` can just be re-thrown. It indicates a client-side problem, and there's nothing the test should try to recover from.
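    
    Roughly what I have in mind -- just an untested sketch of one way to read that suggestion, where `htu`, `timeout`, and `masterThread` are the existing variables from the quoted hunk, not new names:
    
    ```java
    // Sketch: do the wait and the shutdown RPC on the test thread. With nothing
    // wrapped in a CompletableFuture there is no CompletionException to unwrap,
    // and an IOException from shutdown() simply propagates and fails the test.
    assertNotEquals("timeout waiting for server manager to become available.", -1,
      htu.waitFor(timeout, () -> masterThread.getMaster().getServerManager() != null));
    htu.getConnection().getAdmin().shutdown();
    ```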
   
    I would not catch `IOException` either; just let it bubble up. Instead, I would focus on meaningful subclasses of `IOException`. `RetriesExhaustedException` is a good place to work from: you know your client made some effort. There are potentially multiple cause instances under there, so I guess just pick one to work with, probably the last one. If it's a descendant of `java.net.SocketException`, you know the client couldn't get anywhere -- how should the test behave if the master is not running?
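    
    Something like the following is the shape I mean -- purely a sketch, and the cause-walking via the standard `getCause()` chain is an approximation on my part rather than anything `RetriesExhaustedException` promises:
    
    ```java
    // Sketch only: classify the failure instead of swallowing every IOException.
    try {
      htu.getConnection().getAdmin().shutdown();
    } catch (org.apache.hadoop.hbase.client.RetriesExhaustedException e) {
      // The client made a real effort; look at the innermost cause to decide why it gave up.
      Throwable cause = e;
      while (cause.getCause() != null) {
        cause = cause.getCause();
      }
      if (cause instanceof java.net.SocketException) {
        // The client never reached the master at all; decide explicitly what the test
        // should do in this case rather than ignoring it.
        throw e;
      }
      // Anything else suggests the RPC plausibly reached the server, so treat the
      // shutdown request as delivered and move on.
      LOG.info("shutdown() retries exhausted after the request reached the master", e);
    }
    ```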
   
    After that, I'm not really sure what's thrown in which cases. Our APIs' checked exception definitions aren't strong enough for me to know just by reading the interfaces. However, pretty much anything else means the client managed to get the RPC over to the server. I think that fact alone is enough to consider that this part of the test has succeeded at its goal. Until the myriad other issues in master startup and shutdown are resolved, I think this is the best the client can hope for (for what it's worth, I think this test will continue to be flaky until those master-side problems are solved, and they cannot be resolved perfectly by client-side gymnastics).
   
   You've looked at it, and seen the errors and test failures, more recently than I have, so what do you think? What brought you to this ticket in the first place? Are there more specific subclasses of `IOException` that are thrown, which you can use to reasonably address that specific condition? Paste the stack traces into Jira and the commit message so we can follow your effort.
   
   And thank you for your effort -- test fixing is usually thankless drudgery but it makes all of our lives better :)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org