You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hbase.apache.org by GitBox <gi...@apache.org> on 2020/07/19 05:02:18 UTC

[GitHub] [hbase] busbey commented on a change in pull request #2084: HBASE-22263 Master creates duplicate ServerCrashProcedure on initiali…

busbey commented on a change in pull request #2084:
URL: https://github.com/apache/hbase/pull/2084#discussion_r456858152



##########
File path: hbase-server/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
##########
@@ -846,10 +849,37 @@ private void finishActiveMasterInitialization(MonitoredTask status)
     if (isStopped()) return;
 
     status.setStatus("Submitting log splitting work for previously failed region servers");
+
+    // grab the list of procedures once. SCP fom pre-crash should all be loaded, and can't progress
+    // until AM joins the cluster any SCPs that got added after we get the log folder list should be
+    // for a different start code.
+    final Set<ServerName> alreadyHasSCP = new HashSet<>();
+    long scpCount = 0;
+    for (ProcedureInfo procInfo : this.procedureExecutor.listProcedures() ) {
+      final Procedure proc = this.procedureExecutor.getProcedure(procInfo.getProcId());
+      if (proc != null) {
+        if (proc instanceof ServerCrashProcedure && !(proc.isFinished() || proc.isSuccess())) {
+          scpCount++;
+          alreadyHasSCP.add(((ServerCrashProcedure)proc).getServerName());
+        }
+      }
+    }
+    LOG.info("Restored proceduces include " + scpCount + " SCP covering " + alreadyHasSCP.size() +
+        " ServerName.");
+    
+ 
+    LOG.info("Checking " + previouslyFailedServers.size() + " previously failed servers (seen via wals) for existing SCP.");
+    // AM should be in "not yet init" and these should all be queued
     // Master has recovered hbase:meta region server and we put
     // other failed region servers in a queue to be handled later by SSH
     for (ServerName tmpServer : previouslyFailedServers) {
-      this.serverManager.processDeadServer(tmpServer, true);
+      if (alreadyHasSCP.contains(tmpServer)) {
+        LOG.info("Skipping failed server in FS because it already has a queued SCP: " + tmpServer);
+        this.serverManager.getDeadServers().add(tmpServer);

Review comment:
       does a queued SCP imply that the server should already be in the dead servers list? Or do we only add servers to that when we create an SCP and not when we recover them?

##########
File path: hbase-server/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
##########
@@ -846,10 +849,37 @@ private void finishActiveMasterInitialization(MonitoredTask status)
     if (isStopped()) return;
 
     status.setStatus("Submitting log splitting work for previously failed region servers");
+
+    // grab the list of procedures once. SCP fom pre-crash should all be loaded, and can't progress
+    // until AM joins the cluster any SCPs that got added after we get the log folder list should be
+    // for a different start code.
+    final Set<ServerName> alreadyHasSCP = new HashSet<>();
+    long scpCount = 0;
+    for (ProcedureInfo procInfo : this.procedureExecutor.listProcedures() ) {
+      final Procedure proc = this.procedureExecutor.getProcedure(procInfo.getProcId());
+      if (proc != null) {
+        if (proc instanceof ServerCrashProcedure && !(proc.isFinished() || proc.isSuccess())) {
+          scpCount++;
+          alreadyHasSCP.add(((ServerCrashProcedure)proc).getServerName());
+        }
+      }
+    }
+    LOG.info("Restored proceduces include " + scpCount + " SCP covering " + alreadyHasSCP.size() +
+        " ServerName.");
+    
+ 
+    LOG.info("Checking " + previouslyFailedServers.size() + " previously failed servers (seen via wals) for existing SCP.");
+    // AM should be in "not yet init" and these should all be queued
     // Master has recovered hbase:meta region server and we put
     // other failed region servers in a queue to be handled later by SSH
     for (ServerName tmpServer : previouslyFailedServers) {
-      this.serverManager.processDeadServer(tmpServer, true);
+      if (alreadyHasSCP.contains(tmpServer)) {
+        LOG.info("Skipping failed server in FS because it already has a queued SCP: " + tmpServer);
+        this.serverManager.getDeadServers().add(tmpServer);

Review comment:
       this looks like what's different from my old patch, is that right? have I missed anything else?

##########
File path: hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
##########
@@ -681,11 +681,15 @@ public synchronized void processDeadServer(final ServerName serverName, boolean
     // the handler threads and meta table could not be re-assigned in case
     // the corresponding server is down. So we queue them up here instead.
     if (!services.getAssignmentManager().isFailoverCleanupDone()) {
+      LOG.debug("AssignmentManager isn't done cleaning up from failover. Requeue server " + serverName);
       requeuedDeadServers.put(serverName, shouldSplitWal);
       return;
     }
 
+    // we don't chck if deadservers already included?
+    // when a server is already in the dead server list (including start code) do we need to schedule an SCP?

Review comment:
       these comments should be removed.

##########
File path: hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
##########
@@ -681,11 +681,15 @@ public synchronized void processDeadServer(final ServerName serverName, boolean
     // the handler threads and meta table could not be re-assigned in case
     // the corresponding server is down. So we queue them up here instead.
     if (!services.getAssignmentManager().isFailoverCleanupDone()) {
+      LOG.debug("AssignmentManager isn't done cleaning up from failover. Requeue server " + serverName);
       requeuedDeadServers.put(serverName, shouldSplitWal);
       return;
     }
 
+    // we don't chck if deadservers already included?
+    // when a server is already in the dead server list (including start code) do we need to schedule an SCP?
     this.deadservers.add(serverName);
+    // scheduled an SCP means AM must be going?

Review comment:
       this one should be removed too

##########
File path: hbase-server/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
##########
@@ -860,6 +890,8 @@ private void finishActiveMasterInitialization(MonitoredTask status)
 
     // Fix up assignment manager status
     status.setStatus("Starting assignment manager");
+    // die somewhere in here for SCP flood I think.

Review comment:
       we can leave out the speculative comments now?

##########
File path: hbase-server/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
##########
@@ -2733,6 +2766,7 @@ public boolean isServerCrashProcessingEnabled() {
     return serverCrashProcessingEnabled.isReady();
   }
 
+  // XXX what

Review comment:
       lol. I still find this curious, but I do not think we need the comment now.

##########
File path: hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/ServerCrashProcedure.java
##########
@@ -426,6 +439,9 @@ private void prepareLogReplay(final MasterProcedureEnv env, final Set<HRegionInf
     MasterFileSystem mfs = env.getMasterServices().getMasterFileSystem();
     AssignmentManager am = env.getMasterServices().getAssignmentManager();
     mfs.prepareLogReplay(this.serverName, regions);
+    // If the master doesn't fail, we'll set this again in SERVER_CRASH_ASSIGN
+    // can we skip doing it here? depends on how fast another observer
+    // needs to see that things were processed since we yield between now and then.

Review comment:
       what's the word on these two cases? could we just set the log split state in SERVER_CRASH_ASSIGN? who else looks at wether or not the logs were processed?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org