You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ozone.apache.org by "sodonnel (via GitHub)" <gi...@apache.org> on 2023/10/17 15:26:56 UTC

[PR] HDDS-9481. A reformatted datanode node cannot be decommissioned [ozone]

sodonnel opened a new pull request, #5458:
URL: https://github.com/apache/ozone/pull/5458

   ## What changes were proposed in this pull request?
   
   If a datanode is registered to SCM on the cluster, and then it is stopped and the data disks cleared and then it is restarted, it will reconnect to SCM as a new node with a new UUID.
   
   When this happens, the old datanode details are kept in SCM as a dead node and the mapping table which maps DNs running on a host to their UUIDs will contain two entries, leaving the decommission command unable to decide which entry is to be decommissioned, giving this error:
   
   ```
   2023-10-03 08:05:50,279 ERROR [IPC Server handler 25 on 9860]-org.apache.hadoop.hdds.scm.server.SCMClientProtocolServer: Failed to decommission nodes
   org.apache.hadoop.hdds.scm.node.InvalidHostStringException: Host host1.acme.org is running multiple datanodes registered with SCM, but no port numbers match. Please check the port number.
           at org.apache.hadoop.hdds.scm.node.NodeDecommissionManager.mapHostnamesToDatanodes(NodeDecommissionManager.java:151)
           at org.apache.hadoop.hdds.scm.node.NodeDecommissionManager.decommissionNodes(NodeDecommissionManager.java:228)
           at org.apache.hadoop.hdds.scm.server.SCMClientProtocolServer.decommissionNodes(SCMClientProtocolServer.java:624)
           at org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.decommissionNodes(StorageContainerLocationProtocolServerSideTranslatorPB.java:1114)
           at org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.processRequest(StorageContainerLocationProtocolServerSideTranslatorPB.java:602)
           at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
           at org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.submitRequest(StorageContainerLocationProtocolServerSideTranslatorPB.java:221)
           at org.apache.hadoop.hdds.protocol.proto.StorageContainerLocationProtocolProtos$StorageContainerLocationProtocolService$2.callBlockingMethod(StorageContainerLocationProtocolProtos.java)
           at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533)
           at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
           at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:994)
           at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:922)
           at java.base/java.security.AccessController.doPrivileged(Native Method)
           at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
           at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
           at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2899)
   ```
   
   It is valid for multiple DNs to be on the same host, especially on test clusters or mini-clusters. However it is not possible for a DN to be heartbeating from the same host with the same ports.
   
   In this case, where we try to decommission a host, and it has multiple entries from the same host and all the ports are the same for all entries, we can safely decommission the one with the newest heartbeat.
   
   ## What is the link to the Apache JIRA
   
   https://issues.apache.org/jira/browse/HDDS-9481
   
   ## How was this patch tested?
   
   New unit test added to reproduce and validate the fix.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9481. A reformatted datanode node cannot be decommissioned [ozone]

Posted by "ashishkumar50 (via GitHub)" <gi...@apache.org>.
ashishkumar50 commented on PR #5458:
URL: https://github.com/apache/ozone/pull/5458#issuecomment-1768373663

   Thanks for updating patch, LGTM+1


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9481. A reformatted datanode node cannot be decommissioned [ozone]

Posted by "sodonnel (via GitHub)" <gi...@apache.org>.
sodonnel merged PR #5458:
URL: https://github.com/apache/ozone/pull/5458


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9481. A reformatted datanode node cannot be decommissioned [ozone]

Posted by "ashishkumar50 (via GitHub)" <gi...@apache.org>.
ashishkumar50 commented on code in PR #5458:
URL: https://github.com/apache/ozone/pull/5458#discussion_r1363236888


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/NodeDecommissionManager.java:
##########
@@ -139,26 +142,77 @@ private List<DatanodeDetails> mapHostnamesToDatanodes(List<String> hosts)
               + " Please check the port number.");
         }
         results.add(found.get(0));
-      } else if (found.size() > 1) {
-        DatanodeDetails match = null;
-        for (DatanodeDetails dn : found) {
-          if (validateDNPortMatch(host.getPort(), dn)) {
-            match = dn;
-            break;
+      } else {
+        // Here we either have multiple DNs on the same host / IP, and they
+        // should have different ports. Or, we have a case where a DN was
+        // registered from a host, then stopped and formatted, changing its
+        // UUID and registered again. In that case, the ports of all hosts
+        // should be the same, and we should just use the one with the most
+        // recent heartbeat.
+        if (allPortsMatch(found)) {
+          // All ports match, so just use the most recent heartbeat as it is
+          // not possible for a host to have 2 DNs coming from the same port.
+          DatanodeDetails mostRecent = findDnWithMostRecentHeartbeat(found);
+          if (mostRecent == null) {
+            throw new InvalidHostStringException("Host " + host.getRawHostname()
+                + " has multiple datanodes registered with SCM."
+                + " All have identical ports, but none have a newest"
+                + " heartbeat.");
           }
+          results.add(mostRecent);
+        } else {
+          DatanodeDetails match = null;
+          for (DatanodeDetails dn : found) {

Review Comment:
   Here can we have the below possibility? 
   Let's say it matches 3 DNs in which 2 DN are having same port but not the third one.
   And assume` host port `matches the 2 DN.
   In this case it will not able to choose which one to decommission and here also we need `findDnWithMostRecentHeartbeat` among the 2 DN
   Can you consider `host port` also in `allPortsMatch` so that this condition will not arise?



##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/NodeDecommissionManager.java:
##########
@@ -139,26 +142,77 @@ private List<DatanodeDetails> mapHostnamesToDatanodes(List<String> hosts)
               + " Please check the port number.");
         }
         results.add(found.get(0));
-      } else if (found.size() > 1) {
-        DatanodeDetails match = null;
-        for (DatanodeDetails dn : found) {
-          if (validateDNPortMatch(host.getPort(), dn)) {
-            match = dn;
-            break;
+      } else {
+        // Here we either have multiple DNs on the same host / IP, and they
+        // should have different ports. Or, we have a case where a DN was
+        // registered from a host, then stopped and formatted, changing its
+        // UUID and registered again. In that case, the ports of all hosts
+        // should be the same, and we should just use the one with the most
+        // recent heartbeat.
+        if (allPortsMatch(found)) {
+          // All ports match, so just use the most recent heartbeat as it is
+          // not possible for a host to have 2 DNs coming from the same port.
+          DatanodeDetails mostRecent = findDnWithMostRecentHeartbeat(found);
+          if (mostRecent == null) {
+            throw new InvalidHostStringException("Host " + host.getRawHostname()
+                + " has multiple datanodes registered with SCM."
+                + " All have identical ports, but none have a newest"
+                + " heartbeat.");
           }
+          results.add(mostRecent);
+        } else {
+          DatanodeDetails match = null;
+          for (DatanodeDetails dn : found) {
+            if (validateDNPortMatch(host.getPort(), dn)) {
+              match = dn;
+              break;
+            }
+          }
+          if (match == null) {
+            throw new InvalidHostStringException("Host " + host.getRawHostname()
+                + " is running multiple datanodes registered with SCM,"
+                + " but no port numbers match."
+                + " Please check the port number.");
+          }
+          results.add(match);
         }
-        if (match == null) {
-          throw new InvalidHostStringException("Host " + host.getRawHostname()
-              + " is running multiple datanodes registered with SCM,"
-              + " but no port numbers match."
-              + " Please check the port number.");
-        }
-        results.add(match);
       }
     }
     return results;
   }
 
+  public boolean allPortsMatch(List<DatanodeDetails> dns) {
+    if (dns.size() < 2) {
+      return true;

Review Comment:
   Currently this looks like it will not hit for dns with size < 2. We are returning `true` which can call `findDnWithMostRecentHeartbeat` and we are trying to access two dn heartbeats which will fail.



##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/NodeDecommissionManager.java:
##########
@@ -139,26 +142,77 @@ private List<DatanodeDetails> mapHostnamesToDatanodes(List<String> hosts)
               + " Please check the port number.");
         }
         results.add(found.get(0));
-      } else if (found.size() > 1) {
-        DatanodeDetails match = null;
-        for (DatanodeDetails dn : found) {
-          if (validateDNPortMatch(host.getPort(), dn)) {
-            match = dn;
-            break;
+      } else {
+        // Here we either have multiple DNs on the same host / IP, and they
+        // should have different ports. Or, we have a case where a DN was
+        // registered from a host, then stopped and formatted, changing its
+        // UUID and registered again. In that case, the ports of all hosts
+        // should be the same, and we should just use the one with the most
+        // recent heartbeat.
+        if (allPortsMatch(found)) {
+          // All ports match, so just use the most recent heartbeat as it is
+          // not possible for a host to have 2 DNs coming from the same port.
+          DatanodeDetails mostRecent = findDnWithMostRecentHeartbeat(found);
+          if (mostRecent == null) {
+            throw new InvalidHostStringException("Host " + host.getRawHostname()
+                + " has multiple datanodes registered with SCM."
+                + " All have identical ports, but none have a newest"
+                + " heartbeat.");
           }
+          results.add(mostRecent);
+        } else {
+          DatanodeDetails match = null;
+          for (DatanodeDetails dn : found) {
+            if (validateDNPortMatch(host.getPort(), dn)) {
+              match = dn;
+              break;
+            }
+          }
+          if (match == null) {
+            throw new InvalidHostStringException("Host " + host.getRawHostname()
+                + " is running multiple datanodes registered with SCM,"
+                + " but no port numbers match."
+                + " Please check the port number.");
+          }
+          results.add(match);
         }
-        if (match == null) {
-          throw new InvalidHostStringException("Host " + host.getRawHostname()
-              + " is running multiple datanodes registered with SCM,"
-              + " but no port numbers match."
-              + " Please check the port number.");
-        }
-        results.add(match);
       }
     }
     return results;
   }
 
+  public boolean allPortsMatch(List<DatanodeDetails> dns) {

Review Comment:
   Both methods we can change to private.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9481. A reformatted datanode node cannot be decommissioned [ozone]

Posted by "sodonnel (via GitHub)" <gi...@apache.org>.
sodonnel commented on code in PR #5458:
URL: https://github.com/apache/ozone/pull/5458#discussion_r1363476185


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/NodeDecommissionManager.java:
##########
@@ -139,26 +142,77 @@ private List<DatanodeDetails> mapHostnamesToDatanodes(List<String> hosts)
               + " Please check the port number.");
         }
         results.add(found.get(0));
-      } else if (found.size() > 1) {
-        DatanodeDetails match = null;
-        for (DatanodeDetails dn : found) {
-          if (validateDNPortMatch(host.getPort(), dn)) {
-            match = dn;
-            break;
+      } else {
+        // Here we either have multiple DNs on the same host / IP, and they
+        // should have different ports. Or, we have a case where a DN was
+        // registered from a host, then stopped and formatted, changing its
+        // UUID and registered again. In that case, the ports of all hosts
+        // should be the same, and we should just use the one with the most
+        // recent heartbeat.
+        if (allPortsMatch(found)) {
+          // All ports match, so just use the most recent heartbeat as it is
+          // not possible for a host to have 2 DNs coming from the same port.
+          DatanodeDetails mostRecent = findDnWithMostRecentHeartbeat(found);
+          if (mostRecent == null) {
+            throw new InvalidHostStringException("Host " + host.getRawHostname()
+                + " has multiple datanodes registered with SCM."
+                + " All have identical ports, but none have a newest"
+                + " heartbeat.");
           }
+          results.add(mostRecent);
+        } else {
+          DatanodeDetails match = null;
+          for (DatanodeDetails dn : found) {
+            if (validateDNPortMatch(host.getPort(), dn)) {
+              match = dn;
+              break;
+            }
+          }
+          if (match == null) {
+            throw new InvalidHostStringException("Host " + host.getRawHostname()
+                + " is running multiple datanodes registered with SCM,"
+                + " but no port numbers match."
+                + " Please check the port number.");
+          }
+          results.add(match);
         }
-        if (match == null) {
-          throw new InvalidHostStringException("Host " + host.getRawHostname()
-              + " is running multiple datanodes registered with SCM,"
-              + " but no port numbers match."
-              + " Please check the port number.");
-        }
-        results.add(match);
       }
     }
     return results;
   }
 
+  public boolean allPortsMatch(List<DatanodeDetails> dns) {
+    if (dns.size() < 2) {
+      return true;

Review Comment:
   These private methods should never get call with less than 2 DNs, however I have added a guard clause to the findDnWithMostRecentHeartbeat too, so it will handle the cases where an empty or list of size 1 is passed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9481. A reformatted datanode node cannot be decommissioned [ozone]

Posted by "sodonnel (via GitHub)" <gi...@apache.org>.
sodonnel commented on code in PR #5458:
URL: https://github.com/apache/ozone/pull/5458#discussion_r1363626024


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/NodeDecommissionManager.java:
##########
@@ -139,26 +142,77 @@ private List<DatanodeDetails> mapHostnamesToDatanodes(List<String> hosts)
               + " Please check the port number.");
         }
         results.add(found.get(0));
-      } else if (found.size() > 1) {
-        DatanodeDetails match = null;
-        for (DatanodeDetails dn : found) {
-          if (validateDNPortMatch(host.getPort(), dn)) {
-            match = dn;
-            break;
+      } else {
+        // Here we either have multiple DNs on the same host / IP, and they
+        // should have different ports. Or, we have a case where a DN was
+        // registered from a host, then stopped and formatted, changing its
+        // UUID and registered again. In that case, the ports of all hosts
+        // should be the same, and we should just use the one with the most
+        // recent heartbeat.
+        if (allPortsMatch(found)) {
+          // All ports match, so just use the most recent heartbeat as it is
+          // not possible for a host to have 2 DNs coming from the same port.
+          DatanodeDetails mostRecent = findDnWithMostRecentHeartbeat(found);
+          if (mostRecent == null) {
+            throw new InvalidHostStringException("Host " + host.getRawHostname()
+                + " has multiple datanodes registered with SCM."
+                + " All have identical ports, but none have a newest"
+                + " heartbeat.");
           }
+          results.add(mostRecent);
+        } else {
+          DatanodeDetails match = null;
+          for (DatanodeDetails dn : found) {

Review Comment:
   I have added some logic to handle these sort of cases. I think have multiple registrations from the same host and different ports on a "real" cluster would be very unlikely, but we may as well fix it while we are here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9481. A reformatted datanode node cannot be decommissioned [ozone]

Posted by "sodonnel (via GitHub)" <gi...@apache.org>.
sodonnel commented on PR #5458:
URL: https://github.com/apache/ozone/pull/5458#issuecomment-1768534041

   Thanks for reviewing @ashishkumar50 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org