You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ozone.apache.org by GitBox <gi...@apache.org> on 2021/06/29 07:21:59 UTC

[GitHub] [ozone] ChenSammi opened a new pull request #2377: HDDS-5296. Replication Manager should not sent replica command to DN which is in IN_MAINTENANCE or ENTERING_MAINTENANCE opState

ChenSammi opened a new pull request #2377:
URL: https://github.com/apache/ozone/pull/2377


   https://issues.apache.org/jira/browse/HDDS-5296


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sodonnel commented on a change in pull request #2377: HDDS-5296. Replication Manager should not sent replica command to DN which is not in IN_SERVICE opState

Posted by GitBox <gi...@apache.org>.
sodonnel commented on a change in pull request #2377:
URL: https://github.com/apache/ozone/pull/2377#discussion_r669449926



##########
File path: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/placement/algorithms/SCMContainerPlacementRackAware.java
##########
@@ -275,28 +276,38 @@ private Node chooseNode(List<Node> excludedNodes, Node affinityNode,
         throw new SCMException("No satisfied datanode to meet the" +
             " excludedNodes and affinityNode constrains.", null);
       }
-      if (super.hasEnoughSpace((DatanodeDetails)node, sizeRequired)) {
-        LOG.debug("Datanode {} is chosen. Required size is {}",
-            node.toString(), sizeRequired);
-        metrics.incrDatanodeChooseSuccessCount();
-        if (isFallbacked) {
-          metrics.incrDatanodeChooseFallbackCount();
-        }
-        return node;
+
+      DatanodeDetails datanodeDetails = (DatanodeDetails)node;
+      DatanodeInfo datanodeInfo = (DatanodeInfo)getNodeManager()
+          .getNodeByUuid(datanodeDetails.getUuidString());
+      if (datanodeInfo == null) {
+        LOG.error("Failed to find the DatanodeInfo for datanode {}",
+            datanodeDetails);

Review comment:
       Ah, I missed the call to `hasEnoughSpace`. OK - I think this is good now.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sodonnel commented on a change in pull request #2377: HDDS-5296. Replication Manager should not sent replica command to DN which is not in IN_SERVICE opState

Posted by GitBox <gi...@apache.org>.
sodonnel commented on a change in pull request #2377:
URL: https://github.com/apache/ozone/pull/2377#discussion_r666399176



##########
File path: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/states/NodeStateMap.java
##########
@@ -79,7 +86,10 @@ public void addNode(DatanodeDetails datanodeDetails, NodeStatus nodeStatus)
       if (nodeMap.containsKey(id)) {
         throw new NodeAlreadyExistsException("Node UUID: " + id);
       }
-      nodeMap.put(id, new DatanodeInfo(datanodeDetails, nodeStatus));
+      DatanodeInfo datanodeInfo = new DatanodeInfo(datanodeDetails, nodeStatus);
+      // Make sure nodes in topology tree has all DN info.
+      clusterMap.add(datanodeInfo);
+      nodeMap.put(id, datanodeInfo);

Review comment:
       SCMContainerPlacementRackAware already has a NodeManager instance passed into it. I think this change would be cleaner and safer if we just used NodeManager to lookup the NodeStatus. Then all the locks are obeyed and DatanodeInfo is stored only in a single place.
   
   ~~The problem then is that the NetworkToplogy instance holds only a Node instance. But we need a DatanodeDetails instance (which extends Node) to query NodeManager to get the NodeStatus.~~
   
   I checked further and the NetworkTopology is given a DatanodeDetails object on registration as the "node" interface is cast back to DatanodeDetails when the list of picked nodes is returned. So I think we can just do:
   
   ```
   getNodeManager().getNodeStatus((DatanodeDetails) firstNode);
   ```
   Rather than changing what is stored in the NetworkTopology object at the moment. In general, there will be few nodes that are not IN_SERVICE, so it should not need many lookups.
   
   However this causes me to ask - should we remove a dead node from the topology so it is never selected via the DeadNodeHandler and then add it back via the NonHealthyToHealthyNodeHandler?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sodonnel commented on a change in pull request #2377: HDDS-5296. Replication Manager should not sent replica command to DN which is not in IN_SERVICE opState

Posted by GitBox <gi...@apache.org>.
sodonnel commented on a change in pull request #2377:
URL: https://github.com/apache/ozone/pull/2377#discussion_r660748852



##########
File path: hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/node/states/TestNodeStateMap.java
##########
@@ -127,26 +129,26 @@ public void testGetNodeMethodsReturnCorrectCountsAndStates()
    */
   @Test
   public void testConcurrency() throws Exception {
-    NodeStateMap nodeStateMap = new NodeStateMap();
+    //NodeStateMap nodeStateMap = new NodeStateMap();

Review comment:
       This commented line should probably be removed, however I am not sure if any of these NodeStateMap changes are needed, as I am not sure if we need to add the node to the topology map in NodeStateMap, when it already happens in the register method.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sodonnel commented on a change in pull request #2377: HDDS-5296. Replication Manager should not sent replica command to DN which is not in IN_SERVICE opState

Posted by GitBox <gi...@apache.org>.
sodonnel commented on a change in pull request #2377:
URL: https://github.com/apache/ozone/pull/2377#discussion_r662070011



##########
File path: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/states/NodeStateMap.java
##########
@@ -79,7 +86,10 @@ public void addNode(DatanodeDetails datanodeDetails, NodeStatus nodeStatus)
       if (nodeMap.containsKey(id)) {
         throw new NodeAlreadyExistsException("Node UUID: " + id);
       }
-      nodeMap.put(id, new DatanodeInfo(datanodeDetails, nodeStatus));
+      DatanodeInfo datanodeInfo = new DatanodeInfo(datanodeDetails, nodeStatus);
+      // Make sure nodes in topology tree has all DN info.
+      clusterMap.add(datanodeInfo);
+      nodeMap.put(id, datanodeInfo);

Review comment:
       I'm not sure about this. At the moment all calls to getNodeStatus (and all things related to datanodeInfo) are protected by a lock in nodeStateMap. If we just pass the DatanodeInfo object in here, then it means we are potentially accessing it concurrently, and that could result in hard to predict problems.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] ChenSammi commented on a change in pull request #2377: HDDS-5296. Replication Manager should not sent replica command to DN which is not in IN_SERVICE opState

Posted by GitBox <gi...@apache.org>.
ChenSammi commented on a change in pull request #2377:
URL: https://github.com/apache/ozone/pull/2377#discussion_r661917957



##########
File path: hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/node/states/TestNodeStateMap.java
##########
@@ -127,26 +129,26 @@ public void testGetNodeMethodsReturnCorrectCountsAndStates()
    */
   @Test
   public void testConcurrency() throws Exception {
-    NodeStateMap nodeStateMap = new NodeStateMap();
+    //NodeStateMap nodeStateMap = new NodeStateMap();

Review comment:
       Yes, this line should be removed. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sodonnel commented on a change in pull request #2377: HDDS-5296. Replication Manager should not sent replica command to DN which is not in IN_SERVICE opState

Posted by GitBox <gi...@apache.org>.
sodonnel commented on a change in pull request #2377:
URL: https://github.com/apache/ozone/pull/2377#discussion_r666399176



##########
File path: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/states/NodeStateMap.java
##########
@@ -79,7 +86,10 @@ public void addNode(DatanodeDetails datanodeDetails, NodeStatus nodeStatus)
       if (nodeMap.containsKey(id)) {
         throw new NodeAlreadyExistsException("Node UUID: " + id);
       }
-      nodeMap.put(id, new DatanodeInfo(datanodeDetails, nodeStatus));
+      DatanodeInfo datanodeInfo = new DatanodeInfo(datanodeDetails, nodeStatus);
+      // Make sure nodes in topology tree has all DN info.
+      clusterMap.add(datanodeInfo);
+      nodeMap.put(id, datanodeInfo);

Review comment:
       SCMContainerPlacementRackAware already has a NodeManager instance passed into it. I think this change would be cleaner and safer if we just used NodeManager to lookup the NodeStatus. Then all the locks are obeyed and DatanodeInfo is stored only in a single place.
   
   ~~The problem then is that the NetworkToplogy instance holds only a Node instance. But we need a DatanodeDetails instance (which extends Node) to query NodeManager to get the NodeStatus. ~~
   
   I checked further and the NetworkTopology is given a DatanodeDetails object on registration as the "node" interface is cast back to DatanodeDetails when the list of picked nodes is returned. So I think we can just do:
   
   ```
   getNodeManager().getNodeStatus((DatanodeDetails) firstNode);
   ```
   Rather than changing what is stored in the NetworkTopology object at the moment. In general, there will be few nodes that are not IN_SERVICE, so it should not need many lookups.
   
   However this causes me to ask - should we remove a dead node from the topology so it is never selected via the DeadNodeHandler and then add it back via the NonHealthyToHealthyNodeHandler?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] ChenSammi commented on a change in pull request #2377: HDDS-5296. Replication Manager should not sent replica command to DN which is not in IN_SERVICE opState

Posted by GitBox <gi...@apache.org>.
ChenSammi commented on a change in pull request #2377:
URL: https://github.com/apache/ozone/pull/2377#discussion_r661917012



##########
File path: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/placement/algorithms/SCMContainerPlacementRackAware.java
##########
@@ -275,28 +277,32 @@ private Node chooseNode(List<Node> excludedNodes, Node affinityNode,
         throw new SCMException("No satisfied datanode to meet the" +
             " excludedNodes and affinityNode constrains.", null);
       }
-      if (super.hasEnoughSpace((DatanodeDetails)node, sizeRequired)) {
-        LOG.debug("Datanode {} is chosen. Required size is {}",
+
+      DatanodeInfo datanodeInfo = (DatanodeInfo) node;
+      NodeStatus nodeStatus = datanodeInfo.getNodeStatus();
+      if (nodeStatus.isNodeWritable() &&
+          (super.hasEnoughSpace(datanodeInfo, sizeRequired))) {
+        LOG.debug("Datanode {} is chosen. Required storage size is {} bytes",
             node.toString(), sizeRequired);

Review comment:
       Good point.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] ChenSammi commented on a change in pull request #2377: HDDS-5296. Replication Manager should not sent replica command to DN which is not in IN_SERVICE opState

Posted by GitBox <gi...@apache.org>.
ChenSammi commented on a change in pull request #2377:
URL: https://github.com/apache/ozone/pull/2377#discussion_r669224196



##########
File path: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/placement/algorithms/SCMContainerPlacementRackAware.java
##########
@@ -275,28 +276,38 @@ private Node chooseNode(List<Node> excludedNodes, Node affinityNode,
         throw new SCMException("No satisfied datanode to meet the" +
             " excludedNodes and affinityNode constrains.", null);
       }
-      if (super.hasEnoughSpace((DatanodeDetails)node, sizeRequired)) {
-        LOG.debug("Datanode {} is chosen. Required size is {}",
-            node.toString(), sizeRequired);
-        metrics.incrDatanodeChooseSuccessCount();
-        if (isFallbacked) {
-          metrics.incrDatanodeChooseFallbackCount();
-        }
-        return node;
+
+      DatanodeDetails datanodeDetails = (DatanodeDetails)node;
+      DatanodeInfo datanodeInfo = (DatanodeInfo)getNodeManager()
+          .getNodeByUuid(datanodeDetails.getUuidString());
+      if (datanodeInfo == null) {
+        LOG.error("Failed to find the DatanodeInfo for datanode {}",
+            datanodeDetails);

Review comment:
       @sodonnel ,  it is really because the following super.hasEnoughSpace(datanodeInfo, sizeRequired)) statement needs a DatanodeInfo object as parameter so here getNodeByUuid is called. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sodonnel commented on a change in pull request #2377: HDDS-5296. Replication Manager should not sent replica command to DN which is not in IN_SERVICE opState

Posted by GitBox <gi...@apache.org>.
sodonnel commented on a change in pull request #2377:
URL: https://github.com/apache/ozone/pull/2377#discussion_r666374090



##########
File path: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/states/NodeStateMap.java
##########
@@ -79,7 +86,10 @@ public void addNode(DatanodeDetails datanodeDetails, NodeStatus nodeStatus)
       if (nodeMap.containsKey(id)) {
         throw new NodeAlreadyExistsException("Node UUID: " + id);
       }
-      nodeMap.put(id, new DatanodeInfo(datanodeDetails, nodeStatus));
+      DatanodeInfo datanodeInfo = new DatanodeInfo(datanodeDetails, nodeStatus);
+      // Make sure nodes in topology tree has all DN info.
+      clusterMap.add(datanodeInfo);
+      nodeMap.put(id, datanodeInfo);

Review comment:
       @nandakumar131 @bshashikant What do you guys think about this change (putting the DatanodeInfo into the topology map)? I remember we talked about something like this with decommission, and concluded we were happier if the DatanodeDetails reference was only stored in a single place (nodeManager) and anything that needs it should request it from there. Our fear was that some future change could replace the DatanodeInfo instance stored in one place and not in the other.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] ChenSammi commented on a change in pull request #2377: HDDS-5296. Replication Manager should not sent replica command to DN which is not in IN_SERVICE opState

Posted by GitBox <gi...@apache.org>.
ChenSammi commented on a change in pull request #2377:
URL: https://github.com/apache/ozone/pull/2377#discussion_r668517945



##########
File path: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/states/NodeStateMap.java
##########
@@ -79,7 +86,10 @@ public void addNode(DatanodeDetails datanodeDetails, NodeStatus nodeStatus)
       if (nodeMap.containsKey(id)) {
         throw new NodeAlreadyExistsException("Node UUID: " + id);
       }
-      nodeMap.put(id, new DatanodeInfo(datanodeDetails, nodeStatus));
+      DatanodeInfo datanodeInfo = new DatanodeInfo(datanodeDetails, nodeStatus);
+      // Make sure nodes in topology tree has all DN info.
+      clusterMap.add(datanodeInfo);
+      nodeMap.put(id, datanodeInfo);

Review comment:
       OK,  I wil revert this part of change.  Removing the dead dn from topology seems a good idea. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sodonnel commented on a change in pull request #2377: HDDS-5296. Replication Manager should not sent replica command to DN which is not in IN_SERVICE opState

Posted by GitBox <gi...@apache.org>.
sodonnel commented on a change in pull request #2377:
URL: https://github.com/apache/ozone/pull/2377#discussion_r666844379



##########
File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/commandhandler/ReplicateContainerCommandHandler.java
##########
@@ -60,13 +61,20 @@ public ReplicateContainerCommandHandler(
   @Override
   public void handle(SCMCommand command, OzoneContainer container,
       StateContext context, SCMConnectionManager connectionManager) {
-
     final ReplicateContainerCommand replicateCommand =
         (ReplicateContainerCommand) command;
     final List<DatanodeDetails> sourceDatanodes =
         replicateCommand.getSourceDatanodes();
     final long containerID = replicateCommand.getContainerID();
 
+    DatanodeDetails dn = context.getParent().getDatanodeDetails();
+    if (dn.getPersistedOpState() !=
+        HddsProtos.NodeOperationalState.IN_SERVICE) {
+      LOG.info("Dn is of {} state. Ignore this replicate container command " +
+          "for container {}", dn.getPersistedOpState(), containerID);
+      return;
+    }
+

Review comment:
       I think this needs to go into the supervisor, so that when the task is executed by the executor, it checks to see the DN state and drops the command at that point. We could have a lot of replication commands queued which are past this point, and if the DN switches to a not IN_SERVICE state, ideally we want to prevent processing them all.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sodonnel commented on pull request #2377: HDDS-5296. Replication Manager should not sent replica command to DN which is not in IN_SERVICE opState

Posted by GitBox <gi...@apache.org>.
sodonnel commented on pull request #2377:
URL: https://github.com/apache/ozone/pull/2377#issuecomment-879007921


   Thanks for addressing the review comments @ChenSammi. I think this change looks better now. I just have one more comment inline where you are getting the nodeStatus - I think you should call the existing NodeManager API to get it so it is all protected under a lock.
   
   >  Removing the dead dn from topology seems a good idea.
   
   I think we should do that part under a separate Jira as its a slightly different change.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sodonnel merged pull request #2377: HDDS-5296. Replication Manager should not sent replica command to DN which is not in IN_SERVICE opState

Posted by GitBox <gi...@apache.org>.
sodonnel merged pull request #2377:
URL: https://github.com/apache/ozone/pull/2377


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sodonnel commented on a change in pull request #2377: HDDS-5296. Replication Manager should not sent replica command to DN which is not in IN_SERVICE opState

Posted by GitBox <gi...@apache.org>.
sodonnel commented on a change in pull request #2377:
URL: https://github.com/apache/ozone/pull/2377#discussion_r660744396



##########
File path: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/states/NodeStateMap.java
##########
@@ -79,7 +86,10 @@ public void addNode(DatanodeDetails datanodeDetails, NodeStatus nodeStatus)
       if (nodeMap.containsKey(id)) {
         throw new NodeAlreadyExistsException("Node UUID: " + id);
       }
-      nodeMap.put(id, new DatanodeInfo(datanodeDetails, nodeStatus));
+      DatanodeInfo datanodeInfo = new DatanodeInfo(datanodeDetails, nodeStatus);
+      // Make sure nodes in topology tree has all DN info.
+      clusterMap.add(datanodeInfo);
+      nodeMap.put(id, datanodeInfo);

Review comment:
       The node is already added to the cluster map on registration in `SCMNodeManager.register(...)`:
   
   ```
     @Override
     public RegisteredCommand register(
         DatanodeDetails datanodeDetails, NodeReportProto nodeReport,
         PipelineReportsProto pipelineReportsProto) {
         ...
           networkLocation = nodeResolve(dnsName);
           if (networkLocation != null) {
             datanodeDetails.setNetworkLocation(networkLocation);
           }
   
           clusterMap.add(datanodeDetails);
           nodeStateManager.addNode(datanodeDetails);
       ...
   ```
   
   Why do we need to add it again here?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] ChenSammi commented on pull request #2377: HDDS-5296. Replication Manager should not sent replica command to DN which is not in IN_SERVICE opState

Posted by GitBox <gi...@apache.org>.
ChenSammi commented on pull request #2377:
URL: https://github.com/apache/ozone/pull/2377#issuecomment-871849982


   > 
   > 
   > These changes make sense to me, I have pointed out a couple of small things to check inline.
   > 
   > Another scenario for this problem is:
   > 
   >     1. Replication Manager schedules a lot of replication commands on a host where they are queued.
   > 
   >     2. The host is then decommissioned soon after.
   > 
   > 
   > In that case, all those inflight replications are still going to be executed.
   > 
   > Should we also check the DN state in the DN replication command handler, and if the DN is not IN_SERVICE, ignore them?
   > 
   > You can get the DN state inside `ReplicateContainerCommandHander` via:
   > 
   > ```
   > context.getParent().getDatanodeDetails().getPersistedOpState()
   > ```
   > 
   > So if we passed the context.getParent() into `new ReplicationTask(...)` it would be able to make that check when it is executed.
   
   Right. I have added the  DN state check in  ReplicateContainerCommandHandler#handle.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] ChenSammi commented on pull request #2377: HDDS-5296. Replication Manager should not sent replica command to DN which is not in IN_SERVICE opState

Posted by GitBox <gi...@apache.org>.
ChenSammi commented on pull request #2377:
URL: https://github.com/apache/ozone/pull/2377#issuecomment-871860540


   Thanks @sodonnel for the code review.  A new commit is uploaded to address the comments. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sodonnel commented on a change in pull request #2377: HDDS-5296. Replication Manager should not sent replica command to DN which is not in IN_SERVICE opState

Posted by GitBox <gi...@apache.org>.
sodonnel commented on a change in pull request #2377:
URL: https://github.com/apache/ozone/pull/2377#discussion_r668668853



##########
File path: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/placement/algorithms/SCMContainerPlacementRackAware.java
##########
@@ -275,28 +276,38 @@ private Node chooseNode(List<Node> excludedNodes, Node affinityNode,
         throw new SCMException("No satisfied datanode to meet the" +
             " excludedNodes and affinityNode constrains.", null);
       }
-      if (super.hasEnoughSpace((DatanodeDetails)node, sizeRequired)) {
-        LOG.debug("Datanode {} is chosen. Required size is {}",
-            node.toString(), sizeRequired);
-        metrics.incrDatanodeChooseSuccessCount();
-        if (isFallbacked) {
-          metrics.incrDatanodeChooseFallbackCount();
-        }
-        return node;
+
+      DatanodeDetails datanodeDetails = (DatanodeDetails)node;
+      DatanodeInfo datanodeInfo = (DatanodeInfo)getNodeManager()
+          .getNodeByUuid(datanodeDetails.getUuidString());
+      if (datanodeInfo == null) {
+        LOG.error("Failed to find the DatanodeInfo for datanode {}",
+            datanodeDetails);

Review comment:
       You can get the nodeStatus directly via NodeManager:
   
   ```
     public NodeStatus getNodeStatus(DatanodeDetails datanodeDetails)
         throws NodeNotFoundException {
   ```
   
   Its doing basically what you are doing here, but I think its safer as an immutable NodeStatus object is returned under a lock to you:
   
   ```
     public NodeStatus getNodeStatus(UUID uuid) throws NodeNotFoundException {
       lock.readLock().lock();
       try {
         DatanodeInfo dn = nodeMap.get(uuid);
         if (dn == null) {
           throw new NodeNotFoundException("Node not found in node map." +
               " UUID: " + uuid);
         }
         return dn.getNodeStatus();
       } finally {
         lock.readLock().unlock();
       }
     }
   ```
   
   That means we don't need to worry about any side effects or concurrent modification here at all.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sodonnel commented on a change in pull request #2377: HDDS-5296. Replication Manager should not sent replica command to DN which is not in IN_SERVICE opState

Posted by GitBox <gi...@apache.org>.
sodonnel commented on a change in pull request #2377:
URL: https://github.com/apache/ozone/pull/2377#discussion_r660734798



##########
File path: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/placement/algorithms/SCMContainerPlacementRackAware.java
##########
@@ -275,28 +277,32 @@ private Node chooseNode(List<Node> excludedNodes, Node affinityNode,
         throw new SCMException("No satisfied datanode to meet the" +
             " excludedNodes and affinityNode constrains.", null);
       }
-      if (super.hasEnoughSpace((DatanodeDetails)node, sizeRequired)) {
-        LOG.debug("Datanode {} is chosen. Required size is {}",
+
+      DatanodeInfo datanodeInfo = (DatanodeInfo) node;
+      NodeStatus nodeStatus = datanodeInfo.getNodeStatus();
+      if (nodeStatus.isNodeWritable() &&
+          (super.hasEnoughSpace(datanodeInfo, sizeRequired))) {
+        LOG.debug("Datanode {} is chosen. Required storage size is {} bytes",
             node.toString(), sizeRequired);

Review comment:
       `node.toString()` is going to always be executed even with INFO level logs. This should be changed to just `node` and let the logger call toString on it internally if debug is enabled.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] JacksonYao287 commented on pull request #2377: HDDS-5296. Replication Manager should not sent replica command to DN which is not in IN_SERVICE opState

Posted by GitBox <gi...@apache.org>.
JacksonYao287 commented on pull request #2377:
URL: https://github.com/apache/ozone/pull/2377#issuecomment-879756310


   this patch looks good to me , thanks @ChenSammi  for this work!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] JacksonYao287 commented on pull request #2377: HDDS-5296. Replication Manager should not sent replica command to DN which is not in IN_SERVICE opState

Posted by GitBox <gi...@apache.org>.
JacksonYao287 commented on pull request #2377:
URL: https://github.com/apache/ozone/pull/2377#issuecomment-871887897


   thanks @ChenSammi for this work! the changes looks good to me, but a little question!
   
   if we make a change to 'clusterMap', should we do some more work about `SCMNodeManager# getClusterNetworkTopologyMap`, which will be called by other functions!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] ChenSammi commented on a change in pull request #2377: HDDS-5296. Replication Manager should not sent replica command to DN which is not in IN_SERVICE opState

Posted by GitBox <gi...@apache.org>.
ChenSammi commented on a change in pull request #2377:
URL: https://github.com/apache/ozone/pull/2377#discussion_r661917874



##########
File path: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/states/NodeStateMap.java
##########
@@ -79,7 +86,10 @@ public void addNode(DatanodeDetails datanodeDetails, NodeStatus nodeStatus)
       if (nodeMap.containsKey(id)) {
         throw new NodeAlreadyExistsException("Node UUID: " + id);
       }
-      nodeMap.put(id, new DatanodeInfo(datanodeDetails, nodeStatus));
+      DatanodeInfo datanodeInfo = new DatanodeInfo(datanodeDetails, nodeStatus);
+      // Make sure nodes in topology tree has all DN info.
+      clusterMap.add(datanodeInfo);
+      nodeMap.put(id, datanodeInfo);

Review comment:
       The add to topology action during SCMNodeManager.register(...) is removed in this patch.  
   
   The goal is to make sure the node object added into topology is of DatanodeInfo type, so that when a node is chosen through topology API, this node will have the operation state information already, to save an extra call of NodeManager to get this information.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] ChenSammi commented on a change in pull request #2377: HDDS-5296. Replication Manager should not sent replica command to DN which is not in IN_SERVICE opState

Posted by GitBox <gi...@apache.org>.
ChenSammi commented on a change in pull request #2377:
URL: https://github.com/apache/ozone/pull/2377#discussion_r668516640



##########
File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/commandhandler/ReplicateContainerCommandHandler.java
##########
@@ -60,13 +61,20 @@ public ReplicateContainerCommandHandler(
   @Override
   public void handle(SCMCommand command, OzoneContainer container,
       StateContext context, SCMConnectionManager connectionManager) {
-
     final ReplicateContainerCommand replicateCommand =
         (ReplicateContainerCommand) command;
     final List<DatanodeDetails> sourceDatanodes =
         replicateCommand.getSourceDatanodes();
     final long containerID = replicateCommand.getContainerID();
 
+    DatanodeDetails dn = context.getParent().getDatanodeDetails();
+    if (dn.getPersistedOpState() !=
+        HddsProtos.NodeOperationalState.IN_SERVICE) {
+      LOG.info("Dn is of {} state. Ignore this replicate container command " +
+          "for container {}", dn.getPersistedOpState(), containerID);
+      return;
+    }
+

Review comment:
       Good point.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sodonnel commented on a change in pull request #2377: HDDS-5296. Replication Manager should not sent replica command to DN which is not in IN_SERVICE opState

Posted by GitBox <gi...@apache.org>.
sodonnel commented on a change in pull request #2377:
URL: https://github.com/apache/ozone/pull/2377#discussion_r666399176



##########
File path: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/states/NodeStateMap.java
##########
@@ -79,7 +86,10 @@ public void addNode(DatanodeDetails datanodeDetails, NodeStatus nodeStatus)
       if (nodeMap.containsKey(id)) {
         throw new NodeAlreadyExistsException("Node UUID: " + id);
       }
-      nodeMap.put(id, new DatanodeInfo(datanodeDetails, nodeStatus));
+      DatanodeInfo datanodeInfo = new DatanodeInfo(datanodeDetails, nodeStatus);
+      // Make sure nodes in topology tree has all DN info.
+      clusterMap.add(datanodeInfo);
+      nodeMap.put(id, datanodeInfo);

Review comment:
       SCMContainerPlacementRackAware already has a NodeManager instance passed into it. I think this change would be cleaner and safer if we just used NodeManager to lookup the NodeStatus. Then all the locks are obeyed and DatanodeInfo is stored only in a single place.
   
   The problem then is that the NetworkToplogy instance holds only a Node instance. But we need a DatanodeDetails instance (which extends Node) to query NodeManager to get the NodeStatus.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] ChenSammi commented on pull request #2377: HDDS-5296. Replication Manager should not sent replica command to DN which is not in IN_SERVICE opState

Posted by GitBox <gi...@apache.org>.
ChenSammi commented on pull request #2377:
URL: https://github.com/apache/ozone/pull/2377#issuecomment-879553146


   > 
   > 
   > Thanks for addressing the review comments @ChenSammi. I think this change looks better now. I just have one more comment inline where you are getting the nodeStatus - I think you should call the existing NodeManager API to get it so it is all protected under a lock.
   > 
   > > Removing the dead dn from topology seems a good idea.
   > 
   > I think we should do that part under a separate Jira as its a slightly different change.
   
   HDDS-5437 is created for track.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org