Posted to issues@ozone.apache.org by GitBox <gi...@apache.org> on 2021/08/31 03:47:47 UTC

[GitHub] [ozone] ChenSammi opened a new pull request #2598: HDDS-5700. Improve LOG message of decommission progress.

ChenSammi opened a new pull request #2598:
URL: https://github.com/apache/ozone/pull/2598


   https://issues.apache.org/jira/browse/HDDS-5700


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sodonnel commented on pull request #2598: HDDS-5700. Improve LOG message of decommission progress.

Posted by GitBox <gi...@apache.org>.
sodonnel commented on pull request #2598:
URL: https://github.com/apache/ozone/pull/2598#issuecomment-912011497


   > @ChenSammi when a DN is decommissioned successfully, do you think we can delete all its replicas by sending a replica delete command from SCM, so that we can free up the space on the DN and clean up the container map of SCM?
   
   I don't think we should remove the containers from the DNs, for three main reasons:
   
   1. Sometimes people decommission nodes for maintenance and then bring them back to the cluster, so it's better if they still have the data in that case.
   2. If there are some bugs in decommission, then it's good that we can recommission the node and know its data is still intact.
   3. If the node needs to be wiped, it's easy to just format the drives or delete the data folders manually later.
   
   However, we should clear the replica references out of SCM. This should happen when the node goes dead, and the dead node handler runs. If that is not the case, then we might have a general problem with dead node handling. 




[GitHub] [ozone] ChenSammi commented on a change in pull request #2598: HDDS-5700. Improve LOG message of decommission progress.

Posted by GitBox <gi...@apache.org>.
ChenSammi commented on a change in pull request #2598:
URL: https://github.com/apache/ozone/pull/2598#discussion_r700706590



##########
File path: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DatanodeAdminMonitorImpl.java
##########
@@ -297,6 +302,14 @@ private boolean checkContainersReplicatedOnNode(DatanodeDetails dn)
     LOG.info("{} has {} sufficientlyReplicated, {} underReplicated and {} " +
         "unhealthy containers",
         dn, sufficientlyReplicated, underReplicated, unhealthy);
+    if (LOG.isDebugEnabled()) {
+      LOG.debug("{} has {} underReplicated [{}] and {} unhealthy [{}] " +
+              "containers", dn, underReplicated,
+          underReplicatedIDs.stream().map(

Review comment:
       I tried String.join, but IntelliJ raised a compile error.






[GitHub] [ozone] ChenSammi merged pull request #2598: HDDS-5700. Improve LOG message of decommission progress.

Posted by GitBox <gi...@apache.org>.
ChenSammi merged pull request #2598:
URL: https://github.com/apache/ozone/pull/2598


   




[GitHub] [ozone] ChenSammi commented on pull request #2598: HDDS-5700. Improve LOG message of decommission progress.

Posted by GitBox <gi...@apache.org>.
ChenSammi commented on pull request #2598:
URL: https://github.com/apache/ozone/pull/2598#issuecomment-912204339


   Thanks @sodonnel  for the code review. 




[GitHub] [ozone] sodonnel commented on pull request #2598: HDDS-5700. Improve LOG message of decommission progress.

Posted by GitBox <gi...@apache.org>.
sodonnel commented on pull request #2598:
URL: https://github.com/apache/ozone/pull/2598#issuecomment-909092383


   I wish we had a better way to list out the Unhealthy / Under replicated containers etc. Feel like we need a basic fsck command for Ozone to be able to debug things like this a bit more easily.
   
   Patch looks mostly OK. I just had one comment about wrapping the lines in "if debug enable" statements.




[GitHub] [ozone] ChenSammi commented on a change in pull request #2598: HDDS-5700. Improve LOG message of decommission progress.

Posted by GitBox <gi...@apache.org>.
ChenSammi commented on a change in pull request #2598:
URL: https://github.com/apache/ozone/pull/2598#discussion_r700706226



##########
File path: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DatanodeAdminMonitorImpl.java
##########
@@ -284,9 +287,11 @@ private boolean checkContainersReplicatedOnNode(DatanodeDetails dn)
         if (replicaSet.isSufficientlyReplicated()) {
           sufficientlyReplicated++;
         } else {
+          underReplicatedIDs.add(cid);

Review comment:
       I added a limit of 10000 as the threshold. From my experience of cluster maintenance, I only care about the leftover under-replicated and unhealthy containers, which usually number from dozens to hundreds, so a limit of 10000 seems sufficient.






[GitHub] [ozone] sodonnel commented on a change in pull request #2598: HDDS-5700. Improve LOG message of decommission progress.

Posted by GitBox <gi...@apache.org>.
sodonnel commented on a change in pull request #2598:
URL: https://github.com/apache/ozone/pull/2598#discussion_r699175854



##########
File path: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DatanodeAdminMonitorImpl.java
##########
@@ -297,6 +302,14 @@ private boolean checkContainersReplicatedOnNode(DatanodeDetails dn)
     LOG.info("{} has {} sufficientlyReplicated, {} underReplicated and {} " +
         "unhealthy containers",
         dn, sufficientlyReplicated, underReplicated, unhealthy);
+    if (LOG.isDebugEnabled()) {
+      LOG.debug("{} has {} underReplicated [{}] and {} unhealthy [{}] " +
+              "containers", dn, underReplicated,
+          underReplicatedIDs.stream().map(

Review comment:
       Actually, this might only work if the list is of type String. I am not 100% sure.
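
A minimal sketch of the concern above: `String.join` accepts only an `Iterable<? extends CharSequence>`, so it compiles for a list of strings but not for a list of other objects. The element type of `underReplicatedIDs` is not visible in this hunk, so `List<Long>` below is a stand-in, and the class and method names are hypothetical. For non-string elements, mapping each one to a string and collecting with `Collectors.joining` is one alternative.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class JoinIdsExample {
  // Joins arbitrary elements by converting each to its string form first.
  static String joinIds(List<?> ids) {
    return ids.stream()
        .map(String::valueOf)
        .collect(Collectors.joining(", "));
  }

  public static void main(String[] args) {
    List<String> names = Arrays.asList("a", "b", "c");
    // Compiles: String implements CharSequence.
    System.out.println(String.join(",", names));

    // Stand-in for a list of container IDs (assumption, not the PR's type).
    List<Long> ids = Arrays.asList(1L, 2L, 3L);
    // String.join(",", ids) would not compile: Long is not a CharSequence.
    System.out.println(joinIds(ids));
  }
}
```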






[GitHub] [ozone] ChenSammi commented on pull request #2598: HDDS-5700. Improve LOG message of decommission progress.

Posted by GitBox <gi...@apache.org>.
ChenSammi commented on pull request #2598:
URL: https://github.com/apache/ozone/pull/2598#issuecomment-909213359


   > I wish we had a better way to list out the Unhealthy / Under replicated containers etc. Feel like we need a basic fsck command for Ozone to be able to debug things like this a bit more easily.
   > 
   > Patch looks mostly OK. I just had one comment about wrapping the lines in "if debug enable" statements.
   
   Thanks @sodonnel, I will address the comments in a new commit. I also agree we should have a better way to display this kind of information. The patch is just an emergency fix, because we have a decommission issue in our cluster and I need to resolve it ASAP.




[GitHub] [ozone] ChenSammi commented on a change in pull request #2598: HDDS-5700. Improve LOG message of decommission progress.

Posted by GitBox <gi...@apache.org>.
ChenSammi commented on a change in pull request #2598:
URL: https://github.com/apache/ozone/pull/2598#discussion_r700706590



##########
File path: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DatanodeAdminMonitorImpl.java
##########
@@ -297,6 +302,14 @@ private boolean checkContainersReplicatedOnNode(DatanodeDetails dn)
     LOG.info("{} has {} sufficientlyReplicated, {} underReplicated and {} " +
         "unhealthy containers",
         dn, sufficientlyReplicated, underReplicated, unhealthy);
+    if (LOG.isDebugEnabled()) {
+      LOG.debug("{} has {} underReplicated [{}] and {} unhealthy [{}] " +
+              "containers", dn, underReplicated,
+          underReplicatedIDs.stream().map(

Review comment:
       I tried String.join. IntelliJ raises a compile error when underReplicatedIDs is used.






[GitHub] [ozone] ChenSammi edited a comment on pull request #2598: HDDS-5700. Improve LOG message of decommission progress.

Posted by GitBox <gi...@apache.org>.
ChenSammi edited a comment on pull request #2598:
URL: https://github.com/apache/ozone/pull/2598#issuecomment-911089256


   @sodonnel, when a DN is decommissioned successfully, do you think we can delete all its replicas by sending a replica delete command from SCM, so that we can free up the space on the DN and clean up the container map of SCM?
   
   




[GitHub] [ozone] sodonnel commented on a change in pull request #2598: HDDS-5700. Improve LOG message of decommission progress.

Posted by GitBox <gi...@apache.org>.
sodonnel commented on a change in pull request #2598:
URL: https://github.com/apache/ozone/pull/2598#discussion_r699176099



##########
File path: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DatanodeAdminMonitorImpl.java
##########
@@ -284,9 +287,11 @@ private boolean checkContainersReplicatedOnNode(DatanodeDetails dn)
         if (replicaSet.isSufficientlyReplicated()) {
           sufficientlyReplicated++;
         } else {
+          underReplicatedIDs.add(cid);
           underReplicated++;
         }
         if (!replicaSet.isHealthy()) {
+          unhealthyIDs.add(cid);

Review comment:
       Same as above for this line.






[GitHub] [ozone] sodonnel commented on a change in pull request #2598: HDDS-5700. Improve LOG message of decommission progress.

Posted by GitBox <gi...@apache.org>.
sodonnel commented on a change in pull request #2598:
URL: https://github.com/apache/ozone/pull/2598#discussion_r699173614



##########
File path: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DatanodeAdminMonitorImpl.java
##########
@@ -284,9 +287,11 @@ private boolean checkContainersReplicatedOnNode(DatanodeDetails dn)
         if (replicaSet.isSufficientlyReplicated()) {
           sufficientlyReplicated++;
         } else {
+          underReplicatedIDs.add(cid);

Review comment:
       I think we should wrap this line in `if (LOG.isDebugEnabled()) {` too: if debug is not enabled, we never use the lists, so there is no point in adding to them.
   
   Also, I wonder if we should set an upper bound on the list? E.g., if there are 200,000 under-replicated containers, is it really useful to log them all? Maybe logging just the first 10 or 100 would be enough?
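
The two suggestions above (collect IDs only when debug logging is on, and cap how many are kept) could be sketched roughly as follows. This is an illustration under stated assumptions, not the code in the PR: `BoundedIdCollector` is a hypothetical helper, the `enabled` flag stands in for `LOG.isDebugEnabled()`, and `long` stands in for the container ID type.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class BoundedIdCollector {
  private final boolean enabled;  // stand-in for LOG.isDebugEnabled()
  private final int limit;        // maximum number of IDs to retain
  private final List<Long> ids = new ArrayList<>();
  private int total;

  BoundedIdCollector(boolean enabled, int limit) {
    this.enabled = enabled;
    this.limit = limit;
  }

  // Always counts the ID, but only stores it when collection is
  // enabled and the cap has not been reached.
  void add(long id) {
    total++;
    if (enabled && ids.size() < limit) {
      ids.add(id);
    }
  }

  int total() {
    return total;
  }

  // Read-only view of the retained IDs (at most `limit` entries).
  List<Long> sample() {
    return Collections.unmodifiableList(ids);
  }

  public static void main(String[] args) {
    BoundedIdCollector underReplicated = new BoundedIdCollector(true, 2);
    for (long id = 1; id <= 5; id++) {
      underReplicated.add(id);
    }
    System.out.println(underReplicated.total() + " underReplicated, first "
        + underReplicated.sample().size() + ": " + underReplicated.sample());
  }
}
```

Keeping the counter separate from the list means the INFO-level totals stay accurate even when the retained ID list is truncated or empty.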

##########
File path: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DatanodeAdminMonitorImpl.java
##########
@@ -297,6 +302,14 @@ private boolean checkContainersReplicatedOnNode(DatanodeDetails dn)
     LOG.info("{} has {} sufficientlyReplicated, {} underReplicated and {} " +
         "unhealthy containers",
         dn, sufficientlyReplicated, underReplicated, unhealthy);
+    if (LOG.isDebugEnabled()) {
+      LOG.debug("{} has {} underReplicated [{}] and {} unhealthy [{}] " +
+              "containers", dn, underReplicated,
+          underReplicatedIDs.stream().map(

Review comment:
       You can do the join more simply with:
   
   ```
   String.join(",", list);
   ```

##########
File path: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DatanodeAdminMonitorImpl.java
##########
@@ -297,6 +302,14 @@ private boolean checkContainersReplicatedOnNode(DatanodeDetails dn)
     LOG.info("{} has {} sufficientlyReplicated, {} underReplicated and {} " +
         "unhealthy containers",
         dn, sufficientlyReplicated, underReplicated, unhealthy);
+    if (LOG.isDebugEnabled()) {
+      LOG.debug("{} has {} underReplicated [{}] and {} unhealthy [{}] " +
+              "containers", dn, underReplicated,
+          underReplicatedIDs.stream().map(

Review comment:
       Actually, this might only work if the list is of type String. I am not 100% sure.

##########
File path: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DatanodeAdminMonitorImpl.java
##########
@@ -284,9 +287,11 @@ private boolean checkContainersReplicatedOnNode(DatanodeDetails dn)
         if (replicaSet.isSufficientlyReplicated()) {
           sufficientlyReplicated++;
         } else {
+          underReplicatedIDs.add(cid);
           underReplicated++;
         }
         if (!replicaSet.isHealthy()) {
+          unhealthyIDs.add(cid);

Review comment:
       Same as above for this line.






[GitHub] [ozone] ChenSammi commented on pull request #2598: HDDS-5700. Improve LOG message of decommission progress.

Posted by GitBox <gi...@apache.org>.
ChenSammi commented on pull request #2598:
URL: https://github.com/apache/ozone/pull/2598#issuecomment-911089256


   @sodonnel, when a DN is decommissioned successfully, do you think we should delete all its replicas by sending a replica delete command from SCM, so that we can free up the space on the DN and clean up the container map of SCM?
   
   Currently, every container still has info for 4 replicas in SCM, even after I turned off the decommissioned DN.




[GitHub] [ozone] sodonnel commented on a change in pull request #2598: HDDS-5700. Improve LOG message of decommission progress.

Posted by GitBox <gi...@apache.org>.
sodonnel commented on a change in pull request #2598:
URL: https://github.com/apache/ozone/pull/2598#discussion_r699174278



##########
File path: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DatanodeAdminMonitorImpl.java
##########
@@ -297,6 +302,14 @@ private boolean checkContainersReplicatedOnNode(DatanodeDetails dn)
     LOG.info("{} has {} sufficientlyReplicated, {} underReplicated and {} " +
         "unhealthy containers",
         dn, sufficientlyReplicated, underReplicated, unhealthy);
+    if (LOG.isDebugEnabled()) {
+      LOG.debug("{} has {} underReplicated [{}] and {} unhealthy [{}] " +
+              "containers", dn, underReplicated,
+          underReplicatedIDs.stream().map(

Review comment:
       You can do the join more simply with:
   
   ```
   String.join(",", list);
   ```






[GitHub] [ozone] ChenSammi commented on pull request #2598: HDDS-5700. Improve LOG message of decommission progress.

Posted by GitBox <gi...@apache.org>.
ChenSammi commented on pull request #2598:
URL: https://github.com/apache/ozone/pull/2598#issuecomment-912204134


   > > @ChenSammi when a DN is decommissioned successfully, do you think we can delete all its replicas by sending a replica delete command from SCM, so that we can free up the space on the DN and clean up the container map of SCM?
   > 
   > I don't think we should remove the containers from the DNs, for three main reasons:
   > 
   >     1. Sometimes people decommission nodes for maintenance and then bring them back to the cluster, so it's better if they still have the data in that case.
   > 
   >     2. If there are some bugs in decommission, then it's good that we can recommission the node and know its data is still intact.
   > 
   >     3. If the node needs to be wiped, it's easy to just format the drives or delete the data folders manually later.
   > 
   
   I see. It makes sense to keep the replica data.
   
   > However, we should clear the replica references out of SCM. This should happen when the node goes dead, and the dead node handler runs. If that is not the case, then we might have a general problem with dead node handling.
   
   My observation is that once the DN is dead, its replicas are all cleared from SCM.




[GitHub] [ozone] ChenSammi edited a comment on pull request #2598: HDDS-5700. Improve LOG message of decommission progress.

Posted by GitBox <gi...@apache.org>.
ChenSammi edited a comment on pull request #2598:
URL: https://github.com/apache/ozone/pull/2598#issuecomment-911089256


   @sodonnel, when a DN is decommissioned successfully, do you think we can delete all its replicas by sending a replica delete command from SCM, so that we can free up the space on the DN and clean up the container map of SCM?
   
   Currently, every container still has info for 4 replicas in SCM, even after I turned off the decommissioned DN.

