Posted to issues@ozone.apache.org by GitBox <gi...@apache.org> on 2021/04/27 16:36:21 UTC

[GitHub] [ozone] sodonnel opened a new pull request #2190: HDDS-5153. Decommissioning a dead node should complete immediately

sodonnel opened a new pull request #2190:
URL: https://github.com/apache/ozone/pull/2190


   ## What changes were proposed in this pull request?
   
   If you run the decommission or "enter maintenance" command on a node which is already dead, then it should immediately go to the DECOMMISSIONED or IN_MAINTENANCE state. As the node is already dead, there is no way to replicate its containers in a controlled way, and hence the decommission process does not need to run.
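
   As a rough illustration of the intended behaviour (a minimal sketch, not the patch itself; the class, enum and method names below are hypothetical and do not mirror the real SCM classes):

   ```java
   // Hypothetical sketch: illustrative types only, not the Ozone classes.
   class DeadNodeDecommissionSketch {
     enum Health { HEALTHY, STALE, DEAD }

     enum OpState {
       IN_SERVICE, DECOMMISSIONING, DECOMMISSIONED,
       ENTERING_MAINTENANCE, IN_MAINTENANCE
     }

     /** State a node moves to when a decommission is requested. */
     static OpState startDecommission(Health health) {
       // A dead node cannot be drained gracefully, so go to the end state.
       return health == Health.DEAD
           ? OpState.DECOMMISSIONED
           : OpState.DECOMMISSIONING;
     }

     /** State a node moves to when "enter maintenance" is requested. */
     static OpState startMaintenance(Health health) {
       return health == Health.DEAD
           ? OpState.IN_MAINTENANCE
           : OpState.ENTERING_MAINTENANCE;
     }

     public static void main(String[] args) {
       System.out.println(startDecommission(Health.DEAD));
       System.out.println(startMaintenance(Health.HEALTHY));
     }
   }
   ```

   Only nodes that are not dead go through the intermediate DECOMMISSIONING or entering-maintenance state and the monitored workflow.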
   
   ## What is the link to the Apache JIRA?
   
   https://issues.apache.org/jira/browse/HDDS-5153
   
   ## How was this patch tested?
   
   New unit tests
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] sodonnel commented on pull request #2190: HDDS-5153. Decommissioning a dead node should complete immediately

Posted by GitBox <gi...@apache.org>.

sodonnel commented on pull request #2190:
URL: https://github.com/apache/ozone/pull/2190#issuecomment-828234107


   @GlenGeng The change here is to handle a node which is already dead before decommission starts. This change will make it go to decommissioned immediately and not enter the normal workflow at all.
   
   If a node goes dead while it is decommissioning (i.e., while it is in the workflow), the workflow is aborted via the `shouldContinueWorkflow(...)` method: the node is placed back into IN_SERVICE + DEAD, and SCM then handles it as a dead node.
   
   I don't think a node which goes dead will be stuck as decommissioning forever due to the above check.
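
   The abort path described above could look roughly like the following. This is a hedged sketch: only the name `shouldContinueWorkflow` comes from this discussion, while the `Node` type, the enums and the signature are assumptions made for illustration.

   ```java
   // Hypothetical sketch: only the name shouldContinueWorkflow comes from
   // the discussion; the types and the signature are assumptions.
   class WorkflowAbortSketch {
     enum Health { HEALTHY, DEAD }

     enum OpState { IN_SERVICE, DECOMMISSIONING, ENTERING_MAINTENANCE }

     static final class Node {
       Health health;
       OpState opState;

       Node(Health health, OpState opState) {
         this.health = health;
         this.opState = opState;
       }
     }

     /**
      * Returns false if the node died mid-workflow. In that case the node
      * is put back to IN_SERVICE so it is handled as an ordinary dead node.
      */
     static boolean shouldContinueWorkflow(Node node) {
       if (node.health == Health.DEAD) {
         node.opState = OpState.IN_SERVICE;
         return false;
       }
       return true;
     }

     public static void main(String[] args) {
       Node node = new Node(Health.DEAD, OpState.DECOMMISSIONING);
       System.out.println(shouldContinueWorkflow(node) + " " + node.opState);
     }
   }
   ```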



[GitHub] [ozone] fapifta commented on pull request #2190: HDDS-5153. Decommissioning a dead node should complete immediately

Posted by GitBox <gi...@apache.org>.

fapifta commented on pull request #2190:
URL: https://github.com/apache/ozone/pull/2190#issuecomment-828240771


   @GlenGeng I think that if a node is already dead when we start decommission and it goes straight to DECOMMISSIONED, then the check is skipped because of this condition: `if (status.isDecommissioning() || status.isEnteringMaintenance())`. As the node never enters the replication workflow, we are fine: container replication for an already dead node has to be handled when the node goes dead, and we cannot do anything with a dead node in the decommission flow anyway.
   
   @sodonnel I think the changes proposed are good, +1



[GitHub] [ozone] GlenGeng commented on pull request #2190: HDDS-5153. Decommissioning a dead node should complete immediately

Posted by GitBox <gi...@apache.org>.

GlenGeng commented on pull request #2190:
URL: https://github.com/apache/ozone/pull/2190#issuecomment-828255084


   Thanks @sodonnel and @fapifta for the explanation. The change LGTM, and it has my +1.
   
   Since we have `shouldContinueWorkflow()`, a dead node in START_MAINTENANCE or DECOMMISSIONING will abort the workflow directly. I wonder which situation this fix applies to?



[GitHub] [ozone] sodonnel merged pull request #2190: HDDS-5153. Decommissioning a dead node should complete immediately

Posted by GitBox <gi...@apache.org>.

sodonnel merged pull request #2190:
URL: https://github.com/apache/ozone/pull/2190


   



[GitHub] [ozone] sodonnel commented on pull request #2190: HDDS-5153. Decommissioning a dead node should complete immediately

Posted by GitBox <gi...@apache.org>.

sodonnel commented on pull request #2190:
URL: https://github.com/apache/ozone/pull/2190#issuecomment-828360237


   @GlenGeng
   
   > I wonder which situation this fix applies to?
   
   The fix is for the case where someone decommissions an already dead node or puts it into maintenance. In that case we cannot do it gracefully, and the node will already be handled as dead in SCM. Therefore we treat it as a no-op and move it directly to the end state. There is no point in adding it to the monitor and tracking it there. This also follows what HDFS does in the same scenario.
   
   Glen also mentioned on Slack:
   
   > What if there is an expiry time attached to the maintenance command? If we don't track the node in the monitor, then how can we expire maintenance?
   
   This is somewhat of an edge case. However, the solution (as already implemented in the code) is to set the maintenance end time to 0 (no end time). The only way to get the node back to IN_SERVICE is then to recommission it, which is the same as for a decommissioned node.
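
   A minimal sketch of that rule, assuming the end time is stored as epoch seconds and the requested duration is given in hours (both are assumptions for illustration, as are all names below):

   ```java
   // Hypothetical sketch: names, units and signatures are assumptions.
   class MaintenanceExpirySketch {
     static final long NO_END_TIME = 0L;

     /** End time to record when a node is put into maintenance. */
     static long maintenanceEnd(boolean nodeIsDead, long nowSeconds,
         int durationHours) {
       // A dead node is not tracked by the monitor, so nothing could
       // expire its maintenance: record "no end time" instead.
       if (nodeIsDead || durationHours <= 0) {
         return NO_END_TIME;
       }
       return nowSeconds + durationHours * 3600L;
     }

     /** An end time of 0 never expires; recommissioning is required. */
     static boolean isExpired(long endSeconds, long nowSeconds) {
       return endSeconds != NO_END_TIME && nowSeconds >= endSeconds;
     }

     public static void main(String[] args) {
       long end = maintenanceEnd(true, 1_000L, 2);
       System.out.println(end + " " + isExpired(end, Long.MAX_VALUE));
     }
   }
   ```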



[GitHub] [ozone] GlenGeng commented on pull request #2190: HDDS-5153. Decommissioning a dead node should complete immediately

Posted by GitBox <gi...@apache.org>.

GlenGeng commented on pull request #2190:
URL: https://github.com/apache/ozone/pull/2190#issuecomment-828115989


   Hey @sodonnel. I have one concern. Can the check `checkContainersReplicatedOnNode` be safely ignored for a dead node? Are there corner cases?
   
   ```
           if (status.isDecommissioning() || status.isEnteringMaintenance()) {
             if (checkPipelinesClosedOnNode(dn)
                 // Ensure the DN has received and persisted the current maint
                 // state.
                 && status.getOperationalState()
                     == dn.getPersistedOpState()
                 && checkContainersReplicatedOnNode(dn)) {
   ``` 
   
   Another option is to skip the check
   ```
                 && status.getOperationalState()
                     == dn.getPersistedOpState()
   ```
   for a dead node: its persisted state can no longer change, so with this check in place a dead node would be stuck in the queue. We would still keep the check for the replicas on that dead node. What do you think?
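
   To make this alternative concrete, here is a hedged sketch of the relaxed condition. It illustrates the suggestion only, not the behaviour that was merged, and the boolean parameters stand in for the real checks quoted above.

   ```java
   // Hypothetical sketch of the suggested alternative; the parameters
   // stand in for the results of the real checks.
   class CompletionCheckSketch {
     static boolean canComplete(boolean nodeIsDead, boolean pipelinesClosed,
         boolean persistedStateMatches, boolean containersReplicated) {
       // A dead node can never persist the new state, so skip that check
       // for it, but still require its containers to be replicated.
       boolean stateAcknowledged = nodeIsDead || persistedStateMatches;
       return pipelinesClosed && stateAcknowledged && containersReplicated;
     }

     public static void main(String[] args) {
       System.out.println(canComplete(true, true, false, true));
     }
   }
   ```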



[GitHub] [ozone] GlenGeng commented on pull request #2190: HDDS-5153. Decommissioning a dead node should complete immediately

Posted by GitBox <gi...@apache.org>.

GlenGeng commented on pull request #2190:
URL: https://github.com/apache/ozone/pull/2190#issuecomment-828256794


   > I wonder which situation this fix applies to?
   
   It seems the problem is that a dead node cannot go through the workflow, and thus cannot enter IN_MAINTENANCE or DECOMMISSIONED?

