You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@ambari.apache.org by GitBox <gi...@apache.org> on 2021/02/22 17:24:38 UTC

[GitHub] [ambari] ramkrish86 opened a new pull request #3287: AMBARI-25620 Ambari Server STOP command on a node might fail because …

ramkrish86 opened a new pull request #3287:
URL: https://github.com/apache/ambari/pull/3287


   …of a decommissioned NM's behaviour
   
   ## What changes were proposed in this pull request?
   
   As part of our use case, before we STOP any set of components in a node, if we see a NodeManager and Datanode in that node, we first Decommission them and then issue STOP components to all the components in the node.
   
   DataNodes once decommissioned are not STOPPED but they are alive as a process. Unless we stop them. Whereas RM stops the NM on decommission. (this is a known behaviour). But what happens is that the RecoveryManager in ambari keeps restarting the service thinking its DESIRED state (in the agent side) is STARTED. So the restart keeps happening. So the state changes between STARTED <-> INSTALLED on the agent side and once this happens we communicate the Component status to the server side.
   On receiving this update the server sets the STATE as STARTED/INSTALLED as the case may be.
   Now coming back to the actual STOP command request that we gave, as per design in ambari server once all the component updates are sent, it processes them in batch and tries to do the in-memory transition of STATES on the server side (not the cache but the FSM (state machine transition). Here the event is INSTALL/STOP event for NM that the server is expecting but instead of getting an INSTALLED state it gets STARTED state. The reason as highlighted above. So the entire STOP command gets aborted by the server thinking there is some problem in what it sees.
   /Multimap is analog of Map<Object, List<Object>> but allows to avoid nested loop
           ListMultimap<String, ServiceComponentHostEvent> eventMap = formEventMap(stage, commandsToStart);
           Map<ExecutionCommand, String> commandsToAbort = new HashMap<>();
           if (!eventMap.isEmpty()) {
             LOG.debug("==> processing {} serviceComponentHostEvents...", eventMap.size());
             Cluster cluster = clusters.getCluster(stage.getClusterName());
             if (cluster != null) {
               Map<ServiceComponentHostEvent, String> failedEvents = cluster.processServiceComponentHostEvents(eventMap);
   
               if (failedEvents.size() > 0) {
                 LOG.error("==> {} events failed.", failedEvents.size());
               }
   
               for (Iterator<ExecutionCommand> iterator = commandsToUpdate.iterator(); iterator.hasNext(); ) {
                 ExecutionCommand cmd = iterator.next();
                 for (ServiceComponentHostEvent event : failedEvents.keySet()) {
                   if (StringUtils.equals(event.getHostName(), cmd.getHostname()) &&
                     StringUtils.equals(event.getServiceComponentName(), cmd.getRole())) {
                     iterator.remove();
                     commandsToAbort.put(cmd, failedEvents.get(event));
                     break;
                   }
                 }
               }
   Check the processServiceComponentHostEvents() for the way the transition happens and what is the Invalid Transition that happens over there. The log msg would be like this
   org.apache.ambari.server.state.fsm.InvalidStateTransitionException: Invalid event: HOST_SVCCOMP_INSTALL at STARTED
   Since this entire set of STOP component is considered as a FAILURe, we issue ABORT command and hence all the STOP command issued to the agent are aborted.
   This makes the DN to stay in the STARTED state itself and hence the remaining DELETE HOST command keeps failing.
   The idea is to ensure that for NM if decommissioned and the current state is STARTED for a HOST_SVCCOMP_INSTALL then mark it as not a failure condition so that the commands are not aborted.
   ## How was this patch tested?
   
   (Please explain how this patch was tested. Ex: unit tests, manual tests)
   The patch was tested with a scale down feature of more than 20 nodes at one shot. 
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [ambari] hapylestat commented on a change in pull request #3287: AMBARI-25620 Ambari Server STOP command on a node might fail because …

Posted by GitBox <gi...@apache.org>.
hapylestat commented on a change in pull request #3287:
URL: https://github.com/apache/ambari/pull/3287#discussion_r581442087



##########
File path: ambari-server/src/main/java/org/apache/ambari/server/state/cluster/ClusterImpl.java
##########
@@ -2199,6 +2201,18 @@ public Long getNextConfigVersion(String type) {
         Enum<?> currentState = e.getCurrentState();
         Enum<?> failedEvent = e.getEvent();
 
+        if (serviceComponent != null && serviceComponentHost != null
+            && (currentState == State.STARTED || currentState == State.STARTING)
+            && failedEvent == ServiceComponentHostEventType.HOST_SVCCOMP_INSTALL
+            && serviceComponentName.equals("NODEMANAGER") && serviceComponentHost

Review comment:
       The logic should be generalized without special treatment for any pre-defined component




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [ambari] ramkrish86 commented on a change in pull request #3287: AMBARI-25620 Ambari Server STOP command on a node might fail because …

Posted by GitBox <gi...@apache.org>.
ramkrish86 commented on a change in pull request #3287:
URL: https://github.com/apache/ambari/pull/3287#discussion_r582766125



##########
File path: ambari-server/src/main/java/org/apache/ambari/server/state/cluster/ClusterImpl.java
##########
@@ -2199,6 +2201,18 @@ public Long getNextConfigVersion(String type) {
         Enum<?> currentState = e.getCurrentState();
         Enum<?> failedEvent = e.getEvent();
 
+        if (serviceComponent != null && serviceComponentHost != null
+            && (currentState == State.STARTED || currentState == State.STARTING)
+            && failedEvent == ServiceComponentHostEventType.HOST_SVCCOMP_INSTALL
+            && serviceComponentName.equals("NODEMANAGER") && serviceComponentHost

Review comment:
       @hapylestat 
   Can you have a look at the patch now?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [ambari] ramkrish86 commented on a change in pull request #3287: AMBARI-25620 Ambari Server STOP command on a node might fail because …

Posted by GitBox <gi...@apache.org>.
ramkrish86 commented on a change in pull request #3287:
URL: https://github.com/apache/ambari/pull/3287#discussion_r583377252



##########
File path: ambari-server/src/main/java/org/apache/ambari/server/state/cluster/ClusterImpl.java
##########
@@ -2199,6 +2201,18 @@ public Long getNextConfigVersion(String type) {
         Enum<?> currentState = e.getCurrentState();
         Enum<?> failedEvent = e.getEvent();
 
+        if (serviceComponent != null && serviceComponentHost != null
+            && (currentState == State.STARTED || currentState == State.STARTING)
+            && failedEvent == ServiceComponentHostEventType.HOST_SVCCOMP_INSTALL
+            && serviceComponentName.equals("NODEMANAGER") && serviceComponentHost

Review comment:
       @hapylestat 
   Ping for reviews. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [ambari] ramkrish86 commented on a change in pull request #3287: AMBARI-25620 Ambari Server STOP command on a node might fail because …

Posted by GitBox <gi...@apache.org>.
ramkrish86 commented on a change in pull request #3287:
URL: https://github.com/apache/ambari/pull/3287#discussion_r581614737



##########
File path: ambari-server/src/main/java/org/apache/ambari/server/state/cluster/ClusterImpl.java
##########
@@ -2199,6 +2201,18 @@ public Long getNextConfigVersion(String type) {
         Enum<?> currentState = e.getCurrentState();
         Enum<?> failedEvent = e.getEvent();
 
+        if (serviceComponent != null && serviceComponentHost != null
+            && (currentState == State.STARTED || currentState == State.STARTING)
+            && failedEvent == ServiceComponentHostEventType.HOST_SVCCOMP_INSTALL
+            && serviceComponentName.equals("NODEMANAGER") && serviceComponentHost

Review comment:
       Ok then removing NM from this logic should be good enough for us. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [ambari] ramkrish86 commented on pull request #3287: AMBARI-25620 Ambari Server STOP command on a node might fail because …

Posted by GitBox <gi...@apache.org>.
ramkrish86 commented on pull request #3287:
URL: https://github.com/apache/ambari/pull/3287#issuecomment-797608497


   @hapylestat and @aonishuk  - Gentle ping.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [ambari] ramkrish86 commented on a change in pull request #3287: AMBARI-25620 Ambari Server STOP command on a node might fail because …

Posted by GitBox <gi...@apache.org>.
ramkrish86 commented on a change in pull request #3287:
URL: https://github.com/apache/ambari/pull/3287#discussion_r581616267



##########
File path: ambari-server/src/main/java/org/apache/ambari/server/state/cluster/ClusterImpl.java
##########
@@ -2199,6 +2201,18 @@ public Long getNextConfigVersion(String type) {
         Enum<?> currentState = e.getCurrentState();
         Enum<?> failedEvent = e.getEvent();
 
+        if (serviceComponent != null && serviceComponentHost != null
+            && (currentState == State.STARTED || currentState == State.STARTING)
+            && failedEvent == ServiceComponentHostEventType.HOST_SVCCOMP_INSTALL
+            && serviceComponentName.equals("NODEMANAGER") && serviceComponentHost

Review comment:
       @hapylestat - I have removed the treatment for NM specifically but overall retained the check with the state and the failedEvent as is. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [ambari] surajj-naik commented on pull request #3287: AMBARI-25620 Ambari Server STOP command on a node might fail because …

Posted by GitBox <gi...@apache.org>.
surajj-naik commented on pull request #3287:
URL: https://github.com/apache/ambari/pull/3287#issuecomment-788013226


   LGTM +1. Required in case where we are scaling down, especially since Nodemanager is a special scenario. Good to get Andrew's review on this one.
   Ping @aonishuk 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org