You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@cloudstack.apache.org by "ilya musayev (JIRA)" <ji...@apache.org> on 2013/09/25 23:43:03 UTC
[jira] [Created] (CLOUDSTACK-4740) Some vSphere VMs are shutdown when ACS is restarted

ilya musayev created CLOUDSTACK-4740:
----------------------------------------

             Summary: Some vSphere VMs are shutdown when ACS is restarted
                 Key: CLOUDSTACK-4740
                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-4740
             Project: CloudStack
          Issue Type: Bug
      Security Level: Public (Anyone can view this level - this is the default.)
          Components: Management Server
    Affects Versions: 4.1.0, 4.1.1, 4.2.0, Future
         Environment: I'm running ACS 4.1.1 with vSphere 5.1
            Reporter: ilya musayev
            Priority: Critical
             Fix For: 4.1.1, 4.2.0, Future, 4.1.0


If management server is restarted, when management server starts - it checks whether the agentState for vSphere VMs and if it does not get a proper response, it marks them as stopped.

As the result, some of my virtual instances would shutdown.

Attempting to analyze this issue further, here are my findings and errors seen in the log.

2013-09-25 14:35:49,928 DEBUG [vmware.resource.VmwareResource] (AgentTaskPool-1:null) Detecting a new state but couldn't find a old state so adding it to the changes: i-2-262-acs-docs-fc17
2013-09-25 14:35:51,213 DEBUG [agent.transport.Request] (AgentTaskPool-1:null) Seq -1--1: Startup request from directly connected host:  { Cmd , MgmtId: -1, via: -1, Ver: v1, Flags: 11, [{"cpus":16,"speed":2199,"memory":68683468800,"dom0MinMemory":0,"poolSync":false,"vms":{"i-8-270-CLOUD411":{"state":"Running"},"r-15-CLOUD41-OLD":{"state":"Stopped"},"v-260-CLOUD411":{"state":"Running"},"i-2-283-vmbld01l-ops-08":{"state":"Running"},"i-2-104-ACS41VM":{"state":"Running"},"--s-1-CLOUD41-OLD":{"state":"Running"},"i-27-280-CLOUD411":{"state":"Running"},"i-2-285-ossec01l-ops-08":{"state":"Running"},"i-2-262-acs-docs-fc17":{"state":"Stopped"},"i-24-265-test3":{"state":"Running"},"cloud01l-ops-08.portal.webmd.com":{"state":"Running"},"i-2-278-demo01t-ops-08":{"state":"Running"},"s-63-CLOUD411":{"state":"Running"},"r-66-CLOUD411":{"state":"Running"},"i-2-281-acs-appliance":{"state":"Running"}},"caps":"hvm","hypervisorType":"VMware","hostDetails":{"com.cloud.network.Networks.RouterPrivateIpStrategy":"DcGlobal","NativeHA":"true"},"hypervisorVersion":"5.0","type":"Routing","dataCenter":"2","pod":"2","cluster":"3","guid":"HostSystem:host-19716@vc00q-ops-08.portal.webmd.com","name":"vmha62d-ops-08.portal.webmd.com","version":"4.1.1-SNAPSHOT","privateIpAddress":"172.25.243.31","privateMacAddress":"68:b5:99:73:0b:c2","privateNetmask":"255.255.255.0","storageIpAddress":"172.25.243.31","storageNetmask":"255.255.255.0","storageMacAddress":"68:b5:99:73:0b:c2","wait":0},{"totalSize":0,"poolInfo":{"uuid":"72c8aedb-58c4-4569-ac51-adc5af770bf6","host":"vmha62d-ops-08.portal.webmd.com","localPath":"","hostPath":"datastore-19718","poolType":"LVM","capacityBytes":141465485312,"availableBytes":140383354880},"resourceType":"STORAGE_POOL","hostDetails":{},"type":"Storage","dataCenter":"2","pod":"2","cluster":"3","guid":"72c8aedb-58c4-4569-ac51-adc5af770bf6","name":"72c8aedb-58c4-4569-ac51-adc5af770bf6","wait":0}] }
2013-09-25 14:35:53,614 DEBUG [cloud.vm.VirtualMachineManagerImpl] (AgentTaskPool-1:null) VM i-2-262-acs-docs-fc17: cs state = Running and realState = Stopped
2013-09-25 14:35:53,614 DEBUG [cloud.vm.VirtualMachineManagerImpl] (AgentTaskPool-1:null) VM i-2-262-acs-docs-fc17: cs state = Running and realState = Stopped
2013-09-25 14:35:53,614 INFO  [cloud.ha.HighAvailabilityManagerImpl] (AgentTaskPool-1:null) Skip HA for VMware VM i-2-262-acs-docs-fc17
2013-09-25 14:35:53,694 DEBUG [agent.transport.Request] (AgentTaskPool-1:null) Seq 11-1418264581: Sending  { Cmd , MgmtId: 345049078181, via: 11, Ver: v1, Flags: 100101, [{"StopCommand":{"isProxy":false,"vmName":"i-2-262-acs-docs-fc17","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-278-demo01t-ops-08","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-281-acs-appliance","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-283-vmbld01l-ops-08","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-285-ossec01l-ops-08","wait":0}}] }
2013-09-25 14:35:53,695 DEBUG [agent.transport.Request] (AgentTaskPool-1:null) Seq 11-1418264581: Executing:  { Cmd , MgmtId: 345049078181, via: 11, Ver: v1, Flags: 100101, [{"StopCommand":{"isProxy":false,"vmName":"i-2-262-acs-docs-fc17","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-278-demo01t-ops-08","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-281-acs-appliance","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-283-vmbld01l-ops-08","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-285-ossec01l-ops-08","wait":0}}] }
2013-09-25 14:35:53,702 INFO  [vmware.resource.VmwareResource] (DirectAgent-3:vmha62d-ops-08.portal.webmd.com) Executing resource StopCommand: {"isProxy":false,"vmName":"i-2-262-acs-docs-fc17","wait":0}
2013-09-25 14:35:53,703 DEBUG [vmware.mo.HostMO] (DirectAgent-3:vmha62d-ops-08.portal.webmd.com) find VM i-2-262-acs-docs-fc17 on host
2013-09-25 14:35:54,753 INFO  [vmware.resource.VmwareResource] (DirectAgent-3:vmha62d-ops-08.portal.webmd.com) VM i-2-262-acs-docs-fc17 is already in stopped state
2013-09-25 14:37:50,264 DEBUG [agent.transport.Request] (DirectAgent-3:null) Seq 11-1418264581: Processing:  { Ans: , MgmtId: 345049078181, via: 11, Ver: v1, Flags: 100, [{"StopAnswer":{"vncPort":0,"result":true,"details":"VM i-2-262-acs-docs-fc17 is already in stopped state","wait":0}},{"StopAnswer":{"vncPort":0,"result":true,"details":"Stop VM i-2-278-demo01t-ops-08 Succeed","wait":0}},{"StopAnswer":{"vncPort":0,"result":true,"details":"Stop VM i-2-281-acs-appliance Succeed","wait":0}},{"StopAnswer":{"vncPort":0,"result":true,"details":"Stop VM i-2-283-vmbld01l-ops-08 Succeed","wait":0}},{"StopAnswer":{"vncPort":0,"result":true,"details":"Stop VM i-2-285-ossec01l-ops-08 Succeed","wait":0}}] }
2013-09-25 14:38:14,528 DEBUG [vmware.mo.HostMO] (DirectAgent-20:vmha62d-ops-08.portal.webmd.com) find VM i-2-262-acs-docs-fc17 on host


If for whatever reason agentState is stopped for vSphere, we should not force power off the VM, neither do we have to restart it.

The code that causes the restart is in engine/orchestration/src/com/cloud/vm/VirtualMachineManagerImpl.java,

lines 2365 - 2376

In this snippet we do the check if cs state == Running and agentState == Stopped 

We should not power off the VM if we cannot retrieve the state!


2233     protected Command compareState(long hostId, VMInstanceVO vm, final AgentVmInfo info, final boolean fullSync, boolean trackExternalChange) {
2234         State agentState = info.state;
2235         final State serverState = vm.getState();
2236         final String serverName = vm.getInstanceName();
2237 
2238         Command command = null;
2239         s_logger.debug("VM " + serverName + ": cs state = " + serverState + " and realState = " + agentState);
2240         if (s_logger.isDebugEnabled()) {
2241             s_logger.debug("VM " + serverName + ": cs state = " + serverState + " and realState = " + agentState);
2242         }
2243 
2244         if (agentState == State.Error) {
2245             agentState = State.Stopped;
2246 
2247             short alertType = AlertManager.ALERT_TYPE_USERVM;
2248             if (VirtualMachine.Type.DomainRouter.equals(vm.getType())) {
2249                 alertType = AlertManager.ALERT_TYPE_DOMAIN_ROUTER;
2250             } else if (VirtualMachine.Type.ConsoleProxy.equals(vm.getType())) {
2251                 alertType = AlertManager.ALERT_TYPE_CONSOLE_PROXY;
2252             } else if (VirtualMachine.Type.SecondaryStorageVm.equals(vm.getType())) {
2253                 alertType = AlertManager.ALERT_TYPE_SSVM;
2254             }
2255 
2256             HostPodVO podVO = _podDao.findById(vm.getPodIdToDeployIn());
2257             DataCenterVO dcVO = _dcDao.findById(vm.getDataCenterId());
2258             HostVO hostVO = _hostDao.findById(vm.getHostId());
2259 
2260             String hostDesc = "name: " + hostVO.getName() + " (id:" + hostVO.getId() + "), availability zone: " + dcVO.getName() + ", pod: " + podVO.getName();
2261             _alertMgr.sendAlert(alertType, vm.getDataCenterId(), vm.getPodIdToDeployIn(), "VM (name: " + vm.getInstanceName() + ", id: " + vm.getId() + ") stopped on host " +
2262                                                                                           hostDesc + " due to storage failure",
2263                 "Virtual Machine " + vm.getInstanceName() + " (id: " + vm.getId() + ") running on host [" + vm.getHostId() + "] stopped due to storage failure.");
2264         }
2265 
2266         if (trackExternalChange) {
2267             if (serverState == State.Starting) {
2268                 if (vm.getHostId() != null && vm.getHostId() != hostId) {
2269                     s_logger.info("CloudStack is starting VM on host " + vm.getHostId() + ", but status report comes from a different host " + hostId +
2270                                   ", skip status sync for vm: " + vm.getInstanceName());
2271                     return null;
2272                 }
2273             }
2274             if (vm.getHostId() == null || hostId != vm.getHostId()) {
2275                 try {
2276                     ItWorkVO workItem = _workDao.findByOutstandingWork(vm.getId(), State.Migrating);
2277                     if (workItem == null) {
2278                         stateTransitTo(vm, VirtualMachine.Event.AgentReportMigrated, hostId);
2279                     }
2280                 } catch (NoTransitionException e) {
2281                 }
2282             }
2283         }
2284 
2285         // during VM migration time, don't sync state will agent status update
2286         if (serverState == State.Migrating) {
2287             s_logger.debug("Skipping vm in migrating state: " + vm);
2288             return null;
2289         }
2290 
2291         if (trackExternalChange) {
2292             if (serverState == State.Starting) {
2293                 if (vm.getHostId() != null && vm.getHostId() != hostId) {
2294                     s_logger.info("CloudStack is starting VM on host " + vm.getHostId() + ", but status report comes from a different host " + hostId +
2295                                   ", skip status sync for vm: " + vm.getInstanceName());
2296                     return null;
2297                 }
2298             }
2299 
2300             if (serverState == State.Running) {
2301                 try {
2302                     //
2303                     // we had a bug that sometimes VM may be at Running State
2304                     // but host_id is null, we will cover it here.
2305                     // means that when CloudStack DB lost of host information,
2306                     // we will heal it with the info reported from host
2307                     //
2308                     if (vm.getHostId() == null || hostId != vm.getHostId()) {
2309                         if (s_logger.isDebugEnabled()) {
2310                             s_logger.debug("detected host change when VM " + vm + " is at running state, VM could be live-migrated externally from host " + vm.getHostId() +
2311                                            " to host " + hostId);
2312                         }
2313 
2314                         stateTransitTo(vm, VirtualMachine.Event.AgentReportMigrated, hostId);
2315                     }
2316                 } catch (NoTransitionException e) {
2317                     s_logger.warn(e.getMessage());
2318                 }
2319             }
2320         }
2321 
2322         if (agentState == serverState) {
2323             if (s_logger.isDebugEnabled()) {
2324                 s_logger.debug("Both states are " + agentState + " for " + vm);
2325             }
2326             assert (agentState == State.Stopped || agentState == State.Running) : "If the states we send up is changed, this must be changed.";
2327             if (agentState == State.Running) {
2328                 try {
2329                     stateTransitTo(vm, VirtualMachine.Event.AgentReportRunning, hostId);
2330                 } catch (NoTransitionException e) {
2331                     s_logger.warn(e.getMessage());
2332                 }
2333                 // FIXME: What if someone comes in and sets it to stopping? Then
2334                 // what?
2335                 return null;
2336             }
2337 
2338             s_logger.debug("State matches but the agent said stopped so let's send a cleanup command anyways.");
2339             return cleanup(vm);
2340         }
2341 
2342         if (agentState == State.Shutdowned) {
2343             if (serverState == State.Running || serverState == State.Starting || serverState == State.Stopping) {
2344                 try {
2345                     advanceStop(vm, true);
2346                 } catch (AgentUnavailableException e) {
2347                     assert (false) : "How do we hit this with forced on?";
2348                     return null;
2349                 } catch (OperationTimedoutException e) {
2350                     assert (false) : "How do we hit this with forced on?";
2351                     return null;
2352                 } catch (ConcurrentOperationException e) {
2353                     assert (false) : "How do we hit this with forced on?";
2354                     return null;
2355                 }
2356             } else {
2357                 s_logger.debug("Sending cleanup to a shutdowned vm: " + vm.getInstanceName());
2358                 command = cleanup(vm);
2359             }
2360         } else if (agentState == State.Stopped) {
2361             // This state means the VM on the agent was detected previously
2362             // and now is gone. This is slightly different than if the VM
2363             // was never completed but we still send down a Stop Command
2364             // to ensure there's cleanup.
2365             if (serverState == State.Running) {
2366                 // Our records showed that it should be running so let's restart
2367                 // it.
2368                 _haMgr.scheduleRestart(vm, false);
2369             } else if (serverState == State.Stopping) {
2370                 _haMgr.scheduleStop(vm, hostId, WorkType.ForceStop);
2371                 s_logger.debug("Scheduling a check stop for VM in stopping mode: " + vm);
2372             } else if (serverState == State.Starting) {
2373                 s_logger.debug("Ignoring VM in starting mode: " + vm.getInstanceName());
2374                 _haMgr.scheduleRestart(vm, false);
2375             }
2376             command = cleanup(vm);


I'd be happy to test the fix if such is comes out.

Thanks
ilya

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira