You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@cloudstack.apache.org by "Damodar Reddy T (JIRA)" <ji...@apache.org> on 2013/12/19 11:58:07 UTC

[jira] [Commented] (CLOUDSTACK-4740) Some vSphere VMs are shutdown when ACS is restarted

    [ https://issues.apache.org/jira/browse/CLOUDSTACK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13852807#comment-13852807 ] 

Damodar Reddy T commented on CLOUDSTACK-4740:
---------------------------------------------

Hi  ilya musayev,

  I tried to re produce this use case with the following set up and not able to re produce it.

1. ESXi 5.1 and ACS 4.3
2. launched 3 to 4 vms
3. Added a fire wall(iptable rule) in MS machine not to access my vSphere host.

now restarted the management server and all my VMs(including system vms) are in Running state only. and once I removed the firewall restriction on MS machine all went fine.

Is this the correct way to re produce this use case? If not can you please share re producible steps so that I will try once again.

If not can I close this defect saying not re producible?

Thanks & Regards
Damoder

> Some vSphere VMs are shutdown when ACS is restarted
> ---------------------------------------------------
>
>                 Key: CLOUDSTACK-4740
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-4740
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the default.) 
>          Components: Management Server
>    Affects Versions: 4.1.0, 4.1.1, 4.2.0, Future
>         Environment: I'm running ACS 4.1.1 with vSphere 5.1
>            Reporter: ilya musayev
>            Assignee: Damodar Reddy T
>            Priority: Critical
>              Labels: management, poweroff
>             Fix For: 4.3.0
>
>
> If management server is restarted, when management server starts - it checks whether the agentState for vSphere VMs and if it does not get a proper response, it marks them as stopped.
> As the result, some of my virtual instances would shutdown.
> Attempting to analyze this issue further, here are my findings and errors seen in the log.
> 2013-09-25 14:35:49,928 DEBUG [vmware.resource.VmwareResource] (AgentTaskPool-1:null) Detecting a new state but couldn't find a old state so adding it to the changes: i-2-262-acs-docs-fc17
> 2013-09-25 14:35:51,213 DEBUG [agent.transport.Request] (AgentTaskPool-1:null) Seq -1--1: Startup request from directly connected host:  { Cmd , MgmtId: -1, via: -1, Ver: v1, Flags: 11, [{"cpus":16,"speed":2199,"memory":68683468800,"dom0MinMemory":0,"poolSync":false,"vms":{"i-8-270-CLOUD411":{"state":"Running"},"r-15-CLOUD41-OLD":{"state":"Stopped"},"v-260-CLOUD411":{"state":"Running"},"i-2-283-vmbld01l-ops-08":{"state":"Running"},"i-2-104-ACS41VM":{"state":"Running"},"--s-1-CLOUD41-OLD":{"state":"Running"},"i-27-280-CLOUD411":{"state":"Running"},"i-2-285-ossec01l-ops-08":{"state":"Running"},"i-2-262-acs-docs-fc17":{"state":"Stopped"},"i-24-265-test3":{"state":"Running"},"cloud01l-ops-08.portal.webmd.com":{"state":"Running"},"i-2-278-demo01t-ops-08":{"state":"Running"},"s-63-CLOUD411":{"state":"Running"},"r-66-CLOUD411":{"state":"Running"},"i-2-281-acs-appliance":{"state":"Running"}},"caps":"hvm","hypervisorType":"VMware","hostDetails":{"com.cloud.network.Networks.RouterPrivateIpStrategy":"DcGlobal","NativeHA":"true"},"hypervisorVersion":"5.0","type":"Routing","dataCenter":"2","pod":"2","cluster":"3","guid":"HostSystem:host-19716@vc00q-ops-08.portal.webmd.com","name":"vmha62d-ops-08.portal.webmd.com","version":"4.1.1-SNAPSHOT","privateIpAddress":"172.25.243.31","privateMacAddress":"68:b5:99:73:0b:c2","privateNetmask":"255.255.255.0","storageIpAddress":"172.25.243.31","storageNetmask":"255.255.255.0","storageMacAddress":"68:b5:99:73:0b:c2","wait":0},{"totalSize":0,"poolInfo":{"uuid":"72c8aedb-58c4-4569-ac51-adc5af770bf6","host":"vmha62d-ops-08.portal.webmd.com","localPath":"","hostPath":"datastore-19718","poolType":"LVM","capacityBytes":141465485312,"availableBytes":140383354880},"resourceType":"STORAGE_POOL","hostDetails":{},"type":"Storage","dataCenter":"2","pod":"2","cluster":"3","guid":"72c8aedb-58c4-4569-ac51-adc5af770bf6","name":"72c8aedb-58c4-4569-ac51-adc5af770bf6","wait":0}] }
> 2013-09-25 14:35:53,614 DEBUG [cloud.vm.VirtualMachineManagerImpl] (AgentTaskPool-1:null) VM i-2-262-acs-docs-fc17: cs state = Running and realState = Stopped
> 2013-09-25 14:35:53,614 DEBUG [cloud.vm.VirtualMachineManagerImpl] (AgentTaskPool-1:null) VM i-2-262-acs-docs-fc17: cs state = Running and realState = Stopped
> 2013-09-25 14:35:53,614 INFO  [cloud.ha.HighAvailabilityManagerImpl] (AgentTaskPool-1:null) Skip HA for VMware VM i-2-262-acs-docs-fc17
> 2013-09-25 14:35:53,694 DEBUG [agent.transport.Request] (AgentTaskPool-1:null) Seq 11-1418264581: Sending  { Cmd , MgmtId: 345049078181, via: 11, Ver: v1, Flags: 100101, [{"StopCommand":{"isProxy":false,"vmName":"i-2-262-acs-docs-fc17","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-278-demo01t-ops-08","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-281-acs-appliance","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-283-vmbld01l-ops-08","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-285-ossec01l-ops-08","wait":0}}] }
> 2013-09-25 14:35:53,695 DEBUG [agent.transport.Request] (AgentTaskPool-1:null) Seq 11-1418264581: Executing:  { Cmd , MgmtId: 345049078181, via: 11, Ver: v1, Flags: 100101, [{"StopCommand":{"isProxy":false,"vmName":"i-2-262-acs-docs-fc17","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-278-demo01t-ops-08","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-281-acs-appliance","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-283-vmbld01l-ops-08","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-285-ossec01l-ops-08","wait":0}}] }
> 2013-09-25 14:35:53,702 INFO  [vmware.resource.VmwareResource] (DirectAgent-3:vmha62d-ops-08.portal.webmd.com) Executing resource StopCommand: {"isProxy":false,"vmName":"i-2-262-acs-docs-fc17","wait":0}
> 2013-09-25 14:35:53,703 DEBUG [vmware.mo.HostMO] (DirectAgent-3:vmha62d-ops-08.portal.webmd.com) find VM i-2-262-acs-docs-fc17 on host
> 2013-09-25 14:35:54,753 INFO  [vmware.resource.VmwareResource] (DirectAgent-3:vmha62d-ops-08.portal.webmd.com) VM i-2-262-acs-docs-fc17 is already in stopped state
> 2013-09-25 14:37:50,264 DEBUG [agent.transport.Request] (DirectAgent-3:null) Seq 11-1418264581: Processing:  { Ans: , MgmtId: 345049078181, via: 11, Ver: v1, Flags: 100, [{"StopAnswer":{"vncPort":0,"result":true,"details":"VM i-2-262-acs-docs-fc17 is already in stopped state","wait":0}},{"StopAnswer":{"vncPort":0,"result":true,"details":"Stop VM i-2-278-demo01t-ops-08 Succeed","wait":0}},{"StopAnswer":{"vncPort":0,"result":true,"details":"Stop VM i-2-281-acs-appliance Succeed","wait":0}},{"StopAnswer":{"vncPort":0,"result":true,"details":"Stop VM i-2-283-vmbld01l-ops-08 Succeed","wait":0}},{"StopAnswer":{"vncPort":0,"result":true,"details":"Stop VM i-2-285-ossec01l-ops-08 Succeed","wait":0}}] }
> 2013-09-25 14:38:14,528 DEBUG [vmware.mo.HostMO] (DirectAgent-20:vmha62d-ops-08.portal.webmd.com) find VM i-2-262-acs-docs-fc17 on host
> If for whatever reason agentState is stopped for vSphere, we should not force power off the VM, neither do we have to restart it.
> The code that causes the restart is in engine/orchestration/src/com/cloud/vm/VirtualMachineManagerImpl.java,
> lines 2365 - 2376
> In this snippet we do the check if cs state == Running and agentState == Stopped 
> We should not power off the VM if we cannot retrieve the state!
> 2233     protected Command compareState(long hostId, VMInstanceVO vm, final AgentVmInfo info, final boolean fullSync, boolean trackExternalChange) {
> 2234         State agentState = info.state;
> 2235         final State serverState = vm.getState();
> 2236         final String serverName = vm.getInstanceName();
> 2237 
> 2238         Command command = null;
> 2239         s_logger.debug("VM " + serverName + ": cs state = " + serverState + " and realState = " + agentState);
> 2240         if (s_logger.isDebugEnabled()) {
> 2241             s_logger.debug("VM " + serverName + ": cs state = " + serverState + " and realState = " + agentState);
> 2242         }
> 2243 
> 2244         if (agentState == State.Error) {
> 2245             agentState = State.Stopped;
> 2246 
> 2247             short alertType = AlertManager.ALERT_TYPE_USERVM;
> 2248             if (VirtualMachine.Type.DomainRouter.equals(vm.getType())) {
> 2249                 alertType = AlertManager.ALERT_TYPE_DOMAIN_ROUTER;
> 2250             } else if (VirtualMachine.Type.ConsoleProxy.equals(vm.getType())) {
> 2251                 alertType = AlertManager.ALERT_TYPE_CONSOLE_PROXY;
> 2252             } else if (VirtualMachine.Type.SecondaryStorageVm.equals(vm.getType())) {
> 2253                 alertType = AlertManager.ALERT_TYPE_SSVM;
> 2254             }
> 2255 
> 2256             HostPodVO podVO = _podDao.findById(vm.getPodIdToDeployIn());
> 2257             DataCenterVO dcVO = _dcDao.findById(vm.getDataCenterId());
> 2258             HostVO hostVO = _hostDao.findById(vm.getHostId());
> 2259 
> 2260             String hostDesc = "name: " + hostVO.getName() + " (id:" + hostVO.getId() + "), availability zone: " + dcVO.getName() + ", pod: " + podVO.getName();
> 2261             _alertMgr.sendAlert(alertType, vm.getDataCenterId(), vm.getPodIdToDeployIn(), "VM (name: " + vm.getInstanceName() + ", id: " + vm.getId() + ") stopped on host " +
> 2262                                                                                           hostDesc + " due to storage failure",
> 2263                 "Virtual Machine " + vm.getInstanceName() + " (id: " + vm.getId() + ") running on host [" + vm.getHostId() + "] stopped due to storage failure.");
> 2264         }
> 2265 
> 2266         if (trackExternalChange) {
> 2267             if (serverState == State.Starting) {
> 2268                 if (vm.getHostId() != null && vm.getHostId() != hostId) {
> 2269                     s_logger.info("CloudStack is starting VM on host " + vm.getHostId() + ", but status report comes from a different host " + hostId +
> 2270                                   ", skip status sync for vm: " + vm.getInstanceName());
> 2271                     return null;
> 2272                 }
> 2273             }
> 2274             if (vm.getHostId() == null || hostId != vm.getHostId()) {
> 2275                 try {
> 2276                     ItWorkVO workItem = _workDao.findByOutstandingWork(vm.getId(), State.Migrating);
> 2277                     if (workItem == null) {
> 2278                         stateTransitTo(vm, VirtualMachine.Event.AgentReportMigrated, hostId);
> 2279                     }
> 2280                 } catch (NoTransitionException e) {
> 2281                 }
> 2282             }
> 2283         }
> 2284 
> 2285         // during VM migration time, don't sync state will agent status update
> 2286         if (serverState == State.Migrating) {
> 2287             s_logger.debug("Skipping vm in migrating state: " + vm);
> 2288             return null;
> 2289         }
> 2290 
> 2291         if (trackExternalChange) {
> 2292             if (serverState == State.Starting) {
> 2293                 if (vm.getHostId() != null && vm.getHostId() != hostId) {
> 2294                     s_logger.info("CloudStack is starting VM on host " + vm.getHostId() + ", but status report comes from a different host " + hostId +
> 2295                                   ", skip status sync for vm: " + vm.getInstanceName());
> 2296                     return null;
> 2297                 }
> 2298             }
> 2299 
> 2300             if (serverState == State.Running) {
> 2301                 try {
> 2302                     //
> 2303                     // we had a bug that sometimes VM may be at Running State
> 2304                     // but host_id is null, we will cover it here.
> 2305                     // means that when CloudStack DB lost of host information,
> 2306                     // we will heal it with the info reported from host
> 2307                     //
> 2308                     if (vm.getHostId() == null || hostId != vm.getHostId()) {
> 2309                         if (s_logger.isDebugEnabled()) {
> 2310                             s_logger.debug("detected host change when VM " + vm + " is at running state, VM could be live-migrated externally from host " + vm.getHostId() +
> 2311                                            " to host " + hostId);
> 2312                         }
> 2313 
> 2314                         stateTransitTo(vm, VirtualMachine.Event.AgentReportMigrated, hostId);
> 2315                     }
> 2316                 } catch (NoTransitionException e) {
> 2317                     s_logger.warn(e.getMessage());
> 2318                 }
> 2319             }
> 2320         }
> 2321 
> 2322         if (agentState == serverState) {
> 2323             if (s_logger.isDebugEnabled()) {
> 2324                 s_logger.debug("Both states are " + agentState + " for " + vm);
> 2325             }
> 2326             assert (agentState == State.Stopped || agentState == State.Running) : "If the states we send up is changed, this must be changed.";
> 2327             if (agentState == State.Running) {
> 2328                 try {
> 2329                     stateTransitTo(vm, VirtualMachine.Event.AgentReportRunning, hostId);
> 2330                 } catch (NoTransitionException e) {
> 2331                     s_logger.warn(e.getMessage());
> 2332                 }
> 2333                 // FIXME: What if someone comes in and sets it to stopping? Then
> 2334                 // what?
> 2335                 return null;
> 2336             }
> 2337 
> 2338             s_logger.debug("State matches but the agent said stopped so let's send a cleanup command anyways.");
> 2339             return cleanup(vm);
> 2340         }
> 2341 
> 2342         if (agentState == State.Shutdowned) {
> 2343             if (serverState == State.Running || serverState == State.Starting || serverState == State.Stopping) {
> 2344                 try {
> 2345                     advanceStop(vm, true);
> 2346                 } catch (AgentUnavailableException e) {
> 2347                     assert (false) : "How do we hit this with forced on?";
> 2348                     return null;
> 2349                 } catch (OperationTimedoutException e) {
> 2350                     assert (false) : "How do we hit this with forced on?";
> 2351                     return null;
> 2352                 } catch (ConcurrentOperationException e) {
> 2353                     assert (false) : "How do we hit this with forced on?";
> 2354                     return null;
> 2355                 }
> 2356             } else {
> 2357                 s_logger.debug("Sending cleanup to a shutdowned vm: " + vm.getInstanceName());
> 2358                 command = cleanup(vm);
> 2359             }
> 2360         } else if (agentState == State.Stopped) {
> 2361             // This state means the VM on the agent was detected previously
> 2362             // and now is gone. This is slightly different than if the VM
> 2363             // was never completed but we still send down a Stop Command
> 2364             // to ensure there's cleanup.
> 2365             if (serverState == State.Running) {
> 2366                 // Our records showed that it should be running so let's restart
> 2367                 // it.
> 2368                 _haMgr.scheduleRestart(vm, false);
> 2369             } else if (serverState == State.Stopping) {
> 2370                 _haMgr.scheduleStop(vm, hostId, WorkType.ForceStop);
> 2371                 s_logger.debug("Scheduling a check stop for VM in stopping mode: " + vm);
> 2372             } else if (serverState == State.Starting) {
> 2373                 s_logger.debug("Ignoring VM in starting mode: " + vm.getInstanceName());
> 2374                 _haMgr.scheduleRestart(vm, false);
> 2375             }
> 2376             command = cleanup(vm);
> I'd be happy to test the fix if such is comes out.
> Thanks
> ilya



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)