You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@cloudstack.apache.org by "ilya musayev (JIRA)" <ji...@apache.org> on 2013/09/26 00:22:03 UTC
[jira] [Commented] (CLOUDSTACK-4740) Some vSphere VMs are shutdown
when ACS is restarted
[ https://issues.apache.org/jira/browse/CLOUDSTACK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13778166#comment-13778166 ]
ilya musayev commented on CLOUDSTACK-4740:
------------------------------------------
this usually occurs with VMs that dont have latest/support vmware tools installed i assume or the state has not been recorded in ACS for whatever the reason was. The vmware tools are noted as "Running (3rd-party/Independent)" - that were downloaded of from vmware as part of open-vm-tools project.
> Some vSphere VMs are shutdown when ACS is restarted
> ---------------------------------------------------
>
> Key: CLOUDSTACK-4740
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-4740
> Project: CloudStack
> Issue Type: Bug
> Security Level: Public(Anyone can view this level - this is the default.)
> Components: Management Server
> Affects Versions: 4.1.0, 4.1.1, 4.2.0, Future
> Environment: I'm running ACS 4.1.1 with vSphere 5.1
> Reporter: ilya musayev
> Priority: Critical
> Labels: management, poweroff
> Fix For: 4.1.0, 4.1.1, 4.2.0, Future
>
>
> If management server is restarted, when management server starts - it checks whether the agentState for vSphere VMs and if it does not get a proper response, it marks them as stopped.
> As the result, some of my virtual instances would shutdown.
> Attempting to analyze this issue further, here are my findings and errors seen in the log.
> 2013-09-25 14:35:49,928 DEBUG [vmware.resource.VmwareResource] (AgentTaskPool-1:null) Detecting a new state but couldn't find a old state so adding it to the changes: i-2-262-acs-docs-fc17
> 2013-09-25 14:35:51,213 DEBUG [agent.transport.Request] (AgentTaskPool-1:null) Seq -1--1: Startup request from directly connected host: { Cmd , MgmtId: -1, via: -1, Ver: v1, Flags: 11, [{"cpus":16,"speed":2199,"memory":68683468800,"dom0MinMemory":0,"poolSync":false,"vms":{"i-8-270-CLOUD411":{"state":"Running"},"r-15-CLOUD41-OLD":{"state":"Stopped"},"v-260-CLOUD411":{"state":"Running"},"i-2-283-vmbld01l-ops-08":{"state":"Running"},"i-2-104-ACS41VM":{"state":"Running"},"--s-1-CLOUD41-OLD":{"state":"Running"},"i-27-280-CLOUD411":{"state":"Running"},"i-2-285-ossec01l-ops-08":{"state":"Running"},"i-2-262-acs-docs-fc17":{"state":"Stopped"},"i-24-265-test3":{"state":"Running"},"cloud01l-ops-08.portal.webmd.com":{"state":"Running"},"i-2-278-demo01t-ops-08":{"state":"Running"},"s-63-CLOUD411":{"state":"Running"},"r-66-CLOUD411":{"state":"Running"},"i-2-281-acs-appliance":{"state":"Running"}},"caps":"hvm","hypervisorType":"VMware","hostDetails":{"com.cloud.network.Networks.RouterPrivateIpStrategy":"DcGlobal","NativeHA":"true"},"hypervisorVersion":"5.0","type":"Routing","dataCenter":"2","pod":"2","cluster":"3","guid":"HostSystem:host-19716@vc00q-ops-08.portal.webmd.com","name":"vmha62d-ops-08.portal.webmd.com","version":"4.1.1-SNAPSHOT","privateIpAddress":"172.25.243.31","privateMacAddress":"68:b5:99:73:0b:c2","privateNetmask":"255.255.255.0","storageIpAddress":"172.25.243.31","storageNetmask":"255.255.255.0","storageMacAddress":"68:b5:99:73:0b:c2","wait":0},{"totalSize":0,"poolInfo":{"uuid":"72c8aedb-58c4-4569-ac51-adc5af770bf6","host":"vmha62d-ops-08.portal.webmd.com","localPath":"","hostPath":"datastore-19718","poolType":"LVM","capacityBytes":141465485312,"availableBytes":140383354880},"resourceType":"STORAGE_POOL","hostDetails":{},"type":"Storage","dataCenter":"2","pod":"2","cluster":"3","guid":"72c8aedb-58c4-4569-ac51-adc5af770bf6","name":"72c8aedb-58c4-4569-ac51-adc5af770bf6","wait":0}] }
> 2013-09-25 14:35:53,614 DEBUG [cloud.vm.VirtualMachineManagerImpl] (AgentTaskPool-1:null) VM i-2-262-acs-docs-fc17: cs state = Running and realState = Stopped
> 2013-09-25 14:35:53,614 DEBUG [cloud.vm.VirtualMachineManagerImpl] (AgentTaskPool-1:null) VM i-2-262-acs-docs-fc17: cs state = Running and realState = Stopped
> 2013-09-25 14:35:53,614 INFO [cloud.ha.HighAvailabilityManagerImpl] (AgentTaskPool-1:null) Skip HA for VMware VM i-2-262-acs-docs-fc17
> 2013-09-25 14:35:53,694 DEBUG [agent.transport.Request] (AgentTaskPool-1:null) Seq 11-1418264581: Sending { Cmd , MgmtId: 345049078181, via: 11, Ver: v1, Flags: 100101, [{"StopCommand":{"isProxy":false,"vmName":"i-2-262-acs-docs-fc17","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-278-demo01t-ops-08","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-281-acs-appliance","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-283-vmbld01l-ops-08","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-285-ossec01l-ops-08","wait":0}}] }
> 2013-09-25 14:35:53,695 DEBUG [agent.transport.Request] (AgentTaskPool-1:null) Seq 11-1418264581: Executing: { Cmd , MgmtId: 345049078181, via: 11, Ver: v1, Flags: 100101, [{"StopCommand":{"isProxy":false,"vmName":"i-2-262-acs-docs-fc17","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-278-demo01t-ops-08","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-281-acs-appliance","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-283-vmbld01l-ops-08","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-285-ossec01l-ops-08","wait":0}}] }
> 2013-09-25 14:35:53,702 INFO [vmware.resource.VmwareResource] (DirectAgent-3:vmha62d-ops-08.portal.webmd.com) Executing resource StopCommand: {"isProxy":false,"vmName":"i-2-262-acs-docs-fc17","wait":0}
> 2013-09-25 14:35:53,703 DEBUG [vmware.mo.HostMO] (DirectAgent-3:vmha62d-ops-08.portal.webmd.com) find VM i-2-262-acs-docs-fc17 on host
> 2013-09-25 14:35:54,753 INFO [vmware.resource.VmwareResource] (DirectAgent-3:vmha62d-ops-08.portal.webmd.com) VM i-2-262-acs-docs-fc17 is already in stopped state
> 2013-09-25 14:37:50,264 DEBUG [agent.transport.Request] (DirectAgent-3:null) Seq 11-1418264581: Processing: { Ans: , MgmtId: 345049078181, via: 11, Ver: v1, Flags: 100, [{"StopAnswer":{"vncPort":0,"result":true,"details":"VM i-2-262-acs-docs-fc17 is already in stopped state","wait":0}},{"StopAnswer":{"vncPort":0,"result":true,"details":"Stop VM i-2-278-demo01t-ops-08 Succeed","wait":0}},{"StopAnswer":{"vncPort":0,"result":true,"details":"Stop VM i-2-281-acs-appliance Succeed","wait":0}},{"StopAnswer":{"vncPort":0,"result":true,"details":"Stop VM i-2-283-vmbld01l-ops-08 Succeed","wait":0}},{"StopAnswer":{"vncPort":0,"result":true,"details":"Stop VM i-2-285-ossec01l-ops-08 Succeed","wait":0}}] }
> 2013-09-25 14:38:14,528 DEBUG [vmware.mo.HostMO] (DirectAgent-20:vmha62d-ops-08.portal.webmd.com) find VM i-2-262-acs-docs-fc17 on host
> If for whatever reason agentState is stopped for vSphere, we should not force power off the VM, neither do we have to restart it.
> The code that causes the restart is in engine/orchestration/src/com/cloud/vm/VirtualMachineManagerImpl.java,
> lines 2365 - 2376
> In this snippet we do the check if cs state == Running and agentState == Stopped
> We should not power off the VM if we cannot retrieve the state!
> 2233 protected Command compareState(long hostId, VMInstanceVO vm, final AgentVmInfo info, final boolean fullSync, boolean trackExternalChange) {
> 2234 State agentState = info.state;
> 2235 final State serverState = vm.getState();
> 2236 final String serverName = vm.getInstanceName();
> 2237
> 2238 Command command = null;
> 2239 s_logger.debug("VM " + serverName + ": cs state = " + serverState + " and realState = " + agentState);
> 2240 if (s_logger.isDebugEnabled()) {
> 2241 s_logger.debug("VM " + serverName + ": cs state = " + serverState + " and realState = " + agentState);
> 2242 }
> 2243
> 2244 if (agentState == State.Error) {
> 2245 agentState = State.Stopped;
> 2246
> 2247 short alertType = AlertManager.ALERT_TYPE_USERVM;
> 2248 if (VirtualMachine.Type.DomainRouter.equals(vm.getType())) {
> 2249 alertType = AlertManager.ALERT_TYPE_DOMAIN_ROUTER;
> 2250 } else if (VirtualMachine.Type.ConsoleProxy.equals(vm.getType())) {
> 2251 alertType = AlertManager.ALERT_TYPE_CONSOLE_PROXY;
> 2252 } else if (VirtualMachine.Type.SecondaryStorageVm.equals(vm.getType())) {
> 2253 alertType = AlertManager.ALERT_TYPE_SSVM;
> 2254 }
> 2255
> 2256 HostPodVO podVO = _podDao.findById(vm.getPodIdToDeployIn());
> 2257 DataCenterVO dcVO = _dcDao.findById(vm.getDataCenterId());
> 2258 HostVO hostVO = _hostDao.findById(vm.getHostId());
> 2259
> 2260 String hostDesc = "name: " + hostVO.getName() + " (id:" + hostVO.getId() + "), availability zone: " + dcVO.getName() + ", pod: " + podVO.getName();
> 2261 _alertMgr.sendAlert(alertType, vm.getDataCenterId(), vm.getPodIdToDeployIn(), "VM (name: " + vm.getInstanceName() + ", id: " + vm.getId() + ") stopped on host " +
> 2262 hostDesc + " due to storage failure",
> 2263 "Virtual Machine " + vm.getInstanceName() + " (id: " + vm.getId() + ") running on host [" + vm.getHostId() + "] stopped due to storage failure.");
> 2264 }
> 2265
> 2266 if (trackExternalChange) {
> 2267 if (serverState == State.Starting) {
> 2268 if (vm.getHostId() != null && vm.getHostId() != hostId) {
> 2269 s_logger.info("CloudStack is starting VM on host " + vm.getHostId() + ", but status report comes from a different host " + hostId +
> 2270 ", skip status sync for vm: " + vm.getInstanceName());
> 2271 return null;
> 2272 }
> 2273 }
> 2274 if (vm.getHostId() == null || hostId != vm.getHostId()) {
> 2275 try {
> 2276 ItWorkVO workItem = _workDao.findByOutstandingWork(vm.getId(), State.Migrating);
> 2277 if (workItem == null) {
> 2278 stateTransitTo(vm, VirtualMachine.Event.AgentReportMigrated, hostId);
> 2279 }
> 2280 } catch (NoTransitionException e) {
> 2281 }
> 2282 }
> 2283 }
> 2284
> 2285 // during VM migration time, don't sync state will agent status update
> 2286 if (serverState == State.Migrating) {
> 2287 s_logger.debug("Skipping vm in migrating state: " + vm);
> 2288 return null;
> 2289 }
> 2290
> 2291 if (trackExternalChange) {
> 2292 if (serverState == State.Starting) {
> 2293 if (vm.getHostId() != null && vm.getHostId() != hostId) {
> 2294 s_logger.info("CloudStack is starting VM on host " + vm.getHostId() + ", but status report comes from a different host " + hostId +
> 2295 ", skip status sync for vm: " + vm.getInstanceName());
> 2296 return null;
> 2297 }
> 2298 }
> 2299
> 2300 if (serverState == State.Running) {
> 2301 try {
> 2302 //
> 2303 // we had a bug that sometimes VM may be at Running State
> 2304 // but host_id is null, we will cover it here.
> 2305 // means that when CloudStack DB lost of host information,
> 2306 // we will heal it with the info reported from host
> 2307 //
> 2308 if (vm.getHostId() == null || hostId != vm.getHostId()) {
> 2309 if (s_logger.isDebugEnabled()) {
> 2310 s_logger.debug("detected host change when VM " + vm + " is at running state, VM could be live-migrated externally from host " + vm.getHostId() +
> 2311 " to host " + hostId);
> 2312 }
> 2313
> 2314 stateTransitTo(vm, VirtualMachine.Event.AgentReportMigrated, hostId);
> 2315 }
> 2316 } catch (NoTransitionException e) {
> 2317 s_logger.warn(e.getMessage());
> 2318 }
> 2319 }
> 2320 }
> 2321
> 2322 if (agentState == serverState) {
> 2323 if (s_logger.isDebugEnabled()) {
> 2324 s_logger.debug("Both states are " + agentState + " for " + vm);
> 2325 }
> 2326 assert (agentState == State.Stopped || agentState == State.Running) : "If the states we send up is changed, this must be changed.";
> 2327 if (agentState == State.Running) {
> 2328 try {
> 2329 stateTransitTo(vm, VirtualMachine.Event.AgentReportRunning, hostId);
> 2330 } catch (NoTransitionException e) {
> 2331 s_logger.warn(e.getMessage());
> 2332 }
> 2333 // FIXME: What if someone comes in and sets it to stopping? Then
> 2334 // what?
> 2335 return null;
> 2336 }
> 2337
> 2338 s_logger.debug("State matches but the agent said stopped so let's send a cleanup command anyways.");
> 2339 return cleanup(vm);
> 2340 }
> 2341
> 2342 if (agentState == State.Shutdowned) {
> 2343 if (serverState == State.Running || serverState == State.Starting || serverState == State.Stopping) {
> 2344 try {
> 2345 advanceStop(vm, true);
> 2346 } catch (AgentUnavailableException e) {
> 2347 assert (false) : "How do we hit this with forced on?";
> 2348 return null;
> 2349 } catch (OperationTimedoutException e) {
> 2350 assert (false) : "How do we hit this with forced on?";
> 2351 return null;
> 2352 } catch (ConcurrentOperationException e) {
> 2353 assert (false) : "How do we hit this with forced on?";
> 2354 return null;
> 2355 }
> 2356 } else {
> 2357 s_logger.debug("Sending cleanup to a shutdowned vm: " + vm.getInstanceName());
> 2358 command = cleanup(vm);
> 2359 }
> 2360 } else if (agentState == State.Stopped) {
> 2361 // This state means the VM on the agent was detected previously
> 2362 // and now is gone. This is slightly different than if the VM
> 2363 // was never completed but we still send down a Stop Command
> 2364 // to ensure there's cleanup.
> 2365 if (serverState == State.Running) {
> 2366 // Our records showed that it should be running so let's restart
> 2367 // it.
> 2368 _haMgr.scheduleRestart(vm, false);
> 2369 } else if (serverState == State.Stopping) {
> 2370 _haMgr.scheduleStop(vm, hostId, WorkType.ForceStop);
> 2371 s_logger.debug("Scheduling a check stop for VM in stopping mode: " + vm);
> 2372 } else if (serverState == State.Starting) {
> 2373 s_logger.debug("Ignoring VM in starting mode: " + vm.getInstanceName());
> 2374 _haMgr.scheduleRestart(vm, false);
> 2375 }
> 2376 command = cleanup(vm);
> I'd be happy to test the fix if such is comes out.
> Thanks
> ilya
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira