You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@cloudstack.apache.org by "Damodar Reddy T (JIRA)" <ji...@apache.org> on 2013/12/19 11:58:07 UTC
[jira] [Commented] (CLOUDSTACK-4740) Some vSphere VMs are shutdown
when ACS is restarted
[ https://issues.apache.org/jira/browse/CLOUDSTACK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13852807#comment-13852807 ]
Damodar Reddy T commented on CLOUDSTACK-4740:
---------------------------------------------
Hi ilya musayev,
I tried to re produce this use case with the following set up and not able to re produce it.
1. ESXi 5.1 and ACS 4.3
2. launched 3 to 4 vms
3. Added a fire wall(iptable rule) in MS machine not to access my vSphere host.
now restarted the management server and all my VMs(including system vms) are in Running state only. and once I removed the firewall restriction on MS machine all went fine.
Is this the correct way to re produce this use case? If not can you please share re producible steps so that I will try once again.
If not can I close this defect saying not re producible?
Thanks & Regards
Damoder
> Some vSphere VMs are shutdown when ACS is restarted
> ---------------------------------------------------
>
> Key: CLOUDSTACK-4740
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-4740
> Project: CloudStack
> Issue Type: Bug
> Security Level: Public(Anyone can view this level - this is the default.)
> Components: Management Server
> Affects Versions: 4.1.0, 4.1.1, 4.2.0, Future
> Environment: I'm running ACS 4.1.1 with vSphere 5.1
> Reporter: ilya musayev
> Assignee: Damodar Reddy T
> Priority: Critical
> Labels: management, poweroff
> Fix For: 4.3.0
>
>
> If management server is restarted, when management server starts - it checks whether the agentState for vSphere VMs and if it does not get a proper response, it marks them as stopped.
> As the result, some of my virtual instances would shutdown.
> Attempting to analyze this issue further, here are my findings and errors seen in the log.
> 2013-09-25 14:35:49,928 DEBUG [vmware.resource.VmwareResource] (AgentTaskPool-1:null) Detecting a new state but couldn't find a old state so adding it to the changes: i-2-262-acs-docs-fc17
> 2013-09-25 14:35:51,213 DEBUG [agent.transport.Request] (AgentTaskPool-1:null) Seq -1--1: Startup request from directly connected host: { Cmd , MgmtId: -1, via: -1, Ver: v1, Flags: 11, [{"cpus":16,"speed":2199,"memory":68683468800,"dom0MinMemory":0,"poolSync":false,"vms":{"i-8-270-CLOUD411":{"state":"Running"},"r-15-CLOUD41-OLD":{"state":"Stopped"},"v-260-CLOUD411":{"state":"Running"},"i-2-283-vmbld01l-ops-08":{"state":"Running"},"i-2-104-ACS41VM":{"state":"Running"},"--s-1-CLOUD41-OLD":{"state":"Running"},"i-27-280-CLOUD411":{"state":"Running"},"i-2-285-ossec01l-ops-08":{"state":"Running"},"i-2-262-acs-docs-fc17":{"state":"Stopped"},"i-24-265-test3":{"state":"Running"},"cloud01l-ops-08.portal.webmd.com":{"state":"Running"},"i-2-278-demo01t-ops-08":{"state":"Running"},"s-63-CLOUD411":{"state":"Running"},"r-66-CLOUD411":{"state":"Running"},"i-2-281-acs-appliance":{"state":"Running"}},"caps":"hvm","hypervisorType":"VMware","hostDetails":{"com.cloud.network.Networks.RouterPrivateIpStrategy":"DcGlobal","NativeHA":"true"},"hypervisorVersion":"5.0","type":"Routing","dataCenter":"2","pod":"2","cluster":"3","guid":"HostSystem:host-19716@vc00q-ops-08.portal.webmd.com","name":"vmha62d-ops-08.portal.webmd.com","version":"4.1.1-SNAPSHOT","privateIpAddress":"172.25.243.31","privateMacAddress":"68:b5:99:73:0b:c2","privateNetmask":"255.255.255.0","storageIpAddress":"172.25.243.31","storageNetmask":"255.255.255.0","storageMacAddress":"68:b5:99:73:0b:c2","wait":0},{"totalSize":0,"poolInfo":{"uuid":"72c8aedb-58c4-4569-ac51-adc5af770bf6","host":"vmha62d-ops-08.portal.webmd.com","localPath":"","hostPath":"datastore-19718","poolType":"LVM","capacityBytes":141465485312,"availableBytes":140383354880},"resourceType":"STORAGE_POOL","hostDetails":{},"type":"Storage","dataCenter":"2","pod":"2","cluster":"3","guid":"72c8aedb-58c4-4569-ac51-adc5af770bf6","name":"72c8aedb-58c4-4569-ac51-adc5af770bf6","wait":0}] }
> 2013-09-25 14:35:53,614 DEBUG [cloud.vm.VirtualMachineManagerImpl] (AgentTaskPool-1:null) VM i-2-262-acs-docs-fc17: cs state = Running and realState = Stopped
> 2013-09-25 14:35:53,614 DEBUG [cloud.vm.VirtualMachineManagerImpl] (AgentTaskPool-1:null) VM i-2-262-acs-docs-fc17: cs state = Running and realState = Stopped
> 2013-09-25 14:35:53,614 INFO [cloud.ha.HighAvailabilityManagerImpl] (AgentTaskPool-1:null) Skip HA for VMware VM i-2-262-acs-docs-fc17
> 2013-09-25 14:35:53,694 DEBUG [agent.transport.Request] (AgentTaskPool-1:null) Seq 11-1418264581: Sending { Cmd , MgmtId: 345049078181, via: 11, Ver: v1, Flags: 100101, [{"StopCommand":{"isProxy":false,"vmName":"i-2-262-acs-docs-fc17","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-278-demo01t-ops-08","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-281-acs-appliance","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-283-vmbld01l-ops-08","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-285-ossec01l-ops-08","wait":0}}] }
> 2013-09-25 14:35:53,695 DEBUG [agent.transport.Request] (AgentTaskPool-1:null) Seq 11-1418264581: Executing: { Cmd , MgmtId: 345049078181, via: 11, Ver: v1, Flags: 100101, [{"StopCommand":{"isProxy":false,"vmName":"i-2-262-acs-docs-fc17","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-278-demo01t-ops-08","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-281-acs-appliance","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-283-vmbld01l-ops-08","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-285-ossec01l-ops-08","wait":0}}] }
> 2013-09-25 14:35:53,702 INFO [vmware.resource.VmwareResource] (DirectAgent-3:vmha62d-ops-08.portal.webmd.com) Executing resource StopCommand: {"isProxy":false,"vmName":"i-2-262-acs-docs-fc17","wait":0}
> 2013-09-25 14:35:53,703 DEBUG [vmware.mo.HostMO] (DirectAgent-3:vmha62d-ops-08.portal.webmd.com) find VM i-2-262-acs-docs-fc17 on host
> 2013-09-25 14:35:54,753 INFO [vmware.resource.VmwareResource] (DirectAgent-3:vmha62d-ops-08.portal.webmd.com) VM i-2-262-acs-docs-fc17 is already in stopped state
> 2013-09-25 14:37:50,264 DEBUG [agent.transport.Request] (DirectAgent-3:null) Seq 11-1418264581: Processing: { Ans: , MgmtId: 345049078181, via: 11, Ver: v1, Flags: 100, [{"StopAnswer":{"vncPort":0,"result":true,"details":"VM i-2-262-acs-docs-fc17 is already in stopped state","wait":0}},{"StopAnswer":{"vncPort":0,"result":true,"details":"Stop VM i-2-278-demo01t-ops-08 Succeed","wait":0}},{"StopAnswer":{"vncPort":0,"result":true,"details":"Stop VM i-2-281-acs-appliance Succeed","wait":0}},{"StopAnswer":{"vncPort":0,"result":true,"details":"Stop VM i-2-283-vmbld01l-ops-08 Succeed","wait":0}},{"StopAnswer":{"vncPort":0,"result":true,"details":"Stop VM i-2-285-ossec01l-ops-08 Succeed","wait":0}}] }
> 2013-09-25 14:38:14,528 DEBUG [vmware.mo.HostMO] (DirectAgent-20:vmha62d-ops-08.portal.webmd.com) find VM i-2-262-acs-docs-fc17 on host
> If for whatever reason agentState is stopped for vSphere, we should not force power off the VM, neither do we have to restart it.
> The code that causes the restart is in engine/orchestration/src/com/cloud/vm/VirtualMachineManagerImpl.java,
> lines 2365 - 2376
> In this snippet we do the check if cs state == Running and agentState == Stopped
> We should not power off the VM if we cannot retrieve the state!
> 2233 protected Command compareState(long hostId, VMInstanceVO vm, final AgentVmInfo info, final boolean fullSync, boolean trackExternalChange) {
> 2234 State agentState = info.state;
> 2235 final State serverState = vm.getState();
> 2236 final String serverName = vm.getInstanceName();
> 2237
> 2238 Command command = null;
> 2239 s_logger.debug("VM " + serverName + ": cs state = " + serverState + " and realState = " + agentState);
> 2240 if (s_logger.isDebugEnabled()) {
> 2241 s_logger.debug("VM " + serverName + ": cs state = " + serverState + " and realState = " + agentState);
> 2242 }
> 2243
> 2244 if (agentState == State.Error) {
> 2245 agentState = State.Stopped;
> 2246
> 2247 short alertType = AlertManager.ALERT_TYPE_USERVM;
> 2248 if (VirtualMachine.Type.DomainRouter.equals(vm.getType())) {
> 2249 alertType = AlertManager.ALERT_TYPE_DOMAIN_ROUTER;
> 2250 } else if (VirtualMachine.Type.ConsoleProxy.equals(vm.getType())) {
> 2251 alertType = AlertManager.ALERT_TYPE_CONSOLE_PROXY;
> 2252 } else if (VirtualMachine.Type.SecondaryStorageVm.equals(vm.getType())) {
> 2253 alertType = AlertManager.ALERT_TYPE_SSVM;
> 2254 }
> 2255
> 2256 HostPodVO podVO = _podDao.findById(vm.getPodIdToDeployIn());
> 2257 DataCenterVO dcVO = _dcDao.findById(vm.getDataCenterId());
> 2258 HostVO hostVO = _hostDao.findById(vm.getHostId());
> 2259
> 2260 String hostDesc = "name: " + hostVO.getName() + " (id:" + hostVO.getId() + "), availability zone: " + dcVO.getName() + ", pod: " + podVO.getName();
> 2261 _alertMgr.sendAlert(alertType, vm.getDataCenterId(), vm.getPodIdToDeployIn(), "VM (name: " + vm.getInstanceName() + ", id: " + vm.getId() + ") stopped on host " +
> 2262 hostDesc + " due to storage failure",
> 2263 "Virtual Machine " + vm.getInstanceName() + " (id: " + vm.getId() + ") running on host [" + vm.getHostId() + "] stopped due to storage failure.");
> 2264 }
> 2265
> 2266 if (trackExternalChange) {
> 2267 if (serverState == State.Starting) {
> 2268 if (vm.getHostId() != null && vm.getHostId() != hostId) {
> 2269 s_logger.info("CloudStack is starting VM on host " + vm.getHostId() + ", but status report comes from a different host " + hostId +
> 2270 ", skip status sync for vm: " + vm.getInstanceName());
> 2271 return null;
> 2272 }
> 2273 }
> 2274 if (vm.getHostId() == null || hostId != vm.getHostId()) {
> 2275 try {
> 2276 ItWorkVO workItem = _workDao.findByOutstandingWork(vm.getId(), State.Migrating);
> 2277 if (workItem == null) {
> 2278 stateTransitTo(vm, VirtualMachine.Event.AgentReportMigrated, hostId);
> 2279 }
> 2280 } catch (NoTransitionException e) {
> 2281 }
> 2282 }
> 2283 }
> 2284
> 2285 // during VM migration time, don't sync state will agent status update
> 2286 if (serverState == State.Migrating) {
> 2287 s_logger.debug("Skipping vm in migrating state: " + vm);
> 2288 return null;
> 2289 }
> 2290
> 2291 if (trackExternalChange) {
> 2292 if (serverState == State.Starting) {
> 2293 if (vm.getHostId() != null && vm.getHostId() != hostId) {
> 2294 s_logger.info("CloudStack is starting VM on host " + vm.getHostId() + ", but status report comes from a different host " + hostId +
> 2295 ", skip status sync for vm: " + vm.getInstanceName());
> 2296 return null;
> 2297 }
> 2298 }
> 2299
> 2300 if (serverState == State.Running) {
> 2301 try {
> 2302 //
> 2303 // we had a bug that sometimes VM may be at Running State
> 2304 // but host_id is null, we will cover it here.
> 2305 // means that when CloudStack DB lost of host information,
> 2306 // we will heal it with the info reported from host
> 2307 //
> 2308 if (vm.getHostId() == null || hostId != vm.getHostId()) {
> 2309 if (s_logger.isDebugEnabled()) {
> 2310 s_logger.debug("detected host change when VM " + vm + " is at running state, VM could be live-migrated externally from host " + vm.getHostId() +
> 2311 " to host " + hostId);
> 2312 }
> 2313
> 2314 stateTransitTo(vm, VirtualMachine.Event.AgentReportMigrated, hostId);
> 2315 }
> 2316 } catch (NoTransitionException e) {
> 2317 s_logger.warn(e.getMessage());
> 2318 }
> 2319 }
> 2320 }
> 2321
> 2322 if (agentState == serverState) {
> 2323 if (s_logger.isDebugEnabled()) {
> 2324 s_logger.debug("Both states are " + agentState + " for " + vm);
> 2325 }
> 2326 assert (agentState == State.Stopped || agentState == State.Running) : "If the states we send up is changed, this must be changed.";
> 2327 if (agentState == State.Running) {
> 2328 try {
> 2329 stateTransitTo(vm, VirtualMachine.Event.AgentReportRunning, hostId);
> 2330 } catch (NoTransitionException e) {
> 2331 s_logger.warn(e.getMessage());
> 2332 }
> 2333 // FIXME: What if someone comes in and sets it to stopping? Then
> 2334 // what?
> 2335 return null;
> 2336 }
> 2337
> 2338 s_logger.debug("State matches but the agent said stopped so let's send a cleanup command anyways.");
> 2339 return cleanup(vm);
> 2340 }
> 2341
> 2342 if (agentState == State.Shutdowned) {
> 2343 if (serverState == State.Running || serverState == State.Starting || serverState == State.Stopping) {
> 2344 try {
> 2345 advanceStop(vm, true);
> 2346 } catch (AgentUnavailableException e) {
> 2347 assert (false) : "How do we hit this with forced on?";
> 2348 return null;
> 2349 } catch (OperationTimedoutException e) {
> 2350 assert (false) : "How do we hit this with forced on?";
> 2351 return null;
> 2352 } catch (ConcurrentOperationException e) {
> 2353 assert (false) : "How do we hit this with forced on?";
> 2354 return null;
> 2355 }
> 2356 } else {
> 2357 s_logger.debug("Sending cleanup to a shutdowned vm: " + vm.getInstanceName());
> 2358 command = cleanup(vm);
> 2359 }
> 2360 } else if (agentState == State.Stopped) {
> 2361 // This state means the VM on the agent was detected previously
> 2362 // and now is gone. This is slightly different than if the VM
> 2363 // was never completed but we still send down a Stop Command
> 2364 // to ensure there's cleanup.
> 2365 if (serverState == State.Running) {
> 2366 // Our records showed that it should be running so let's restart
> 2367 // it.
> 2368 _haMgr.scheduleRestart(vm, false);
> 2369 } else if (serverState == State.Stopping) {
> 2370 _haMgr.scheduleStop(vm, hostId, WorkType.ForceStop);
> 2371 s_logger.debug("Scheduling a check stop for VM in stopping mode: " + vm);
> 2372 } else if (serverState == State.Starting) {
> 2373 s_logger.debug("Ignoring VM in starting mode: " + vm.getInstanceName());
> 2374 _haMgr.scheduleRestart(vm, false);
> 2375 }
> 2376 command = cleanup(vm);
> I'd be happy to test the fix if such is comes out.
> Thanks
> ilya
--
This message was sent by Atlassian JIRA
(v6.1.4#6159)