Posted to issues@cloudstack.apache.org by "Koushik Das (JIRA)" <ji...@apache.org> on 2014/06/12 14:29:02 UTC

[jira] [Commented] (CLOUDSTACK-6857) Losing the connection from CloudStack Manager to the agent will force a shutdown when connection is re-established

    [ https://issues.apache.org/jira/browse/CLOUDSTACK-6857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029099#comment-14029099 ] 

Koushik Das commented on CLOUDSTACK-6857:
-----------------------------------------

Can you share the full logs? Based on the log snippet, none of the available investigators were able to determine whether the VM is alive. In such a case, components called 'fencers' try to fence off the VM; if the fencers also fail, nothing is done to the VM. The full logs will help in understanding everything that happened.
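
For reference, the investigate-then-fence flow works roughly like the sketch below. This is a simplified illustration only, not the actual HighAvailabilityManagerImpl code; the Investigator and FenceBuilder interfaces and the VirtualMachine/Host stubs are trimmed-down stand-ins for the real CloudStack types.

    import java.util.List;

    class HaCheckSketch {
        static class VirtualMachine {}   // stand-in for the real CloudStack type
        static class Host {}             // stand-in for the real CloudStack type

        interface Investigator {
            // TRUE/FALSE when the state is known, null when it cannot tell
            Boolean isVmAlive(VirtualMachine vm, Host host);
        }

        interface FenceBuilder {
            // TRUE if the VM was successfully fenced off the host
            Boolean fenceOff(VirtualMachine vm, Host host);
        }

        static Boolean checkVmAlive(VirtualMachine vm, Host host,
                List<Investigator> investigators, List<FenceBuilder> fencers) {
            for (Investigator inv : investigators) {
                Boolean alive = inv.isVmAlive(vm, host);
                if (alive != null) {
                    return alive;        // a definitive answer ends the search
                }
                // null here produces the "found VM ... to be alive? null" log lines
            }
            // No investigator could decide, so try to fence the VM off the host.
            for (FenceBuilder fencer : fencers) {
                Boolean fenced = fencer.fenceOff(vm, host);
                if (Boolean.TRUE.equals(fenced)) {
                    return false;        // fenced off: safe to treat as down
                }
            }
            return null;                 // fencing failed as well: nothing is done
        }
    }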

> Losing the connection from CloudStack Manager to the agent will force a shutdown when connection is re-established
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: CLOUDSTACK-6857
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-6857
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the default.) 
>          Components: Management Server
>    Affects Versions: 4.3.0
>         Environment: Ubuntu 12.04
>            Reporter: c-hemp
>            Priority: Critical
>
> If a physical host is not pingable, that host goes into alert mode. While the physical host is unreachable, the virtual router is either unreachable itself or unable to ping a virtual machine on that host, and since the manager is also unable to ping the virtual instance, it assumes the instance is down and puts it into a stopped state.
> When the connection is re-established, the manager reads the state from the database, sees that the instance is now marked as stopped, and shuts the instance down.
> This behavior can cause major outages if there is any kind of network loss once connectivity comes back. It is especially critical when running CloudStack across multiple colocation facilities.
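> To make the failure mode concrete, the reconnect path appears to behave like the sketch below. This is a hypothetical illustration with made-up names, not the actual management-server code: the database record wins over the live state of the instance.
>
>     import java.util.List;
>
>     class ReconnectSketch {
>         enum State { Running, Stopped }
>
>         static class Vm {
>             State dbState = State.Stopped;  // set while the host was unreachable
>         }
>
>         static void onAgentReconnect(List<Vm> vmsOnHost) {
>             for (Vm vm : vmsOnHost) {
>                 if (vm.dbState == State.Stopped) {
>                     forceStop(vm);  // shuts down an instance that may be running
>                 }
>             }
>         }
>
>         static void forceStop(Vm vm) {
>             // placeholder for the force stop seen in the WARN lines below
>         }
>     }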
> The logs when it happens:
> 2014-06-06 02:01:22,259 INFO  [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-1:ctx-be848615 work-1953) PingInvestigator found VM[User|cephvmstage013]to be alive? null
> 2014-06-06 02:01:22,259 DEBUG [c.c.h.ManagementIPSystemVMInvestigator] (HA-Worker-1:ctx-be848615 work-1953) Not a System Vm, unable to determine state of VM[User|cephvmstage013] returning null
> 2014-06-06 02:01:22,259 DEBUG [c.c.h.ManagementIPSystemVMInvestigator] (HA-Worker-1:ctx-be848615 work-1953) Testing if VM[User|cephvmstage013] is alive
> 2014-06-06 02:01:22,260 DEBUG [c.c.h.ManagementIPSystemVMInvestigator] (HA-Worker-1:ctx-be848615 work-1953) Unable to find a management nic, cannot ping this system VM, unable to determine state of VM[User|cephvmstage013] returning null
> 2014-06-06 02:01:22,260 INFO  [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-1:ctx-be848615 work-1953) ManagementIPSysVMInvestigator found VM[User|cephvmstage013]to be alive? null
> 2014-06-06 02:01:22,263 INFO  [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-4:ctx-e8eea7fb work-1950) KVMInvestigator found VM[User|cephvmstage013]to be alive? null
> 2014-06-06 02:01:22,263 INFO  [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-4:ctx-e8eea7fb work-1950) HypervInvestigator found VM[User|cephvmstage013]to be alive? null
> 2014-06-06 02:01:22,419 INFO  [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-1:ctx-be848615 work-1953) KVMInvestigator found VM[User|cephvmstage013]to be alive? null
> 2014-06-06 02:01:22,419 INFO  [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-1:ctx-be848615 work-1953) HypervInvestigator found VM[User|cephvmstage013]to be alive? null
> 2014-06-06 02:01:22,584 WARN  [c.c.v.VirtualMachineManagerImpl] (HA-Worker-1:ctx-be848615 work-1953) Unable to actually stop VM[User|cephvmstage013] but continue with release because it's a force stop
> 2014-06-06 02:01:22,585 DEBUG [c.c.v.VirtualMachineManagerImpl] (HA-Worker-1:ctx-be848615 work-1953) VM[User|cephvmstage013] is stopped on the host.  Proceeding to release resource held.
> 2014-06-06 02:01:22,648 WARN  [c.c.v.VirtualMachineManagerImpl] (HA-Worker-4:ctx-e8eea7fb work-1950) Unable to actually stop VM[User|cephvmstage013] but continue with release because it's a force stop
> 2014-06-06 02:01:22,650 DEBUG [c.c.v.VirtualMachineManagerImpl] (HA-Worker-4:ctx-e8eea7fb work-1950) VM[User|cephvmstage013] is stopped on the host.  Proceeding to release resource held.
> 2014-06-06 02:01:22,704 DEBUG [c.c.v.VirtualMachineManagerImpl] (HA-Worker-4:ctx-e8eea7fb work-1950) Successfully released network resources for the vm VM[User|cephvmstage013]
> 2014-06-06 02:01:22,704 DEBUG [c.c.v.VirtualMachineManagerImpl] (HA-Worker-4:ctx-e8eea7fb work-1950) Successfully released storage resources for the vm VM[User|cephvmstage013]
> 2014-06-06 02:01:22,774 DEBUG [c.c.v.VirtualMachineManagerImpl] (HA-Worker-1:ctx-be848615 work-1953) Successfully released network resources for the vm VM[User|cephvmstage013]
> 2014-06-06 02:01:22,774 DEBUG [c.c.v.VirtualMachineManagerImpl] (HA-Worker-1:ctx-be848615 work-1953) Successfully released storage resources for the vm VM[User|cephvmstage013]
> The behavior should change so that the instance is put into an alert state instead; then, once connectivity is re-established, if the instance is actually up, the manager should be updated with its running status.
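> A sketch of that proposed behavior (hypothetical names, not real CloudStack code): keep the instance in an alert state while the host is unreachable, and on reconnect trust the power state reported by the host rather than the database record.
>
>     import java.util.List;
>
>     class ProposedSketch {
>         enum State { Running, Stopped, Alert }
>         enum PowerState { PowerOn, PowerOff }
>
>         static class Vm {
>             State dbState = State.Alert;  // alert, not Stopped, during the outage
>         }
>
>         static void onAgentReconnect(List<Vm> vmsOnHost) {
>             for (Vm vm : vmsOnHost) {
>                 if (vm.dbState == State.Alert) {
>                     // queryPowerState stands in for asking the hypervisor/agent
>                     vm.dbState = (queryPowerState(vm) == PowerState.PowerOn)
>                             ? State.Running : State.Stopped;
>                 }
>             }
>         }
>
>         static PowerState queryPowerState(Vm vm) {
>             return PowerState.PowerOn;  // placeholder: real code queries the agent
>         }
>     }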



--
This message was sent by Atlassian JIRA
(v6.2#6252)