Posted to issues@cloudstack.apache.org by "Koushik Das (JIRA)" <ji...@apache.org> on 2013/07/18 14:30:48 UTC

[jira] [Commented] (CLOUDSTACK-3421) When hypervisor is down, no HA occurs with log output "Agent state cannot be determined, do nothing"

    [ https://issues.apache.org/jira/browse/CLOUDSTACK-3421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13712273#comment-13712273 ] 

Koushik Das commented on CLOUDSTACK-3421:
-----------------------------------------

The current behavior is by design. The management server cannot conclusively determine the host's state because it is unable to communicate with the agent, so it does nothing. HA is only triggered when a host's state can be determined to be down.
Basically there is no easy way to distinguish whether the host is actually down or merely disconnected from the network. The problem is that if HA is performed on the VMs and the original VMs later come back up, the result is disk corruption.
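To make that decision concrete, here is a minimal illustrative sketch (not the actual AgentManagerImpl code; the class, method and enum names below are made up):

    // Illustrative only -- not CloudStack source. HA is scheduled only when the host
    // is conclusively Down; an indeterminate state results in no action, because
    // restarting VMs whose originals may still be running risks disk corruption.
    public class HaDecisionSketch {
        enum HostState { UP, DOWN, DISCONNECTED, UNKNOWN }

        void onAgentUnreachable(long hostId) {
            HostState state = investigate(hostId);
            if (state == HostState.DOWN) {
                scheduleHaForVmsOn(hostId);   // safe: the host is known to be dead
            } else {
                // mirrors the observed "Agent state cannot be determined, do nothing"
                System.out.println("Agent state cannot be determined, do nothing");
            }
        }

        // Stand-ins for the real checks: ping the host, ask peer hosts, storage heartbeat, etc.
        HostState investigate(long hostId) { return HostState.UNKNOWN; }

        void scheduleHaForVmsOn(long hostId) { /* restart the host's HA-enabled VMs elsewhere */ }
    }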

                
> When hypervisor is down, no HA occurs with log output "Agent state cannot be determined, do nothing"
> ----------------------------------------------------------------------------------------------------
>
>                 Key: CLOUDSTACK-3421
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-3421
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public (Anyone can view this level - this is the default.) 
>          Components: KVM, Management Server
>    Affects Versions: 4.1.0
>         Environment: CentOS 6.4 minimal install
> Libvirt, KVM/Qemu
> CloudStack 4.1
> GlusterFS 3.2, replicated+distributed as primary storage via Shared Mount Point
> 3 physical servers
> * 1 management server, running NFS secondary storage
> ** 1 nic, management+storage
> * 2 hypervisor nodes, running glusterfs-server 
> ** 4x nic, management+storage, public, guest, gluster peering
> * Advanced zone
> * KVM
> * 4 networks: 
>  eth0: cloudbr0: management+secondary storage, 
>  eth2: cloudbr1: public
>  eth3: cloudbr2: guest
>  eth1: gluster peering
> * Shared Mount Point
> * custom network offering with redundant routers enabled
> * global settings tweaked to increase speed of identifying down state
> ** ping.interval: 10sec
>            Reporter: Gerard Lynch
>            Priority: Critical
>             Fix For: 4.1.1, 4.2.0, Future
>
>         Attachments: catalina_management-server.zip
>
>
> We wanted to test CloudStack's HA capabilities by simulating outages and measuring how long recovery would take. One of the tests simulated the loss of a hypervisor node by shutting it down. When we tested this, we found that CloudStack failed to bring up any of the VMs (System or Instance) that were on the down node until the node was powered back up and reconnected.
> In the logs, we see repeating occurrences of:
> INFO  [utils.exception.CSExceptionErrorCode] (AgentTaskPool-11:) Could not find exception: com.cloud.exception.OperationTimedoutException in error code list for exceptions
> INFO  [utils.exception.CSExceptionErrorCode] (AgentTaskPool-10:) Could not find exception: com.cloud.exception.OperationTimedoutException in error code list for exceptions
> WARN  [agent.manager.AgentAttache] (AgentTaskPool-11:) Seq 14-660013135: Timed out on Seq 14-660013135:  { Cmd , MgmtId: 93515041483, via: 14, Ver: v1, Flags: 100011, [{"CheckHealthCommand":{"wait":50}}] }
> WARN  [agent.manager.AgentAttache] (AgentTaskPool-10:) Seq 15-1097531400: Timed out on Seq 15-1097531400:  { Cmd , MgmtId: 93515041483, via: 15, Ver: v1, Flags: 100011, [{"CheckHealthCommand":{"wait":50}}] }
> WARN  [agent.manager.AgentManagerImpl] (AgentTaskPool-11:) Operation timed out: Commands 660013135 to Host 14 timed out after 100
> WARN  [agent.manager.AgentManagerImpl] (AgentTaskPool-10:) Operation timed out: Commands 1097531400 to Host 15 timed out after 100
> WARN  [agent.manager.AgentManagerImpl] (AgentTaskPool-11:) Agent state cannot be determined, do nothing
> WARN  [agent.manager.AgentManagerImpl] (AgentTaskPool-10:) Agent state cannot be determined, do nothing
> To reproduce: 
> 1. Build the environment as detailed above
> 2. Register an ISO
> 3. Create a new guest network using the custom network offering (that offers redundant routers)
> 4. Provision an instance
> 5. Ensure the system VMs and instance are on the first hypervisor node
> 6. Shut down the first hypervisor node (or pull the plug)
> Expected result:
>   All system VMs and instance(s) should be brought up on the 2nd hypervisor node.
> Actual result:
>   We see the first hypervisor node marked "disconnected."
>   All System VMs and the Instance are still marked "Running"; however, pings to any of them fail. 
>   Ping to the redundant router on the 2nd hypervisor node is still working.
>   We see in the logs 
>   "INFO  [utils.exception.CSExceptionErrorCode] (AgentTaskPool-11:) Could not find exception: com.cloud.exception.OperationTimedoutException in error code list for exceptions"
>   Followed by
>   "Agent state cannot be determined, do nothing"
> Searching for "Cloudstack Agent state cannot be determined, do nothing" led to: CLOUDSTACK-803 - https://reviews.apache.org/r/8853/
> That caused me some concern, because if I read the logic in that ticket correctly, the management server will not perform any HA actions if it is unable to determine the state of a hypervisor node. In the scenario above it is not a loss of connectivity but an actual outage on the hypervisor, so I would rather HA occur. Split brain is a concern, but I think something along the lines of "if the hypervisor can't see the management server or the gateway, stop its instances" is more relevant than "do nothing".
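> To illustrate the kind of check I mean (a purely hypothetical sketch; the class and helper names below are made up and none of this is existing CloudStack code):
>     // Hypothetical periodic check run on the hypervisor itself: if neither the
>     // management server nor the default gateway is reachable, assume this host is
>     // the isolated side and stop its VMs so HA can restart them elsewhere without
>     // risking split brain.
>     public class SelfFenceSketch {
>         void check(String managementServerIp, String gatewayIp) {
>             if (!reachable(managementServerIp) && !reachable(gatewayIp)) {
>                 stopAllLocalVms();
>             }
>         }
>         boolean reachable(String ip) {
>             try {
>                 return java.net.InetAddress.getByName(ip).isReachable(2000);
>             } catch (java.io.IOException e) {
>                 return false;
>             }
>         }
>         void stopAllLocalVms() { /* e.g. shut down or pause all local guests via libvirt */ }
>     }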
> I'm hoping this is something really obvious and simple to resolve, because otherwise this is a pretty serious issue: as it stands, any accidental shutdown or hardware fault causes a continuous outage that requires manual action to resolve.
> Thanks

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira