Posted to issues@cloudstack.apache.org by "Brenn Oosterbaan (JIRA)" <ji...@apache.org> on 2014/09/10 10:10:28 UTC

[jira] [Comment Edited] (CLOUDSTACK-7184) HA should wait for at least 'xen.heartbeat.interval' sec before starting HA on vm's when host is marked down

    [ https://issues.apache.org/jira/browse/CLOUDSTACK-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128206#comment-14128206 ] 

Brenn Oosterbaan edited comment on CLOUDSTACK-7184 at 9/10/14 8:09 AM:
-----------------------------------------------------------------------

"I've seen similar with KVM - I'm not sure this is necessarily tied to Xen? I'd suggest that possibly CS be a little more thorough before deciding a VM is down...maybe via channels other than the agent/VR?"

John is right on the money here. Although the patch committed by Daan does make it possible to specify a check interval for the Xen storage heartbeat script (instead of using the default of 5 seconds), it does not address the root cause of this issue.

There are two mechanisms at work here: the Xen heartbeat script, which checks whether the storage is reachable from a specific hypervisor, and CloudStack itself, which determines whether a hypervisor is up or not.

When we set the Xen heartbeat interval to 180 seconds we basically said: it is acceptable for VMs living on a hypervisor to 'hang' for 180 seconds in case of storage fail-overs or other issues.
CloudStack has its own checking mechanisms to determine whether a hypervisor is down. Those checks are not in line with the Xen heartbeat interval, which means that even though we decided 180 seconds of unavailability is fine, CloudStack tries to connect to the hypervisor 3 times (in ~30 seconds), then decides it is down and starts the VMs on another hypervisor.
That is the issue/bug Remi meant to identify when filing this ticket.
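
To make the mismatch concrete, here is a minimal sketch in Java (hypothetical names, not actual CloudStack code) of the guard the ticket title asks for: only start HA once the host has been unreachable for at least xen.heartbeat.interval seconds.

    // Hypothetical sketch, not actual CloudStack code. Before the heartbeat
    // interval has elapsed the Xen heartbeat script may not have fenced the
    // host yet, so starting HA earlier risks running the same VM twice.
    public class HaGuard {
        private final long xenHeartbeatIntervalMs; // e.g. 180_000 when set to 180 s

        public HaGuard(long xenHeartbeatIntervalSec) {
            this.xenHeartbeatIntervalMs = xenHeartbeatIntervalSec * 1000L;
        }

        /** @param lastSeenMs epoch millis of the last successful contact with the host */
        public boolean safeToStartHa(long lastSeenMs) {
            long unreachableForMs = System.currentTimeMillis() - lastSeenMs;
            // With the interval at 180 s and CloudStack giving up after ~30 s,
            // this stays false for another ~150 s, closing the double-run window.
            return unreachableForMs >= xenHeartbeatIntervalMs;
        }
    }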

I personally feel there should be two additional options: hypervisor.heartbeat.interval and hypervisor.heartbeat.max_retry.
This would allow us to, for instance, set the interval to 15 seconds and max_retry to 12, which would also add up to 180 seconds.
Since the default heartbeat timeout is 60 seconds, I would set the defaults for these to a combination that adds up to 60 seconds as well. Otherwise you can never be sure the hypervisor itself has actually rebooted, and VM corruption could still take place. A sketch of how the two settings could combine follows below.
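
A minimal sketch of how the two proposed settings could combine (the names and probe interface are hypothetical, mirroring the proposal above rather than any existing CloudStack API):

    // Hypothetical sketch of the proposed settings; not an existing CloudStack API.
    // hypervisor.heartbeat.interval = 15 and hypervisor.heartbeat.max_retry = 12
    // give 15 * 12 = 180 s before the host is declared down, matching a Xen
    // heartbeat interval of 180 s; defaults adding up to 60 s would match the
    // default heartbeat timeout.
    public class HostDownDetector {

        /** Minimal probe abstraction so the sketch is self-contained. */
        public interface HostProbe {
            boolean ping(); // true if the hypervisor answered
        }

        public boolean isHostDown(HostProbe probe, int intervalSec, int maxRetry)
                throws InterruptedException {
            for (int attempt = 1; attempt <= maxRetry; attempt++) {
                if (probe.ping()) {
                    return false; // host answered, so it is not down
                }
                Thread.sleep(intervalSec * 1000L); // wait one interval between retries
            }
            return true; // unreachable for intervalSec * maxRetry seconds in total
        }
    }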

regards,

Brenn


> HA should wait for at least 'xen.heartbeat.interval' sec before starting HA on vm's when host is marked down
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: CLOUDSTACK-7184
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7184
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the default.) 
>          Components: Hypervisor Controller, Management Server, XenServer
>    Affects Versions: 4.3.0, 4.4.0, 4.5.0
>         Environment: CloudStack 4.3 with XenServer 6.2 hypervisors
>            Reporter: Remi Bergsma
>            Assignee: Daan Hoogland
>            Priority: Blocker
>
> Hypervisor got isolated for 30 seconds due to a network issue. CloudStack discovered this, marked the host as down, and immediately started HA. Just 18 seconds later the hypervisor returned and we ended up with 5 VMs that were running on two hypervisors at the same time. 
> This, of course, resulted in file system corruption and the loss of the VMs. One side of the story is why XenServer allowed this to happen (we will not bother you with that one). The CloudStack side of the story: HA should only start after at least xen.heartbeat.interval seconds. If the host is down long enough, the Xen heartbeat script will fence the hypervisor and prevent corruption. If it is not down long enough, nothing should happen.
> Logs (short):
> 2014-07-25 05:03:28,596 WARN  [c.c.a.m.DirectAgentAttache] (DirectAgent-122:ctx-690badc5) Unable to get current status on 505(mccpvmXX)
> .....
> 2014-07-25 05:03:31,920 ERROR [c.c.a.m.AgentManagerImpl] (AgentTaskPool-10:ctx-11b9af3e) Host is down: 505-mccpvmXX.  Starting HA on the VMs
> .....
> 2014-07-25 05:03:49,655 DEBUG [c.c.h.Status] (ClusteredAgentManager Timer:ctx-0e00979c) Transition:[Resource state = Enabled, Agent event = AgentDisconnected, Host id = 505, name = mccpvmXX]
> cs marks host down: 2014-07-25  05:03:31,920
> cs marks host up:     2014-07-25  05:03:49,655
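
Plugging the timestamps from the log into the arithmetic (a hypothetical worked example, assuming the default 60-second xen.heartbeat.interval):

    import java.time.Duration;
    import java.time.LocalDateTime;

    // Hypothetical worked example using the timestamps from the log above.
    public class TimelineCheck {
        public static void main(String[] args) {
            LocalDateTime markedDown = LocalDateTime.parse("2014-07-25T05:03:31.920");
            LocalDateTime markedUp   = LocalDateTime.parse("2014-07-25T05:03:49.655");
            long outageSec = Duration.between(markedDown, markedUp).getSeconds(); // ~18 s
            long heartbeatSec = 60; // default xen.heartbeat.interval
            // 18 s < 60 s: the host cannot have been fenced yet, so HA should
            // not have started; instead the VMs ended up running twice.
            System.out.printf("outage %ds, heartbeat %ds, safe to start HA: %b%n",
                    outageSec, heartbeatSec, outageSec >= heartbeatSec);
        }
    }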


