Posted to users@cloudstack.apache.org by Nik Martin <ni...@nfinausa.com> on 2013/01/16 16:12:32 UTC

reconnecting to host in alert state - cloud cocked up

OK, this is a new thread centered on a serious problem in my 3.02
CloudStack (CS) cloud, running XenServer 6.02 hosts. Here is what has
transpired so far:
1: user reports console proxy not available
2: confirm console proxy not available, issue reboot via cloudstack UI
3: CS reports VM booted ok, still unavailable
4: tried to migrate to different host, VM stuck in migrating state
5: Log in to host, list_domains command does not show the VM, but shows
a domain in this state:
117 | deadbeef-dead-beef-dead-beef00000075 | DS
which is a strong sign that the VM is badly hung.
6: attempt to destroy domain according to Citrix Support article:
/opt/xensource/debug/destroy_domain -domid 117
7: command hangs
8: I then restart the xe API toolstack, and it appears to restart fine.
I should note that ALL VMs are on this host via the "first_fit" VM
provisioning algorithm
9: I attempt to start migrating VMs to two other available hosts in 
preparation for a hard reboot of host
10: migrating VMs fails, and host is now in alert state in CS, and CS 
log states that host is unavailable. Force reconnect fails.
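
For step 5, here is a minimal Python sketch of how list_domains output could be scanned for stuck domains. It assumes the three-column "id | uuid | state" layout shown above. It treats a UUID starting with "deadbeef" or a state containing "D" as suspicious; the meaning of the state letters is an assumption here, not taken from Citrix documentation. The UUID on the healthy row is made up for illustration.

```python
from typing import List, NamedTuple


class Domain(NamedTuple):
    domid: int
    uuid: str
    state: str


# Sample text in the layout from step 5. The first data row is hypothetical.
SAMPLE = """\
id | uuid | state
0 | 11111111-2222-3333-4444-555555555555 | R
117 | deadbeef-dead-beef-dead-beef00000075 | DS
"""


def parse_list_domains(text: str) -> List[Domain]:
    """Parse 'id | uuid | state' rows, skipping headers and malformed lines."""
    domains = []
    for line in text.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) != 3 or not parts[0].isdigit():
            continue
        domains.append(Domain(int(parts[0]), parts[1], parts[2]))
    return domains


def suspect_domains(domains: List[Domain]) -> List[Domain]:
    """Return domains that look stuck.

    Assumption: a 'deadbeef' UUID prefix (as in the report above) or a 'D'
    flag in the state column marks a domain that is being torn down.
    """
    return [d for d in domains if d.uuid.startswith("deadbeef") or "D" in d.state]


if __name__ == "__main__":
    for dom in suspect_domains(parse_list_domains(SAMPLE)):
        print(f"domain {dom.domid} looks stuck: {dom.uuid} ({dom.state})")
```

On a real host the text would come from the list_domains command rather than from SAMPLE.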

So here I am, in a production environment, facing exactly the scenario
that cloud-based computing is supposed to address, and the platform is
itself the root cause of the problem it is meant to prevent.

Do I have any other options to prevent downtime? I have exhausted
everything I know to do. I have already scheduled a maintenance window
and fudged the truth to my customers, stating that there should be no
downtime during this window, which I have zero faith will actually be
true.


-- 

Regards,

Nik

Nik Martin
nfina Technologies, Inc.
+1.251.243.0043 x1003
http://nfinausa.com
Relentless Reliability

Re: reconnecting to host in alert state - cloud cocked up

Posted by Dave Dunaway <da...@gmail.com>.

I had a somewhat similar case last week, where my VMware ESXi hypervisor
and vCenter got disconnected. CloudStack refused to work with the host for
controlling running VMs on it (it could talk to vCenter just fine, but any
communication with the ESXi host would result in a timeout to that host).
The host went into the 'Alert' state in CloudStack and we could not do
much of anything with it.

What we did was cheat a bit and set the host as OK in the cloud.hosts DB
table. Then we could do things like put the host into maintenance mode. At
that point CloudStack started to shut down machines on my ESXi host (it was
the only one in the cluster, so we sort of expected that behavior), so
CloudStack could obviously talk to vCenter and the ESXi host and interact
with VMs to do this... so why the timeouts before?
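
A rough Python sketch of that kind of status flip, for illustration only. The table name `host`, the `status` column, and the values 'Alert' and 'Up' are assumptions (check the actual schema and take a backup before touching anything). An in-memory SQLite database stands in for the real CloudStack database, and the host names are made up. The update is guarded so that it only changes a host that is currently in Alert.

```python
import sqlite3


def demo_db() -> sqlite3.Connection:
    """Build a throwaway stand-in table; names and rows are hypothetical."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE host (id INTEGER PRIMARY KEY, name TEXT, status TEXT)")
    conn.executemany(
        "INSERT INTO host VALUES (?, ?, ?)",
        [(1, "esxi-a", "Alert"), (2, "esxi-b", "Up")],
    )
    conn.commit()
    return conn


def mark_host_up(conn: sqlite3.Connection, host_id: int) -> bool:
    """Flip one host from 'Alert' to 'Up'; return True only if a row changed."""
    cur = conn.execute(
        # The status guard keeps hosts in any other state untouched.
        "UPDATE host SET status = 'Up' WHERE id = ? AND status = 'Alert'",
        (host_id,),
    )
    conn.commit()
    return cur.rowcount == 1


if __name__ == "__main__":
    db = demo_db()
    print(mark_host_up(db, 1))  # host 1 was in Alert
    print(db.execute("SELECT id, name, status FROM host").fetchall())
```

Checking the returned flag makes it obvious when the host was not in the state you expected, which is safer than an unconditional update.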

While your case is different in a few ways, what I would like to bring
forward is that when hosts do 'disconnect' in CloudStack, CloudStack itself
does not seem to handle the recovery of the host gracefully.

It's always a struggle to recover the host, and in production environments
(this happened to a test environment in my case) it is totally
unacceptable for CloudStack to sit there doing nothing, with VMs in limbo,
rather than recover the host.

I would suggest that these CloudStack-to-hypervisor failure states be
tested further and made more resilient.
