Posted to users@cloudstack.apache.org by Indra Pramana <in...@sg.or.id> on 2013/07/24 17:24:20 UTC

HA not working - CloudStack 4.1.0 and KVM hypervisor hosts

Dear all,

I tried to shut down one of my hypervisor hosts to simulate a server
failure, and HA is not working: none of the VMs on the affected host are
started on another available host.

I am using CloudStack 4.1.0 with KVM hypervisors and Ceph RBD for primary
storage.

My issue is similar to what is being described here:

https://issues.apache.org/jira/browse/CLOUDSTACK-3535

Except that in my case, the host is indeed marked as "Disconnected", but
CloudStack makes no attempt to start the VMs on another host. I can't
provide logs, since nothing in the logs suggests that CloudStack tries to
activate HA and start the affected VMs on another host.
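
To check whether HA was ever attempted, the management server log can be filtered for entries from the HA components. The following is a minimal sketch, assuming the log line layout seen in the excerpts elsewhere in this thread (timestamp, level, component in square brackets, thread in parentheses). The function name `ha_entries` and the `cloud.ha.` prefix filter are illustrative choices, not part of CloudStack.

```python
import re

# Layout assumed from the log excerpts quoted in this thread, e.g.:
# 2013-07-24 23:04:57,955 INFO  [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-1:work-307) message
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"\[(?P<component>[^\]]+)\]\s+"
    r"\((?P<thread>[^)]*)\)\s*"
    r"(?P<message>.*)$"
)


def ha_entries(lines, prefix="cloud.ha."):
    """Yield (timestamp, level, component, message) for HA-related log lines.

    `lines` is any iterable of strings, such as an open log file.
    Lines that do not match the assumed layout are skipped.
    """
    for line in lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        if match and match.group("component").startswith(prefix):
            yield (
                match.group("ts"),
                match.group("level"),
                match.group("component"),
                match.group("message"),
            )
```

Passing an open log file, for example `ha_entries(open(path))` with `path` pointing at wherever the management server writes its log, yields only the HA-related entries. An empty result over the outage window would be consistent with HA never having been scheduled.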

Has anyone had a similar experience? Does anyone know whether the above bug
has been resolved?

Looking forward to your reply, thank you.

Cheers.

Re: HA not working - CloudStack 4.1.0 and KVM hypervisor hosts

Posted by Chip Childers <ch...@sungard.com>.

Nothing, looks good.

(And thanks for opening it Paul)


On Wed, Jul 24, 2013 at 10:27 PM, Bryan Whitehead <dr...@megahappy.net> wrote:

> CLOUDSTACK-3535 bug looks like it is describing the problem perfectly.
> What else can we add?
>
> On Wed, Jul 24, 2013 at 7:20 PM, Chip Childers
> <ch...@sungard.com> wrote:
> > This sucks.
> >
> > Can one of the folks on this thread please open a bug with as much
> > information as possible?  I'd like to make sure that someone picks up the
> > issue and gets it resolved for the next release.
> >
> >
> >
> > On Wed, Jul 24, 2013 at 7:26 PM, Bryan Whitehead <driver@megahappy.net> wrote:
> >
> >> This same thing happened to me - but it was a Power-Supply that died
> >> on a box. All my templates have HA turned on.
> >>
> >> All the VM's (including 1 system-router-vm) were shown as "Running"
> >> and the host itself was simply marked "Disconnected". When I tried to
> >> shutdown the VM's to start them again I got errors about not being
> >> able to communicate with the agent. I tried restarting the management
> >> server but that didn't change anything.
> >>
> >> Getting the router working again was extremely annoying. After
> >> changing it to Stopped it kept trying to start it again on the dead
> >> host. I marked it destroyed then restarted the network with the force
> >> option. That fixed it. After I hacked the DB to change every VM that
> >> was marked Running but was not actually running to Stopped, I was
> >> able to start all the VMs that were down on the bad host.
> >>
> >> Anyway, The time between host death and me finding out was about 4
> >> days - as these were on managed servers of a customer and their
> >> monitoring of each host wasn't working. They were pretty unhappy. :(
> >>
> >> Other notes: this is KVM with sharedmountpoint on a gluster mount.
> >> After the host got back online gluster rsynced about 200GB of data - I
> >> migrated VMs to the host at the same time as normal. I've had
> >> similar things happen with a 3.0.2 install of cloudstack and everything
> >> seamlessly restarted. Disappointing that this happened with 4.1.
> >>
> >> On Wed, Jul 24, 2013 at 9:23 AM, Indra Pramana <in...@sg.or.id> wrote:
> >> > Dear Chip, Geoff and all,
> >> >
> >> > I scrutinized the management server's logs during the time when I
> >> > shut down the host and the time when I turned the host back on.
> >> >
> >> > This is the management server's logs when the host is being shut down:
> >> >
> >> > http://pastebin.com/4wfV830Z
> >> >
> >> > During that time, I noted quite a lot of "Sending Disconnect to
> >> > listener" messages, which implies that the management server tries
> >> > to notify other listeners that the host is going down. However, I
> >> > subsequently didn't see any messages in the logs showing that the
> >> > management server is trying to activate the HA capability to start
> >> > the affected VMs on another available host.
> >> >
> >> > This is the management server's logs when the host is being turned
> >> > back on:
> >> >
> >> > http://pastebin.com/JrLJxbXH
> >> >
> >> > When the agent reconnected, CloudStack marked the affected VMs as
> >> > Stopped (previously Running):
> >> >
> >> > ===
> >> > 2013-07-24 23:04:57,406 DEBUG [cloud.vm.VirtualMachineManagerImpl]
> >> > (AgentConnectTaskPool-7:null) Found 5 VMs for host 34
> >> > 2013-07-24 23:04:57,408 DEBUG [cloud.vm.VirtualMachineManagerImpl]
> >> > (AgentConnectTaskPool-7:null) VM i-2-273-VM: cs state = Running and
> >> > realState = Stopped
> >> > 2013-07-24 23:04:57,408 DEBUG [cloud.vm.VirtualMachineManagerImpl]
> >> > (AgentConnectTaskPool-7:null) VM i-2-273-VM: cs state = Running and
> >> > realState = Stopped
> >> > 2013-07-24 23:04:57,408 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
> >> > (AgentConnectTaskPool-7:null) VM does not require investigation so I'm
> >> > marking it as Stopped: VM[User|Ubuntu-12-04-2-64bit]
> >> > 2013-07-24 23:04:57,450 DEBUG [cloud.capacity.CapacityManagerImpl]
> >> > (AgentConnectTaskPool-7:null) VM state transitted from :Running to
> >> > Stopping with event: StopRequestedvm's original host id: 28 new host
> >> > id: 34 host id before state transition: 34
> >> > ===
> >> >
> >> > Then the HA starts to kick in.
> >> >
> >> > ===
> >> > 2013-07-24 23:04:57,955 INFO  [cloud.ha.HighAvailabilityManagerImpl]
> >> > (HA-Worker-1:work-307) Processing HAWork[307-HA-273-Stopped-Scheduled]
> >> > 2013-07-24 23:04:57,956 DEBUG [cloud.capacity.CapacityManagerImpl]
> >> > (AgentConnectTaskPool-7:null) VM state transitted from :Running to
> >> > Stopping with event: StopRequestedvm's original host id: 28 new host
> >> > id: 34 host id before state transition: 34
> >> > 2013-07-24 23:04:57,960 DEBUG [agent.transport.Request]
> >> > (AgentConnectTaskPool-7:null) Seq 34-105644038: Sending  { Cmd ,
> >> > MgmtId: 161342671900, via: 34, Ver: v1, Flags: 100111,
> >> > [{"StopCommand":{"isProxy":false,"vmName":"i-2-281-VM","wait":0}}] }
> >> > 2013-07-24 23:04:57,968 INFO  [cloud.ha.HighAvailabilityManagerImpl]
> >> > (HA-Worker-1:work-307) HA on VM[User|Ubuntu-12-04-2-64bit]
> >> > 2013-07-24 23:04:57,984 DEBUG [cloud.capacity.CapacityManagerImpl]
> >> > (HA-Worker-1:work-307) VM state transitted from :Stopped to Starting
> >> > with event: StartRequestedvm's original host id: 28 new host id: null
> >> > host id before state transition: null
> >> > 2013-07-24 23:04:57,984 DEBUG [cloud.vm.VirtualMachineManagerImpl]
> >> > (HA-Worker-1:work-307) Successfully transitioned to start state for
> >> > VM[User|Ubuntu-12-04-2-64bit] reservation id =
> >> > b56364ef-90d8-443f-a348-7660fda48d34
> >> > 2013-07-24 23:04:58,025 DEBUG [cloud.vm.VirtualMachineManagerImpl]
> >> > (HA-Worker-1:work-307) Trying to deploy VM, vm has dcId: 6 and podId: 6
> >> > 2013-07-24 23:04:58,025 DEBUG [cloud.vm.VirtualMachineManagerImpl]
> >> > (HA-Worker-1:work-307) Deploy avoids pods: null, clusters: null,
> >> > hosts: null
> >> > 2013-07-24 23:04:58,031 DEBUG [cloud.vm.VirtualMachineManagerImpl]
> >> > (HA-Worker-1:work-307) Root volume is ready, need to place VM in
> >> > volume's cluster
> >> > 2013-07-24 23:04:58,031 DEBUG [cloud.vm.VirtualMachineManagerImpl]
> >> > (HA-Worker-1:work-307) Vol[295|vm=273|ROOT] is READY, changing
> >> > deployment plan to use this pool's dcId: 6 , podId: 6 , and clusterId: 6
> >> > ===
> >> >
> >> > My question is: why does HA only kick in when the host is turned back
> >> > on? By right it should kick in soon after the host is shut down and
> >> > marked as "Disconnected".
> >> >
> >> > Any insights on possible solutions to this problem are highly
> >> > appreciated.
> >> >
> >> > Looking forward to your reply, thank you.
> >> >
> >> > Cheers.
> >> >
> >> >
> >> >
> >> > On Thu, Jul 25, 2013 at 12:00 AM, Indra Pramana <in...@sg.or.id> wrote:
> >> >
> >> >> Hi Chip,
> >> >>
> >> >> Yes, "Offer HA" is set to "Yes" on all my compute offerings.
> >> >>
> >> >> Hi Geoff,
> >> >>
> >> >> Yes, I am using KVM. Is this a known issue and is there any solution
> >> >> to this problem?
> >> >>
> >> >> Looking forward to your reply, thank you.
> >> >>
> >> >> Cheers.
> >> >>
> >> >>
> >> >>
> >> >> On Wed, Jul 24, 2013 at 11:38 PM, Geoff Higginbottom <
> >> >> geoff.higginbottom@shapeblue.com> wrote:
> >> >>
> >> >>> Is it running on KVM? We are seeing some real issues with HA simply
> >> >>> not working on KVM.
> >> >>>
> >> >>> Regards
> >> >>>
> >> >>> Geoff Higginbottom
> >> >>>
> >> >>> D: +44 20 3603 0542 | S: +44 20 3603 0540 | M: +447968161581
> >> >>>
> >> >>> geoff.higginbottom@shapeblue.com
> >> >>>
> >> >>> -----Original Message-----
> >> >>> From: Chip Childers [mailto:chip.childers@sungard.com]
> >> >>> Sent: 24 July 2013 16:37
> >> >>> To: <us...@cloudstack.apache.org>
> >> >>> Subject: Re: HA not working - CloudStack 4.1.0 and KVM hypervisor hosts
> >> >>>
> >> >>> Did you enable HA for your compute offering?
> >> >>>

Re: HA not working - CloudStack 4.1.0 and KVM hypervisor hosts

Posted by Salvatore Sciacco <sc...@iperweb.com>.

Is there a workaround or database update to declare a host dead so that HA
operations can be triggered?




2013/7/25 Lennert den Teuling <le...@pcextreme.nl>

> Op 25-07-13 07:48, Bryan Whitehead schreef:
>
>> Starting off, there is never going to be a way to "conclusively"
>> decide if a host is down. This is just the nature of complex systems.
>> We can only hope our software does "well" - and if "well" is "wrong" -
>> we have a way to clean up the mess created.
>>
>> That said, I like the old behavior 3.0.x has. As I mentioned in -3535
>> I've had a host lose its network (e1000 oops in kernel) and HA got
>> triggered. The storage (in this case gluster using a sharedmountpount)
>> wouldn't let qemu-kvm start on another host because the underlying
>> qcow2 file was locked by an already running qemu-kvm process (on the
>> machine that lost network). So HA being triggered didn't ruin any VM
>> disks. Gluster was running on Infiniband so the shared storage with
>> working locks prevented HA from screwing things up.
>>
>> Further, even if gluster lost connectivity, gluster itself would
>> split-brain and later I could decide which qcow2/disk image should be
>> "truth". Do I keep the VM that kept on running? Or do I keep the
>> version HA booted and fscked? That's for me - the user - to decide.
>>
>> As a cloudstack admin/user I understand the risks of HA and I choose
>> to live with them - I've even made sure that should such a disaster
>> happen I can recover (gluster will split brain as well). The #1 reason
>> for choosing HA is I want the VM to be available as much as possible.
>>
>> Right now 4.1 DOES NOT have HA... I don't know how "emailing the admin
>> to figure out what to do" is being entertained as an option. That's
>> just nonsense and is NOT HIGH AVAILABILITY. IMHO If one is so
>> terrified of HA screwing up they should probably pass on HA and
>> manually start things up.
>>
>> When a simple reproducible test like pulling the plug on a host can't
>> trigger an HA event - then that feature doesn't exist. It is simple as
>> that.
>>
>
> I would like to add that when testing this on our development cluster,
> something bizarre happened:
>
> First, when I killed the VMs _and_ the agent on the host, HA worked
> just fine: after 10 minutes everything was restarted on a working host.
>
> The second time, I turned off the host and nothing happened:
>
> 2013-07-25 15:31:41,347 DEBUG [cloud.ha.AbstractInvestigatorImpl]
> (AgentTaskPool-3:null) host (192.168.122.32) cannot be pinged, returning
> null ('I don't know')
> 2013-07-25 15:31:41,348 DEBUG [cloud.ha.UserVmDomRInvestigator]
> (AgentTaskPool-3:null) could not reach agent, could not reach agent's host,
> returning that we don't have enough information
> 2013-07-25 15:31:41,348 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
> (AgentTaskPool-3:null) null unable to determine the state of the host.
> Moving on.
> 2013-07-25 15:31:41,348 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
> (AgentTaskPool-3:null) null unable to determine the state of the host.
> Moving on.
> 2013-07-25 15:31:41,349 WARN  [agent.manager.AgentManagerImpl]
> (AgentTaskPool-3:null) Agent state cannot be determined, do nothing
>
> So when the host is still pingable it's "OK" to do a HA, but when it is
> totally unreachable it's not?
>
> My third try was even worse. I killed the agent, forgot to kill the VMs
> and the management server restarted the VMs on another host and it seems
> that all images are corrupted.
>
> 2013-07-25 15:37:31,614 DEBUG [agent.manager.AgentManagerImpl]
> (HA-Worker-2:work-29) Details from executing class
> com.cloud.agent.api.PingTestCommand:
> PING 192.168.122.170 (192.168.122.170): 56 data bytes
> 64 bytes from 192.168.122.161: Destination Host Unreachable
> Vr HL TOS  Len   ID Flg  off TTL Pro  cks      Src      Dst Data
> 4  5  00 5400 0000 0 0040  40  01 0cc4 192.168.122.161  192.168.122.170
> --- 192.168.122.170 ping statistics ---
> 1 packets transmitted, 0 packets received, 100% packet loss
> Unable to ping the vm, exiting
> 2013-07-25 15:37:31,614 DEBUG [cloud.ha.UserVmDomRInvestigator]
> (HA-Worker-2:work-29) VM[User|c88924e9-a8c9-4705-acc8-3237ffcf009d]
> could not be pinged, returning that it is unknown
> Ping is disabled by default if you use security groups, so a ping test is
> not reliable.
>
> Concluding that a VM is down from a simple ping test is not the right
> option when, for example, you use security groups. (It's even dangerous.)
>
> I will do some more tests, but if it's true that my last HA was based on a
> failed ping I will need to turn ping on on all my production instances asap.
>
> I do agree with Bryan that HA needs to go automatically without
> intervention of a sysadmin.
>
> I think you could base a HA operation on:
> - An unreachable agent
> - Unpingable host
> - A file with a timestamp on the network storage which updates every X
> seconds, when it's not updated, something is wrong.
>
> Ideally the management server would turn off the host using IPMI to make
> sure it's dead; then you are sure no corruption will happen.
>
>
>> On Wed, Jul 24, 2013 at 9:31 PM, Koushik Das <ko...@citrix.com> wrote:
>>
>>> There is another bug for the same. CLOUDSTACK-3421
>>> This document nicely explains how HA works in Cloudstack:
>>> https://cwiki.apache.org/confluence/display/CLOUDSTACK/High+Availability+Developer's+Guide
>>>
>>> As can be seen from the logs in this case, Cloudstack is not able to
>>> conclusively determine if the host is 'down' and so does nothing. Suppose
>>> HA was done for the VMs in this case and later on the host came back up.
>>> This will corrupt the VM disks which is not desirable.
>>>
>>> Possible options:
>>> - If host state cannot be determined conclusively for some configurable
>>> time then the host may be put into some special state and then admin can
>>> take appropriate action by manually triggering HA
>>> - If KVM cluster has the concept of something like a 'master' from which
>>> the state of any host in the cluster can be determined. Something similar
>>> is there for XS.
>>>
>>> Thoughts?
>>>
>>>

Re: HA not working - CloudStack 4.1.0 and KVM hypervisor hosts

Posted by Lennert den Teuling <le...@pcextreme.nl>.

Op 25-07-13 07:48, Bryan Whitehead schreef:
> Starting off, there is never going to be a way to "conclusively"
> decide if a host is down. This is just the nature of complex systems.
> We can only hope our software does "well" - and if "well" is "wrong" -
> we have a way to clean up the mess created.
>
> That said, I like the old behavior 3.0.x has. As I mentioned in -3535
> I've had a host lose its network (e1000 oops in kernel) and HA got
> triggered. The storage (in this case gluster using a sharedmountpount)
> wouldn't let qemu-kvm start on another host because the underlying
> qcow2 file was locked by an already running qemu-kvm process (on the
> machine that lost network). So HA being triggered didn't ruin any VM
> disks. Gluster was running on Infiniband so the shared storage with
> working locks prevented HA from screwing things up.
>
> Further, even if gluster lost connectivity, gluster itself would
> split-brain and later I could decide which qcow2/disk image should be
> "truth". Do I keep the VM that kept on running? Or do I keep the
> version HA booted and fscked? That's for me - the user - to decide.
>
> As a cloudstack admin/user I understand the risks of HA and I choose
> to live with them - I've even made sure that should such a disaster
> happen I can recover (gluster will split brain as well). The #1 reason
> for choosing HA is I want the VM to be available as much as possible.
>
> Right now 4.1 DOES NOT have HA... I don't know how "emailing the admin
> to figure out what to do" is being entertained as an option. That's
> just nonsense and is NOT HIGH AVAILABILITY. IMHO If one is so
> terrified of HA screwing up they should probably pass on HA and
> manually start things up.
>
> When a simple reproducible test like pulling the plug on a host can't
> trigger an HA event - then that feature doesn't exist. It is simple as
> that.

I would like to add that when testing this on our development cluster,
something bizarre happened:

First, when I killed the VMs _and_ the agent on the host, HA worked
just fine: after 10 minutes everything was restarted on a working host.

The second time, I turned off the host and nothing happened:

2013-07-25 15:31:41,347 DEBUG [cloud.ha.AbstractInvestigatorImpl] 
(AgentTaskPool-3:null) host (192.168.122.32) cannot be pinged, returning 
null ('I don't know')
2013-07-25 15:31:41,348 DEBUG [cloud.ha.UserVmDomRInvestigator] 
(AgentTaskPool-3:null) could not reach agent, could not reach agent's 
host, returning that we don't have enough information
2013-07-25 15:31:41,348 DEBUG [cloud.ha.HighAvailabilityManagerImpl] 
(AgentTaskPool-3:null) null unable to determine the state of the host. 
Moving on.
2013-07-25 15:31:41,348 DEBUG [cloud.ha.HighAvailabilityManagerImpl] 
(AgentTaskPool-3:null) null unable to determine the state of the host. 
Moving on.
2013-07-25 15:31:41,349 WARN  [agent.manager.AgentManagerImpl] 
(AgentTaskPool-3:null) Agent state cannot be determined, do nothing

So when the host is still pingable it's "OK" to do a HA, but when it is 
totally unreachable it's not?
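
The log lines above suggest a three-valued check: each investigator answers up, down, or "I don't know", and the manager acts only on a conclusive answer. A simplified model of that host-level behaviour, not CloudStack's actual code, could look like the sketch below. All names in it (`HostStatus`, `investigate`, `ping_investigator`, `should_trigger_ha`) are hypothetical.

```python
from enum import Enum
from typing import Callable, Iterable, Optional


class HostStatus(Enum):
    UP = "Up"
    DOWN = "Down"


# An investigator returns UP, DOWN, or None, where None means "I don't know".
Investigator = Callable[[str], Optional[HostStatus]]


def investigate(host: str, investigators: Iterable[Investigator]) -> Optional[HostStatus]:
    """Return the first conclusive answer, or None if no investigator can tell."""
    for check in investigators:
        status = check(host)
        if status is not None:
            return status
    return None


def should_trigger_ha(status: Optional[HostStatus]) -> bool:
    """In this model, HA is scheduled only on a conclusive DOWN."""
    return status is HostStatus.DOWN


def ping_investigator(ping: Callable[[str], bool]) -> Investigator:
    """Build an investigator from a ping function (hypothetical helper)."""
    def check(host: str) -> Optional[HostStatus]:
        # A successful ping shows the host is up; a failed ping proves nothing.
        return HostStatus.UP if ping(host) else None
    return check


# A host that cannot be pinged yields "unknown", so nothing happens.
unreachable = ping_investigator(lambda host: False)
status = investigate("192.168.122.32", [unreachable])
print(status, should_trigger_ha(status))  # prints: None False
```

Under this model a powered-off host never produces a conclusive DOWN, which matches the "Agent state cannot be determined, do nothing" line: the missing piece is an investigator that can positively confirm death.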

My third try was even worse. I killed the agent, forgot to kill the VMs 
and the management server restarted the VMs on another host and it seems 
that all images are corrupted.

2013-07-25 15:37:31,614 DEBUG [agent.manager.AgentManagerImpl]
(HA-Worker-2:work-29) Details from executing class
com.cloud.agent.api.PingTestCommand:
PING 192.168.122.170 (192.168.122.170): 56 data bytes
64 bytes from 192.168.122.161: Destination Host Unreachable
Vr HL TOS  Len   ID Flg  off TTL Pro  cks      Src      Dst Data
4  5  00 5400 0000 0 0040  40  01 0cc4 192.168.122.161  192.168.122.170
--- 192.168.122.170 ping statistics ---
1 packets transmitted, 0 packets received, 100% packet loss
Unable to ping the vm, exiting
2013-07-25 15:37:31,614 DEBUG [cloud.ha.UserVmDomRInvestigator] 
(HA-Worker-2:work-29) VM[User|c88924e9-a8c9-4705-acc8-3237ffcf009d] 
could not be pinged, returning that it is unknown

Ping is disabled by default if you use security groups, so a ping test 
is not reliable.

Concluding that a VM is down from a simple ping test is therefore not 
the right option when you use security groups, for example. (It's even 
dangerous.)

I will do some more tests, but if it's true that my last HA was based on 
a failed ping, I will need to enable ping on all my production instances 
ASAP.

I do agree with Bryan that HA needs to go automatically without 
intervention of a sysadmin.

I think you could base an HA operation on:
- An unreachable agent
- An unpingable host
- A file with a timestamp on the network storage which updates every X 
seconds; when it's not updated, something is wrong.

Ideally the management server would turn off the host using IPMI to make 
sure it's dead; then you are sure no corruption will happen.
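
As a rough sketch of that idea, assuming nothing about how CloudStack is implemented: the three signals are combined, and VMs are restarted elsewhere only once a fencing step reports success. `HostSignals`, `host_looks_dead`, `may_restart_vms`, the 60 second default and the `fence` callable (which would wrap something like an IPMI power-off) are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class HostSignals:
    agent_reachable: bool             # can the management server reach the agent?
    host_pingable: bool               # does the host answer ping?
    heartbeat_mtime: Optional[float]  # last heartbeat write (epoch seconds), None if never seen


def heartbeat_stale(mtime: Optional[float], now: float, max_age: float) -> bool:
    """A missing or old heartbeat counts as stale."""
    return mtime is None or (now - mtime) > max_age


def host_looks_dead(s: HostSignals, now: float, max_age: float = 60.0) -> bool:
    """Require all three signals to agree before treating the host as dead."""
    return (not s.agent_reachable
            and not s.host_pingable
            and heartbeat_stale(s.heartbeat_mtime, now, max_age))


def may_restart_vms(s: HostSignals, now: float, fence: Callable[[], bool]) -> bool:
    """Restart VMs elsewhere only if the host looks dead AND fencing succeeded."""
    if not host_looks_dead(s, now):
        return False
    return fence()
```

The point of passing `fence` in is that a failed power-off blocks the restart, so a host that is merely cut off from the network cannot end up running the same VMs twice.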

> On Wed, Jul 24, 2013 at 9:31 PM, Koushik Das <ko...@citrix.com> wrote:
>> There is another bug for the same. CLOUDSTACK-3421
>> This document nicely explains how HA works in Cloudstack https://cwiki.apache.org/confluence/display/CLOUDSTACK/High+Availability+Developer's+Guide.
>>
>> As can be seen from the logs in this case, Cloudstack is not able to conclusively determine if the host is 'down' and so does nothing. Suppose HA was done for the VMs in this case and later on the host came back up. This will corrupt the VM disks which is not desirable.
>>
>> Possible options:
>> - If host state cannot be determined conclusively for some configurable time then the host may be put into some special state and then admin can take appropriate action by manually triggering HA
>> - If KVM cluster has the concept of something like a 'master' from which the state of any host in the cluster can be determined. Something similar is there for XS.
>>
>> Thoughts?

Re: HA not working - CloudStack 4.1.0 and KVM hypervisor hosts

Posted by Bryan Whitehead <dr...@megahappy.net>.

Starting off, there is never going to be a way to "conclusively"
decide if a host is down. This is just the nature of complex systems.
We can only hope our software does "well" - and if "well" is "wrong" -
we have a way to clean up the mess created.

That said, I like the old behavior 3.0.x has. As I mentioned in -3535
I've had a host lose its network (e1000 oops in kernel) and HA got
triggered. The storage (in this case gluster using a sharedmountpoint)
wouldn't let qemu-kvm start on another host because the underlying
qcow2 file was locked by an already running qemu-kvm process (on the
machine that lost network). So HA being triggered didn't ruin any VM
disks. Gluster was running on Infiniband so the shared storage with
working locks prevented HA from screwing things up.
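
To illustrate the general idea of a lock refusing a second writer (this is only a sketch using a local advisory `flock` on Linux, not a description of how qemu-kvm or gluster actually lock a qcow2 file), the helper names `try_exclusive_lock` and `demo` are made up for the example:

```python
import fcntl
import os
import tempfile
from typing import Optional, Tuple


def try_exclusive_lock(path: str) -> Optional[int]:
    """Open path and try to take an exclusive, non-blocking advisory lock.

    Returns the open file descriptor on success, or None if the file is
    already locked through another open file description.
    """
    fd = os.open(path, os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd


def demo() -> Tuple[bool, bool, bool]:
    """First holder wins, second is refused, a later attempt works after release."""
    with tempfile.NamedTemporaryFile() as image:
        first = try_exclusive_lock(image.name)    # the "already running" holder
        second = try_exclusive_lock(image.name)   # a would-be second starter
        os.close(first)                           # holder goes away, lock is released
        third = try_exclusive_lock(image.name)
        result = (first is not None, second is not None, third is not None)
        if third is not None:
            os.close(third)
        return result
```

The second attempt is refused while the first descriptor is open, which is the property that kept HA from starting a second copy of the VM on the same disk image.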

Further, even if gluster lost connectivity, gluster itself would
split-brain and later I could decide which qcow2/disk image should be
"truth". Do I keep the VM that kept on running? Or do I keep the
version HA booted and fscked? That's for me - the user - to decide.

As a cloudstack admin/user I understand the risks of HA and I choose
to live with them - I've even made sure that should such a disaster
happen I can recover (gluster will split brain as well). The #1 reason
for choosing HA is I want the VM to be available as much as possible.

Right now 4.1 DOES NOT have HA... I don't know how "emailing the admin
to figure out what to do" is being entertained as an option. That's
just nonsense and is NOT HIGH AVAILABILITY. IMHO If one is so
terrified of HA screwing up they should probably pass on HA and
manually start things up.

When a simple reproducible test like pulling the plug on a host can't
trigger an HA event - then that feature doesn't exist. It is as simple
as that.

On Wed, Jul 24, 2013 at 9:31 PM, Koushik Das <ko...@citrix.com> wrote:
> There is another bug for the same. CLOUDSTACK-3421
> This document nicely explains how HA works in Cloudstack https://cwiki.apache.org/confluence/display/CLOUDSTACK/High+Availability+Developer's+Guide.
>
> As can be seen from the logs in this case, Cloudstack is not able to conclusively determine if the host is 'down' and so does nothing. Suppose HA was done for the VMs in this case and later on the host came back up. This will corrupt the VM disks which is not desirable.
>
> Possible options:
> - If host state cannot be determined conclusively for some configurable time then the host may be put into some special state and then admin can take appropriate action by manually triggering HA
> - If KVM cluster has the concept of something like a 'master' from which the state of any host in the cluster can be determined. Something similar is there for XS.
>
> Thoughts?

RE: HA not working - CloudStack 4.1.0 and KVM hypervisor hosts

Posted by Koushik Das <ko...@citrix.com>.

There is another bug for the same issue: CLOUDSTACK-3421.
This document nicely explains how HA works in CloudStack: https://cwiki.apache.org/confluence/display/CLOUDSTACK/High+Availability+Developer's+Guide

As can be seen from the logs in this case, CloudStack is not able to conclusively determine if the host is 'down' and so does nothing. Suppose HA had been triggered for the VMs in this case and the host later came back up: that would corrupt the VM disks, which is not desirable.

Possible options:
- If the host state cannot be determined conclusively for some configurable time, the host could be put into a special state, and the admin could then take appropriate action by manually triggering HA.
- If a KVM cluster had the concept of something like a 'master' from which the state of any host in the cluster can be determined, that could be used instead. Something similar exists for XS.
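
A minimal sketch of the first option, with made-up names: `next_host_state` and the "NeedsAdmin" state are hypothetical and do not exist in CloudStack, and the timeout stands for the configurable time mentioned above.

```python
from typing import Optional


def next_host_state(verdict: Optional[bool],
                    unknown_since: Optional[float],
                    now: float,
                    timeout: float) -> str:
    """Map an investigation verdict to a host state.

    verdict: True = up, False = down, None = could not be determined.
    unknown_since: when the state first became undeterminable (epoch seconds).
    """
    if verdict is True:
        return "Up"
    if verdict is False:
        return "Down"
    # Undetermined: escalate to the admin once the configurable time has passed.
    if unknown_since is not None and now - unknown_since >= timeout:
        return "NeedsAdmin"  # hypothetical special state; admin may trigger HA manually
    return "Disconnected"
```

For example, with a 300 second timeout a host that has been undeterminable for 10 minutes would move from "Disconnected" to the special state instead of staying "Disconnected" indefinitely.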

Thoughts?


> -----Original Message-----
> From: Bryan Whitehead [mailto:driver@megahappy.net]
> Sent: Thursday, July 25, 2013 7:58 AM
> To: users@cloudstack.apache.org
> Subject: Re: HA not working - CloudStack 4.1.0 and KVM hypervisor hosts
> 
> CLOUDSTACK-3535 bug looks like it is describing the problem perfectly.
> What else can we add?
> 
> On Wed, Jul 24, 2013 at 7:20 PM, Chip Childers <ch...@sungard.com>
> wrote:
> > This sucks.
> >
> > Can one of the folks on this thread please open a bug with as much
> > information as possible?  I'd like to make sure that someone picks up
> > the issue and gets it resolved for the next release.
> >
> >
> >
> > On Wed, Jul 24, 2013 at 7:26 PM, Bryan Whitehead
> <dr...@megahappy.net>wrote:
> >
> >> This same thing happened to me - but it was a Power-Supply that died
> >> on a box. All my templates have HA turned on.
> >>
> >> All the VM's (including 1 system-router-vm) were shown as "Running"
> >> and the host itself was simply marked "Disconnected". When I tried to
> >> shutdown the VM's to start them again I got errors about not being
> >> able to communicate with the agent. I tried restarting the management
> >> server but that didn't change anything.
> >>
> >> Getting the router working again was extremely annoying. After
> >> changing it to Stopped it kept trying to start it again on the dead
> >> host. I marked it destroyed then restarted the network with the force
> >> option. That fixed it. After I hacked the DB to get all my VM's not
> >> running with state Running to Stopped, then I was able to start all
> >> the VM's that were down on the bad host.
> >>
> >> Anyway, The time between host death and me finding out was about 4
> >> days - as these were on managed servers of a customer and their
> >> monitoring of each host wasn't working. They were pretty unhappy. :(
> >>
> >> Other notes: this is KVM with sharedmountpoint on a gluster mount.
> >> After host got back online gluster rsynced about 200GB of data - I
> >> migrated VM's to the host at the same time as normal. I've had a
> >> similar things happen with 3.0.2 install of cloudstack and everything
> >> seamlessly restarted. Disappointing this happened with 4.1
> >>
> >> On Wed, Jul 24, 2013 at 9:23 AM, Indra Pramana <in...@sg.or.id> wrote:
> >> > Dear Chip, Geoff and all,
> >> >
> >> > I scrutinized the management server's logs during the time when I
> >> shutdown
> >> > the host and the time when I turned the host back on.
> >> >
> >> > This is the management server's logs when the host is being shut down:
> >> >
> >> > http://pastebin.com/4wfV830Z
> >> >
> >> > During the time, I noted that there are quite a lot of "Sending
> >> Disconnect
> >> > to listener" messages, which implies that the management server try
> >> > to notify other listeners that the host is going down. However,
> >> subsequently I
> >> > didn't see any messages on the logs showing that the management
> >> > server is trying to activate the HA capability to start the
> >> > affected VMs on another available host.
> >> >
> >> > This is the management server's logs when the host is being turned
> >> > back
> >> on:
> >> >
> >> > http://pastebin.com/JrLJxbXH
> >> >
> >> > When the agent is reconnected, then CloudStack marked the affected
> >> > VMs as stopped from previously running:
> >> >
> >> > ===
> >> > 2013-07-24 23:04:57,406 DEBUG [cloud.vm.VirtualMachineManagerImpl]
> >> > (AgentConnectTaskPool-7:null) Found 5 VMs for host 34
> >> > 2013-07-24 23:04:57,408 DEBUG [cloud.vm.VirtualMachineManagerImpl]
> >> > (AgentConnectTaskPool-7:null) VM i-2-273-VM: cs state = Running and
> >> > realState = Stopped
> >> > 2013-07-24 23:04:57,408 DEBUG [cloud.vm.VirtualMachineManagerImpl]
> >> > (AgentConnectTaskPool-7:null) VM i-2-273-VM: cs state = Running and
> >> > realState = Stopped
> >> > 2013-07-24 23:04:57,408 DEBUG
> >> > [cloud.ha.HighAvailabilityManagerImpl]
> >> > (AgentConnectTaskPool-7:null) VM does not require investigation so
> >> > I'm marking it as Stopped: VM[User|Ubuntu-12-04-2-64bit]
> >> > 2013-07-24 23:04:57,450 DEBUG [cloud.capacity.CapacityManagerImpl]
> >> > (AgentConnectTaskPool-7:null) VM state transitted from :Running to
> >> Stopping
> >> > with event: StopRequestedvm's original host id: 28 new host id: 34
> >> > host
> >> id
> >> > before state transition: 34
> >> > ===
> >> >
> >> > Then the HA starts to kick in.
> >> >
> >> > ===
> >> > 2013-07-24 23:04:57,955 INFO
> >> > [cloud.ha.HighAvailabilityManagerImpl]
> >> > (HA-Worker-1:work-307) Processing
> >> > HAWork[307-HA-273-Stopped-Scheduled]
> >> > 2013-07-24 23:04:57,956 DEBUG [cloud.capacity.CapacityManagerImpl]
> >> > (AgentConnectTaskPool-7:null) VM state transitted from :Running to Stopping
> >> > with event: StopRequestedvm's original host id: 28 new host id: 34 host id
> >> > before state transition: 34
> >> > 2013-07-24 23:04:57,960 DEBUG [agent.transport.Request]
> >> > (AgentConnectTaskPool-7:null) Seq 34-105644038: Sending  { Cmd , MgmtId:
> >> > 161342671900, via: 34, Ver: v1, Flags: 100111,
> >> > [{"StopCommand":{"isProxy":false,"vmName":"i-2-281-VM","wait":0}}] }
> >> > 2013-07-24 23:04:57,968 INFO
> >> > [cloud.ha.HighAvailabilityManagerImpl]
> >> > (HA-Worker-1:work-307) HA on VM[User|Ubuntu-12-04-2-64bit]
> >> > 2013-07-24 23:04:57,984 DEBUG [cloud.capacity.CapacityManagerImpl]
> >> > (HA-Worker-1:work-307) VM state transitted from :Stopped to
> >> > Starting with
> >> > event: StartRequestedvm's original host id: 28 new host id: null
> >> > host id before state transition: null
> >> > 2013-07-24 23:04:57,984 DEBUG [cloud.vm.VirtualMachineManagerImpl]
> >> > (HA-Worker-1:work-307) Successfully transitioned to start state for
> >> > VM[User|Ubuntu-12-04-2-64bit] reservation id =
> >> > b56364ef-90d8-443f-a348-7660fda48d34
> >> > 2013-07-24 23:04:58,025 DEBUG [cloud.vm.VirtualMachineManagerImpl]
> >> > (HA-Worker-1:work-307) Trying to deploy VM, vm has dcId: 6 and
> >> > podId: 6
> >> > 2013-07-24 23:04:58,025 DEBUG [cloud.vm.VirtualMachineManagerImpl]
> >> > (HA-Worker-1:work-307) Deploy avoids pods: null, clusters: null, hosts: null
> >> > 2013-07-24 23:04:58,031 DEBUG [cloud.vm.VirtualMachineManagerImpl]
> >> > (HA-Worker-1:work-307) Root volume is ready, need to place VM in
> >> > volume's cluster
> >> > 2013-07-24 23:04:58,031 DEBUG [cloud.vm.VirtualMachineManagerImpl]
> >> > (HA-Worker-1:work-307) Vol[295|vm=273|ROOT] is READY, changing
> >> > deployment plan to use this pool's dcId: 6 , podId: 6 , and
> >> > clusterId: 6
> >> > ===
> >> >
> >> > My question is why HA only kicks in when the host is turned back
> >> > on? By right it should kick in soon after the host is shut down and
> >> > marked as "Disconnected".
> >> >
> >> > Any insights on the possible solutions to this problem is highly
> >> > appreciated.
> >> >
> >> > Looking forward to your reply, thank you.
> >> >
> >> > Cheers.

Re: HA not working - CloudStack 4.1.0 and KVM hypervisor hosts

Posted by Bryan Whitehead <dr...@megahappy.net>.

CLOUDSTACK-3535 bug looks like it is describing the problem perfectly.
What else can we add?

On Wed, Jul 24, 2013 at 7:20 PM, Chip Childers
<ch...@sungard.com> wrote:
> This sucks.
>
> Can one of the folks on this thread please open a bug with as much
> information as possible?  I'd like to make sure that someone picks up the
> issue and gets it resolved for the next release.

Re: HA not working - CloudStack 4.1.0 and KVM hypervisor hosts

Posted by Chip Childers <ch...@sungard.com>.

This sucks.

Can one of the folks on this thread please open a bug with as much
information as possible?  I'd like to make sure that someone picks up the
issue and gets it resolved for the next release.



On Wed, Jul 24, 2013 at 7:26 PM, Bryan Whitehead <dr...@megahappy.net> wrote:

> This same thing happened to me - but it was a Power-Supply that died
> on a box. All my templates have HA turned on.
>
> All the VM's (including 1 system-router-vm) were shown as "Running"
> and the host itself was simply marked "Disconnected". When I tried to
> shutdown the VM's to start them again I got errors about not being
> able to communicate with the agent. I tried restarting the management
> server but that didn't change anything.
>
> Getting the router working again was extremely annoying. After
> changing it to Stopped it kept trying to start it again on the dead
> host. I marked it destroyed then restarted the network with the force
> option. That fixed it. After I hacked the DB to get all my VM's not
> running with state Running to Stopped, then I was able to start all
> the VM's that were down on the bad host.
>
> Anyway, The time between host death and me finding out was about 4
> days - as these were on managed servers of a customer and their
> monitoring of each host wasn't working. They were pretty unhappy. :(
>
> Other notes: this is KVM with sharedmountpoint on a gluster mount.
> After host got back online gluster rsynced about 200GB of data - I
> migrated VM's to the host at the same time as normal. I've had a
> similar things happen with 3.0.2 install of cloudstack and everything
> seamlessly restarted. Disappointing this happened with 4.1

Re: HA not working - CloudStack 4.1.0 and KVM hypervisor hosts

Posted by Bryan Whitehead <dr...@megahappy.net>.

This same thing happened to me, but in my case a power supply died
on a box. All my templates have HA turned on.

All the VMs (including 1 system-router-vm) were shown as "Running"
and the host itself was simply marked "Disconnected". When I tried to
shut down the VMs to start them again, I got errors about not being
able to communicate with the agent. I tried restarting the management
server but that didn't change anything.

Getting the router working again was extremely annoying. After I
changed it to Stopped, CloudStack kept trying to start it again on the
dead host. I marked it destroyed, then restarted the network with the
force option. That fixed it. After I hacked the DB to change all the
VMs that were not actually running from state Running to Stopped, I
was able to start all the VMs that were down on the bad host.

Anyway, the time between host death and me finding out was about 4
days, as these were on a customer's managed servers and their
monitoring of each host wasn't working. They were pretty unhappy. :(

Other notes: this is KVM with sharedmountpoint on a gluster mount.
After the host got back online, gluster rsynced about 200GB of data. I
migrated VMs to the host at the same time, as normal. I've had a
similar thing happen with a 3.0.2 install of CloudStack and everything
seamlessly restarted. Disappointing that this happened with 4.1.

On Wed, Jul 24, 2013 at 9:23 AM, Indra Pramana <in...@sg.or.id> wrote:
> Dear Chip, Geoff and all,
>
> I scrutinized the management server's logs during the time when I shutdown
> the host and the time when I turned the host back on.
>
> This is the management server's logs when the host is being shut down:
>
> http://pastebin.com/4wfV830Z
>
> During the time, I noted that there are quite a lot of "Sending Disconnect
> to listener" messages, which implies that the management server try to
> notify other listeners that the host is going down. However, subsequently I
> didn't see any messages on the logs showing that the management server is
> trying to activate the HA capability to start the affected VMs on another
> available host.
>
> This is the management server's logs when the host is being turned back on:
>
> http://pastebin.com/JrLJxbXH
>
> When the agent is reconnected, then CloudStack marked the affected VMs as
> stopped from previously running:
>
> ===
> 2013-07-24 23:04:57,406 DEBUG [cloud.vm.VirtualMachineManagerImpl]
> (AgentConnectTaskPool-7:null) Found 5 VMs for host 34
> 2013-07-24 23:04:57,408 DEBUG [cloud.vm.VirtualMachineManagerImpl]
> (AgentConnectTaskPool-7:null) VM i-2-273-VM: cs state = Running and
> realState = Stopped
> 2013-07-24 23:04:57,408 DEBUG [cloud.vm.VirtualMachineManagerImpl]
> (AgentConnectTaskPool-7:null) VM i-2-273-VM: cs state = Running and
> realState = Stopped
> 2013-07-24 23:04:57,408 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
> (AgentConnectTaskPool-7:null) VM does not require investigation so I'm
> marking it as Stopped: VM[User|Ubuntu-12-04-2-64bit]
> 2013-07-24 23:04:57,450 DEBUG [cloud.capacity.CapacityManagerImpl]
> (AgentConnectTaskPool-7:null) VM state transitted from :Running to Stopping
> with event: StopRequestedvm's original host id: 28 new host id: 34 host id
> before state transition: 34
> ===
>
> Then the HA starts to kick in.
>
> ===
> 2013-07-24 23:04:57,955 INFO  [cloud.ha.HighAvailabilityManagerImpl]
> (HA-Worker-1:work-307) Processing HAWork[307-HA-273-Stopped-Scheduled]
> 2013-07-24 23:04:57,956 DEBUG [cloud.capacity.CapacityManagerImpl]
> (AgentConnectTaskPool-7:null) VM state transitted from :Running to Stopping
> with event: StopRequestedvm's original host id: 28 new host id: 34 host id
> before state transition: 34
> 2013-07-24 23:04:57,960 DEBUG [agent.transport.Request]
> (AgentConnectTaskPool-7:null) Seq 34-105644038: Sending  { Cmd , MgmtId:
> 161342671900, via: 34, Ver: v1, Flags: 100111,
> [{"StopCommand":{"isProxy":false,"vmName":"i-2-281-VM","wait":0}}] }
> 2013-07-24 23:04:57,968 INFO  [cloud.ha.HighAvailabilityManagerImpl]
> (HA-Worker-1:work-307) HA on VM[User|Ubuntu-12-04-2-64bit]
> 2013-07-24 23:04:57,984 DEBUG [cloud.capacity.CapacityManagerImpl]
> (HA-Worker-1:work-307) VM state transitted from :Stopped to Starting with
> event: StartRequestedvm's original host id: 28 new host id: null host id
> before state transition: null
> 2013-07-24 23:04:57,984 DEBUG [cloud.vm.VirtualMachineManagerImpl]
> (HA-Worker-1:work-307) Successfully transitioned to start state for
> VM[User|Ubuntu-12-04-2-64bit] reservation id =
> b56364ef-90d8-443f-a348-7660fda48d34
> 2013-07-24 23:04:58,025 DEBUG [cloud.vm.VirtualMachineManagerImpl]
> (HA-Worker-1:work-307) Trying to deploy VM, vm has dcId: 6 and podId: 6
> 2013-07-24 23:04:58,025 DEBUG [cloud.vm.VirtualMachineManagerImpl]
> (HA-Worker-1:work-307) Deploy avoids pods: null, clusters: null, hosts: null
> 2013-07-24 23:04:58,031 DEBUG [cloud.vm.VirtualMachineManagerImpl]
> (HA-Worker-1:work-307) Root volume is ready, need to place VM in volume's
> cluster
> 2013-07-24 23:04:58,031 DEBUG [cloud.vm.VirtualMachineManagerImpl]
> (HA-Worker-1:work-307) Vol[295|vm=273|ROOT] is READY, changing deployment
> plan to use this pool's dcId: 6 , podId: 6 , and clusterId: 6
> ===
>
> My question is why HA only kicks in when the host is turned back on? By
> right it should kick in soon after the host is shut down and marked as
> "Disconnected".
>
> Any insights on the possible solutions to this problem is highly
> appreciated.
>
> Looking forward to your reply, thank you.
>
> Cheers.

Re: HA not working - CloudStack 4.1.0 and KVM hypervisor hosts

Posted by Indra Pramana <in...@sg.or.id>.

Dear Chip, Geoff and all,

I went through the management server's logs for the period when I shut
down the host and the period when I turned the host back on.

These are the management server's logs from when the host was being shut
down:

http://pastebin.com/4wfV830Z

During that time, I noticed quite a lot of "Sending Disconnect to
listener" messages, which implies that the management server tries to
notify other listeners that the host is going down. However, after that I
did not see any messages in the logs showing that the management server
was trying to activate HA and start the affected VMs on another available
host.

These are the management server's logs from when the host was turned back
on:

http://pastebin.com/JrLJxbXH

When the agent reconnected, CloudStack marked the affected VMs as Stopped;
until then they had still been shown as Running:

===
2013-07-24 23:04:57,406 DEBUG [cloud.vm.VirtualMachineManagerImpl]
(AgentConnectTaskPool-7:null) Found 5 VMs for host 34
2013-07-24 23:04:57,408 DEBUG [cloud.vm.VirtualMachineManagerImpl]
(AgentConnectTaskPool-7:null) VM i-2-273-VM: cs state = Running and
realState = Stopped
2013-07-24 23:04:57,408 DEBUG [cloud.vm.VirtualMachineManagerImpl]
(AgentConnectTaskPool-7:null) VM i-2-273-VM: cs state = Running and
realState = Stopped
2013-07-24 23:04:57,408 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
(AgentConnectTaskPool-7:null) VM does not require investigation so I'm
marking it as Stopped: VM[User|Ubuntu-12-04-2-64bit]
2013-07-24 23:04:57,450 DEBUG [cloud.capacity.CapacityManagerImpl]
(AgentConnectTaskPool-7:null) VM state transitted from :Running to Stopping
with event: StopRequestedvm's original host id: 28 new host id: 34 host id
before state transition: 34
===

Only then does HA start to kick in:

===
2013-07-24 23:04:57,955 INFO  [cloud.ha.HighAvailabilityManagerImpl]
(HA-Worker-1:work-307) Processing HAWork[307-HA-273-Stopped-Scheduled]
2013-07-24 23:04:57,956 DEBUG [cloud.capacity.CapacityManagerImpl]
(AgentConnectTaskPool-7:null) VM state transitted from :Running to Stopping
with event: StopRequestedvm's original host id: 28 new host id: 34 host id
before state transition: 34
2013-07-24 23:04:57,960 DEBUG [agent.transport.Request]
(AgentConnectTaskPool-7:null) Seq 34-105644038: Sending  { Cmd , MgmtId:
161342671900, via: 34, Ver: v1, Flags: 100111,
[{"StopCommand":{"isProxy":false,"vmName":"i-2-281-VM","wait":0}}] }
2013-07-24 23:04:57,968 INFO  [cloud.ha.HighAvailabilityManagerImpl]
(HA-Worker-1:work-307) HA on VM[User|Ubuntu-12-04-2-64bit]
2013-07-24 23:04:57,984 DEBUG [cloud.capacity.CapacityManagerImpl]
(HA-Worker-1:work-307) VM state transitted from :Stopped to Starting with
event: StartRequestedvm's original host id: 28 new host id: null host id
before state transition: null
2013-07-24 23:04:57,984 DEBUG [cloud.vm.VirtualMachineManagerImpl]
(HA-Worker-1:work-307) Successfully transitioned to start state for
VM[User|Ubuntu-12-04-2-64bit] reservation id =
b56364ef-90d8-443f-a348-7660fda48d34
2013-07-24 23:04:58,025 DEBUG [cloud.vm.VirtualMachineManagerImpl]
(HA-Worker-1:work-307) Trying to deploy VM, vm has dcId: 6 and podId: 6
2013-07-24 23:04:58,025 DEBUG [cloud.vm.VirtualMachineManagerImpl]
(HA-Worker-1:work-307) Deploy avoids pods: null, clusters: null, hosts: null
2013-07-24 23:04:58,031 DEBUG [cloud.vm.VirtualMachineManagerImpl]
(HA-Worker-1:work-307) Root volume is ready, need to place VM in volume's
cluster
2013-07-24 23:04:58,031 DEBUG [cloud.vm.VirtualMachineManagerImpl]
(HA-Worker-1:work-307) Vol[295|vm=273|ROOT] is READY, changing deployment
plan to use this pool's dcId: 6 , podId: 6 , and clusterId: 6
===
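
The timestamps in the two excerpts above can be compared mechanically. The
following is only a minimal sketch: it assumes the log entry layout shown in
the excerpts (timestamp, level, logger in square brackets, message possibly
wrapped over several lines), and the helper names parse_entries and
first_match are made up for illustration, not part of CloudStack. It measures
how long after the agent reconnect the first HA work item is processed.

```python
import re
from datetime import datetime

# Start of a log entry as it appears in the excerpts, e.g.
# "2013-07-24 23:04:57,955 INFO  [cloud.ha.HighAvailabilityManagerImpl]"
ENTRY_START = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+(\w+)\s+\[([^\]]+)\]"
)

def parse_entries(lines):
    """Join wrapped lines into (timestamp, level, logger, message) tuples."""
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        m = ENTRY_START.match(line)
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f")
            entries.append([ts, m.group(2), m.group(3), line[m.end():].strip()])
        elif entries and line.strip():
            # Continuation of the previous entry (wrapped by the mail client).
            entries[-1][3] = (entries[-1][3] + " " + line.strip()).strip()
    return [tuple(e) for e in entries]

def first_match(entries, logger_suffix, needle):
    """Timestamp of the first entry from the given logger containing needle."""
    for ts, _level, logger, message in entries:
        if logger.endswith(logger_suffix) and needle in message:
            return ts
    return None

# Three entries copied from the excerpts above.
SAMPLE = """\
2013-07-24 23:04:57,406 DEBUG [cloud.vm.VirtualMachineManagerImpl]
(AgentConnectTaskPool-7:null) Found 5 VMs for host 34
2013-07-24 23:04:57,408 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
(AgentConnectTaskPool-7:null) VM does not require investigation so I'm
marking it as Stopped: VM[User|Ubuntu-12-04-2-64bit]
2013-07-24 23:04:57,955 INFO  [cloud.ha.HighAvailabilityManagerImpl]
(HA-Worker-1:work-307) Processing HAWork[307-HA-273-Stopped-Scheduled]
"""

if __name__ == "__main__":
    entries = parse_entries(SAMPLE.splitlines())
    reconnect = first_match(entries, "VirtualMachineManagerImpl", "VMs for host")
    ha_start = first_match(entries, "HighAvailabilityManagerImpl", "Processing HAWork")
    print("agent reconnect seen at:", reconnect)
    print("first HA work item at:  ", ha_start)
    print("gap:", ha_start - reconnect)
```

Running the same search over the full shutdown-time log for "Processing
HAWork" should return None if HA was never scheduled while the host was
down, which would match what is described here.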

My question is: why does HA only kick in when the host is turned back on?
It should kick in soon after the host is shut down and marked as
"Disconnected".

Any insight into possible solutions to this problem would be highly
appreciated.

Looking forward to your reply, thank you.

Cheers.



On Thu, Jul 25, 2013 at 12:00 AM, Indra Pramana <in...@sg.or.id> wrote:

> Hi Chip,
>
> Yes, "Offer HA" is set to "Yes" on all my compute offerings.
>
> Hi Geoff,
>
> Yes, I am using KVM. Is this a known issue and is there any solution to
> this problem?
>
> Looking forward to your reply, thank you.
>
> Cheers.
>
>
>
> On Wed, Jul 24, 2013 at 11:38 PM, Geoff Higginbottom <
> geoff.higginbottom@shapeblue.com> wrote:
>
>> Is it running on KVM, we are seeing some real issue with HA simply not
>> working on KVM.
>>
>> Regards
>>
>> Geoff Higginbottom
>>
>> D: +44 20 3603 0542 | S: +44 20 3603 0540 | M: +447968161581
>>
>> geoff.higginbottom@shapeblue.com
>>
>> -----Original Message-----
>> From: Chip Childers [mailto:chip.childers@sungard.com]
>> Sent: 24 July 2013 16:37
>> To: <us...@cloudstack.apache.org>
>> Subject: Re: HA not working - CloudStack 4.1.0 and KVM hypervisor hosts
>>
>> Did you enable HA for your compute offering?
>>
>> On Jul 24, 2013, at 11:25 AM, Indra Pramana <in...@sg.or.id> wrote:
>>
>> > Dear all,
>> >
>> > I tried to shutdown one of my hypervisor hosts to simulate a server
>> > failure, and the HA is not working, all the VMs on the affected host
>> > is not started on another available host.
>> >
>> > I am using CloudStack 4.1.0 with KVM hypervisors and Ceph RBD for
>> > primary storage.
>> >
>> > My issue is similar to what is being described here:
>> >
>> > https://issues.apache.org/jira/browse/CLOUDSTACK-3535
>> >
>> > Except that on my case, the host is indeed marked as "Disconnected"
>> > but there is no attempt from CloudStack to try starting the VMs on
>> > another host. I can't provide logs since there's nothing on the logs
>> > which suggest that CloudStack tries to activate the HA and start the
>> > affected VMs on another host.
>> >
>> > Anyone has similar experience? Anyone knows if the above bug has been
>> > resolved?
>> >
>> > Looking forward to your reply, thank you.
>> >
>> > Cheers.
>>
>
>

Re: HA not working - CloudStack 4.1.0 and KVM hypervisor hosts

Posted by Indra Pramana <in...@sg.or.id>.

Hi Chip,

Yes, "Offer HA" is set to "Yes" on all my compute offerings.

Hi Geoff,

Yes, I am using KVM. Is this a known issue and is there any solution to
this problem?

Looking forward to your reply, thank you.

Cheers.



On Wed, Jul 24, 2013 at 11:38 PM, Geoff Higginbottom <
geoff.higginbottom@shapeblue.com> wrote:

> Is it running on KVM, we are seeing some real issue with HA simply not
> working on KVM.
>
> Regards
>
> Geoff Higginbottom
>
> D: +44 20 3603 0542 | S: +44 20 3603 0540 | M: +447968161581
>
> geoff.higginbottom@shapeblue.com
>
> -----Original Message-----
> From: Chip Childers [mailto:chip.childers@sungard.com]
> Sent: 24 July 2013 16:37
> To: <us...@cloudstack.apache.org>
> Subject: Re: HA not working - CloudStack 4.1.0 and KVM hypervisor hosts
>
> Did you enable HA for your compute offering?
>
> On Jul 24, 2013, at 11:25 AM, Indra Pramana <in...@sg.or.id> wrote:
>
> > Dear all,
> >
> > I tried to shutdown one of my hypervisor hosts to simulate a server
> > failure, and the HA is not working, all the VMs on the affected host
> > is not started on another available host.
> >
> > I am using CloudStack 4.1.0 with KVM hypervisors and Ceph RBD for
> > primary storage.
> >
> > My issue is similar to what is being described here:
> >
> > https://issues.apache.org/jira/browse/CLOUDSTACK-3535
> >
> > Except that on my case, the host is indeed marked as "Disconnected"
> > but there is no attempt from CloudStack to try starting the VMs on
> > another host. I can't provide logs since there's nothing on the logs
> > which suggest that CloudStack tries to activate the HA and start the
> > affected VMs on another host.
> >
> > Anyone has similar experience? Anyone knows if the above bug has been
> > resolved?
> >
> > Looking forward to your reply, thank you.
> >
> > Cheers.
>

RE: HA not working - CloudStack 4.1.0 and KVM hypervisor hosts

Posted by Geoff Higginbottom <ge...@shapeblue.com>.

Is it running on KVM? We are seeing some real issues with HA simply not working on KVM.

Regards

Geoff Higginbottom

D: +44 20 3603 0542 | S: +44 20 3603 0540 | M: +447968161581

geoff.higginbottom@shapeblue.com

-----Original Message-----
From: Chip Childers [mailto:chip.childers@sungard.com]
Sent: 24 July 2013 16:37
To: <us...@cloudstack.apache.org>
Subject: Re: HA not working - CloudStack 4.1.0 and KVM hypervisor hosts

Did you enable HA for your compute offering?

On Jul 24, 2013, at 11:25 AM, Indra Pramana <in...@sg.or.id> wrote:

> Dear all,
>
> I tried to shutdown one of my hypervisor hosts to simulate a server
> failure, and the HA is not working, all the VMs on the affected host
> is not started on another available host.
>
> I am using CloudStack 4.1.0 with KVM hypervisors and Ceph RBD for
> primary storage.
>
> My issue is similar to what is being described here:
>
> https://issues.apache.org/jira/browse/CLOUDSTACK-3535
>
> Except that on my case, the host is indeed marked as "Disconnected"
> but there is no attempt from CloudStack to try starting the VMs on
> another host. I can't provide logs since there's nothing on the logs
> which suggest that CloudStack tries to activate the HA and start the
> affected VMs on another host.
>
> Anyone has similar experience? Anyone knows if the above bug has been
> resolved?
>
> Looking forward to your reply, thank you.
>
> Cheers.
This email and any attachments to it may be confidential and are intended solely for the use of the individual to whom it is addressed. Any views or opinions expressed are solely those of the author and do not necessarily represent those of Shape Blue Ltd or related companies. If you are not the intended recipient of this email, you must neither take any action based upon its contents, nor copy or show it to anyone. Please contact the sender if you believe you have received this email in error. Shape Blue Ltd is a company incorporated in England & Wales. ShapeBlue Services India LLP is operated under license from Shape Blue Ltd. ShapeBlue is a registered trademark.

Re: HA not working - CloudStack 4.1.0 and KVM hypervisor hosts

Posted by Chip Childers <ch...@sungard.com>.

Did you enable HA for your compute offering?
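
One way to check this outside the UI is to look at the JSON returned by the
listServiceOfferings API call. The following is only a minimal sketch: it
assumes the response is wrapped in "listserviceofferingsresponse" with a
"serviceoffering" list whose items carry a boolean "offerha" field, which
should be verified against the API reference for the installed version. The
sample response and the helper name offerings_without_ha are invented for
illustration, and the sketch does not make the (signed) API request itself.

```python
import json

def offerings_without_ha(response_text):
    """Return the names of compute offerings whose 'offerha' flag is not set.

    Assumes the JSON layout described above; a missing flag counts as 'no HA'.
    """
    body = json.loads(response_text)
    wrapper = body.get("listserviceofferingsresponse", {})
    offerings = wrapper.get("serviceoffering", [])
    return [o.get("name", "<unnamed>") for o in offerings
            if not o.get("offerha", False)]

# Hypothetical sample response, not taken from a real deployment.
SAMPLE_RESPONSE = """
{"listserviceofferingsresponse": {"count": 2, "serviceoffering": [
  {"name": "Medium-HA", "offerha": true},
  {"name": "Small-NoHA", "offerha": false}
]}}
"""

if __name__ == "__main__":
    for name in offerings_without_ha(SAMPLE_RESPONSE):
        print("HA is not offered by:", name)
```

An empty result would mean every listed offering has HA enabled, so the
cause has to be looked for elsewhere.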

On Jul 24, 2013, at 11:25 AM, Indra Pramana <in...@sg.or.id> wrote:

> Dear all,
>
> I tried to shutdown one of my hypervisor hosts to simulate a server
> failure, and the HA is not working, all the VMs on the affected host is not
> started on another available host.
>
> I am using CloudStack 4.1.0 with KVM hypervisors and Ceph RBD for primary
> storage.
>
> My issue is similar to what is being described here:
>
> https://issues.apache.org/jira/browse/CLOUDSTACK-3535
>
> Except that on my case, the host is indeed marked as "Disconnected" but
> there is no attempt from CloudStack to try starting the VMs on another
> host. I can't provide logs since there's nothing on the logs which suggest
> that CloudStack tries to activate the HA and start the affected VMs on
> another host.
>
> Anyone has similar experience? Anyone knows if the above bug has been
> resolved?
>
> Looking forward to your reply, thank you.
>
> Cheers.