Posted to dev@cloudstack.apache.org by Sean Lair <sl...@ippathways.com> on 2018/02/16 15:56:19 UTC

RE: HA issues

We've done a lot of work on VM HA (we are on 4.9.3) and have it working reliably.  We've also been able to stop the problem of VMs getting started on two hosts during some HA events.  Since this is 4.9.3, we do not use IPMI for this functionality.  We have not tested how the addition of IPMI in 4.11 affects our patch.

We are running KVM w/ NFS storage.  If you like I can get you our patch for testing.  



-----Original Message-----
From: Nux! [mailto:nux@li.nux.ro] 
Sent: Monday, January 22, 2018 8:15 AM
To: dev <de...@cloudstack.apache.org>
Subject: Re: HA issues

Hi,

Installed and reinstalled, VM HA just does not work for me.
In addition, if the HV going AWOL is hosting the systemvms, then they also do not get restarted despite available HVs online.
I've opened another ticket with logs:

https://issues.apache.org/jira/browse/CLOUDSTACK-10246

Happy to allow access to my rig if it helps.

I've disabled the firewall and whatnot, and also left out other bits of network hardware just to keep it simpler; still no go.

--
Sent from the Delta quadrant using Borg technology!

Nux!
www.nux.ro

----- Original Message -----
> From: "Paul Angus" <pa...@shapeblue.com>
> To: "dev" <de...@cloudstack.apache.org>
> Sent: Saturday, 20 January, 2018 08:40:01
> Subject: RE: HA issues

> No problem,
> 
> To be honest host-ha was developed *because* vm-ha was not reliable 
> under a number of conditions, including a host failure.
> 
> paul.angus@shapeblue.com
> www.shapeblue.com
> 53 Chandos Place, Covent Garden, London  WC2N 4HS UK @shapeblue
>  
> 
> 
> 
> -----Original Message-----
> From: Nux! [mailto:nux@li.nux.ro]
> Sent: 19 January 2018 14:26
> To: dev <de...@cloudstack.apache.org>
> Subject: Re: HA issues
> 
> Hi Paul,
> 
> Thanks for checking. My compute offering is HA enabled, of course.
> Host HA is disabled as well as OOBM.
> 
> 
> I'll do the tests again on Monday and report back.
> 
> --
> Sent from the Delta quadrant using Borg technology!
> 
> Nux!
> www.nux.ro
> 
> ----- Original Message -----
>> From: "Paul Angus" <pa...@shapeblue.com>
>> To: "dev" <de...@cloudstack.apache.org>
>> Sent: Friday, 19 January, 2018 14:10:06
>> Subject: RE: HA issues
> 
>> Hey Nux,
>> 
>> I've been testing out the host-ha feature against a couple of physical hosts.
>> I've found that if the compute offering isn't ha-enabled, then the vm isn't
>> restarted on the original host when it is rebooted, or on any other host.  If
>> the vm is ha-enabled, then the vm is restarted on the original host 
>> when host ha restarts the host.
>> 
>> Can you double check that the instance was an ha-enabled one?
>> 
>> OR
>> maybe the timeouts for the host-ha are too long and the vm-ha 
>> timed out beforehand ...?
>> 
>> 
>> 
>> Kind regards,
>> 
>> Paul Angus
>> 
>> paul.angus@shapeblue.com
>> www.shapeblue.com
>> 53 Chandos Place, Covent Garden, London  WC2N 4HS UK @shapeblue
>>  
>> 
>> 
>> 
>> -----Original Message-----
>> From: Nux! [mailto:nux@li.nux.ro]
>> Sent: 17 January 2018 09:12
>> To: dev <de...@cloudstack.apache.org>
>> Subject: Re: HA issues
>> 
>> Right, sorry for using the terms interchangeably, I see what you mean.
>> 
>> I'll do further testing then as VM HA was also not working in my setup.
>> 
>> I'll be back.
>> 
>> --
>> Sent from the Delta quadrant using Borg technology!
>> 
>> Nux!
>> www.nux.ro
>> 
>> ----- Original Message -----
>>> From: "Rohit Yadav" <ro...@shapeblue.com>
>>> To: "dev" <de...@cloudstack.apache.org>
>>> Sent: Wednesday, 17 January, 2018 09:09:19
>>> Subject: Re: HA issues
>> 
>>> Hi Lucian,
>>> 
>>> 
>>> The "Host HA" feature is entirely different from VM HA, however, 
>>> they may work in tandem, so please stop using the terms 
>>> interchangeably as it may cause the community to believe a regression has been caused.
>>> 
>>> 
>>> The "Host HA" feature currently ships with only one "Host HA" provider, 
>>> for KVM, which is strictly tied to out-of-band management (IPMI for 
>>> fencing, i.e. power off, and recovery, i.e. reboot) and NFS (as primary storage).
>>> (We also have a provider for simulator, but that's for 
>>> coverage/testing purposes).
>>> 
>>> 
>>> Therefore, "Host HA" for KVM (+nfs) currently works only when OOBM is enabled.
>>> The framework allows interested parties to write their own HA 
>>> providers for a hypervisor that can use a different 
>>> strategy/mechanism for fencing/recovery of hosts (including writing a 
>>> non-IPMI based OOBM plugin) and a host/disk activity checker that is 
>>> non-NFS based.
>>> 
>>> 
>>> The "Host HA" feature ships disabled by default and does not cause 
>>> any interference with VM HA. However, when enabled and configured 
>>> correctly, it is a known limitation that when it is unable to 
>>> successfully perform recovery or fencing tasks it may not trigger VM 
>>> HA. We can discuss how to handle such cases (thoughts?). "Host HA"
>>> would try a couple of times to recover and, failing to do so, would 
>>> eventually trigger a host fencing task. If it's unable to fence a 
>>> host, it will attempt to fence the host indefinitely (the host state 
>>> will be stuck at the fencing state in the cloud.ha_config table, for 
>>> example), and alerts will be sent to the admin, who can intervene 
>>> manually to handle such situations (if you have email/SMTP enabled, 
>>> you should see alert emails).
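
The recover-then-fence sequence described above can be sketched roughly as
follows. This is only an illustration of the behaviour as described in this
message, not CloudStack's actual code: the enum, the function names, the
recover/fence callables and the max_fence_attempts cap (added purely so a demo
run terminates) are all assumptions. It also assumes that VM HA can proceed
once a host has been either recovered or fenced.

```python
from enum import Enum
from typing import Callable, Optional


class HostHAState(Enum):
    RECOVERED = "recovered"
    FENCED = "fenced"
    FENCING = "fencing"  # still trying to fence; VM HA is not triggered


def host_ha_cycle(
    recover: Callable[[], bool],
    fence: Callable[[], bool],
    max_recovery_attempts: int = 2,            # "a couple of times" (assumed value)
    max_fence_attempts: Optional[int] = None,  # None = retry forever, as described
    alert: Callable[[str], None] = print,
) -> HostHAState:
    """Hypothetical model of the described Host HA flow for a single host."""
    # Step 1: try to recover (e.g. reboot via OOBM) a limited number of times.
    for _ in range(max_recovery_attempts):
        if recover():
            return HostHAState.RECOVERED

    # Step 2: recovery failed, so keep trying to fence (e.g. power off).
    attempts = 0
    while max_fence_attempts is None or attempts < max_fence_attempts:
        attempts += 1
        if fence():
            return HostHAState.FENCED
        alert(f"fencing attempt {attempts} failed")

    # Only reachable when a cap is set; models the host being stuck in fencing.
    return HostHAState.FENCING


def vm_ha_can_be_scheduled(state: HostHAState) -> bool:
    # Assumption: VM HA proceeds only after a successful recovery or fence.
    return state in (HostHAState.RECOVERED, HostHAState.FENCED)


if __name__ == "__main__":
    # Simulate an unreachable OOBM interface: recovery and fencing both fail.
    state = host_ha_cycle(lambda: False, lambda: False, max_fence_attempts=3)
    print(state, vm_ha_can_be_scheduled(state))
```

With both callables always failing, the host never leaves the fencing state
and VM HA is never scheduled, which matches the limitation reported earlier in
the thread for a host whose IPMI is unreachable.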
>>> 
>>> 
>>> We can discuss how to improve and have a workaround for the case 
>>> you've hit, thanks for sharing.
>>> 
>>> 
>>> - Rohit
>>> 
>>> ________________________________
>>> From: Nux! <nu...@li.nux.ro>
>>> Sent: Tuesday, January 16, 2018 10:42:35 PM
>>> To: dev
>>> Subject: Re: HA issues
>>> 
>>> Ok, reinstalled and re-tested.
>>> 
>>> What I've learned:
>>> 
>>> - HA now only works if OOB is configured; the old way of doing HA no 
>>> longer applies - this can be good and bad, not everyone has IPMIs
>>> 
>>> - HA only works if IPMI is reachable. I've pulled the cord on a HV 
>>> and HA failed to do its thing, leaving me with a HV down along with 
>>> all the VMs running there. That's bad.
>>> I've opened this ticket for it:
>>> https://issues.apache.org/jira/browse/CLOUDSTACK-10234
>>> 
>>> Let me know if you need any extra info or stuff to test.
>>> 
>>> Regards,
>>> Lucian
>>> 
>>> --
>>> Sent from the Delta quadrant using Borg technology!
>>> 
>>> Nux!
>>> www.nux.ro
>>> 
>>> 
>>> rohit.yadav@shapeblue.com
>>> www.shapeblue.com
>>> 53 Chandos Place, Covent Garden, London  WC2N 4HS UK @shapeblue
>>>  
>>> 
>>> 
>>> ----- Original Message -----
>>>> From: "Nux!" <nu...@li.nux.ro>
>>>> To: "dev" <de...@cloudstack.apache.org>
>>>> Sent: Tuesday, 16 January, 2018 11:35:58
>>>> Subject: Re: HA issues
>>> 
>>>> I'll reinstall my setup and try again, just to be sure I'm working 
>>>> on a clean slate.
>>>>
>>>> --
>>>> Sent from the Delta quadrant using Borg technology!
>>>>
>>>> Nux!
>>>> www.nux.ro
>>>>
>>>> ----- Original Message -----
>>>>> From: "Rohit Yadav" <ro...@shapeblue.com>
>>>>> To: "dev" <de...@cloudstack.apache.org>
>>>>> Sent: Tuesday, 16 January, 2018 11:29:51
>>>>> Subject: Re: HA issues
>>>>
>>>>> Hi Lucian,
>>>>>
>>>>>
>>>>> If you're talking about the new HostHA feature (with 
>>>>> KVM+nfs+ipmi), please refer to following docs:
>>>>>
>>>>> http://docs.cloudstack.apache.org/projects/cloudstack-administration/en/latest/hosts.html#out-of-band-management
>>>>>
>>>>> https://cwiki.apache.org/confluence/display/CLOUDSTACK/Host+HA
>>>>>
>>>>>
>>>>> We'll need you to look at the logs; perhaps create a JIRA ticket 
>>>>> with the logs and details? If you saw an IPMI-based reboot, then 
>>>>> host-ha did indeed try to recover, i.e. reboot, the host. Once 
>>>>> hostha has done its work, it schedules HA for the VM as soon as the 
>>>>> recovery operation succeeds (we have simulator and kvm based marvin 
>>>>> tests for such scenarios).
>>>>>
>>>>>
>>>>> Can you see it making an attempt to schedule VM HA in the logs, or any failure?
>>>>>
>>>>>
>>>>> - Rohit
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> ________________________________
>>>>> From: Nux! <nu...@li.nux.ro>
>>>>> Sent: Tuesday, January 16, 2018 12:47:56 AM
>>>>> To: dev
>>>>> Subject: [4.11] HA issues
>>>>>
>>>>> Hi,
>>>>>
>>>>> I see there's a new HA engine for KVM and IPMI support which is 
>>>>> really nice, however it seems hit and miss.
>>>>> I created an instance with an HA offering and kernel-panicked one 
>>>>> of the hypervisors. After a while the server was rebooted, 
>>>>> probably via IPMI, but the instance never moved to a running 
>>>>> hypervisor, and even after the original hypervisor came back it 
>>>>> was still left in the Stopped state.
>>>>> Are there any extra things I need to set up to have proper HA?
>>>>>
>>>>> Regards,
>>>>> Lucian
>>>>>
>>>>> --
>>>>> Sent from the Delta quadrant using Borg technology!
>>>>>
>>>>> Nux!
>>>>> www.nux.ro
>>>>>
>>>>> rohit.yadav@shapeblue.com
>>>>> www.shapeblue.com
>>>>> 53 Chandos Place, Covent Garden, London  WC2N 4HS UK @shapeblue