You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@cloudstack.apache.org by Nux! <nu...@li.nux.ro> on 2018/01/15 19:17:56 UTC

[4.11] HA issues

Hi,

I see there's a new HA engine for KVM and IPMI support which is really nice, however it seems hit and miss.
I have created an instance with HA offering, kernel panicked one of the hypervisors - after a while the server was rebooted via IPMI probably, but the instance never moved to a running hypervisor and even after the original hypervisor came back it was still left in Stopped state.
Is there any extra things I need to set up to have proper HA?

Regards,
Lucian

--
Sent from the Delta quadrant using Borg technology!

Nux!
www.nux.ro

RE: HA issues

Posted by Paul Angus <pa...@shapeblue.com>.
Hi Sean,

I have a few questions please.
Could you explain to me as a non-developer, how CloudStack now determines that a host is actually 'Down'. Also, what happens if an operator is using block storage which virtlockd doesn't support, and how does CloudStack determine that virtlockd is installed/configure and enabled on the hosts?

I'm just trying to understand the use case, thanks.

paul.angus@shapeblue.com 
www.shapeblue.com
53 Chandos Place, Covent Garden, London  WC2N 4HSUK
@shapeblue
  
 


-----Original Message-----
From: Nux! [mailto:nux@li.nux.ro] 
Sent: 02 March 2018 14:41
To: dev <de...@cloudstack.apache.org>
Subject: Re: HA issues

Thanks, looking forward to having HA in Cloudstack again! :-)

--
Sent from the Delta quadrant using Borg technology!

Nux!
www.nux.ro

----- Original Message -----
> From: "Sean Lair" <sl...@ippathways.com>
> To: "dev" <de...@cloudstack.apache.org>
> Sent: Thursday, 1 March, 2018 22:01:50
> Subject: RE: HA issues

> FYI Nux, I opened the following PR for the change we made in our 
> environment to get VM HA to work.  I referenced your ticket!
> 
> https://github.com/apache/cloudstack/pull/2474
> 
> 
> -----Original Message-----
> From: Nux! [mailto:nux@li.nux.ro]
> Sent: Monday, January 22, 2018 8:15 AM
> To: dev <de...@cloudstack.apache.org>
> Subject: Re: HA issues
> 
> Hi,
> 
> Installed and reinstalled, VM HA just does not work for me.
> In addition, if the HV going AWOL is hosting the systemvms, then they 
> also do not get restarted despite available HVs online.
> I've opened another ticket with logs:
> 
> https://issues.apache.org/jira/browse/CLOUDSTACK-10246
> 
> Happy to allow access to my rig if it helps.
> 
> I've disabled firewall and whatnot also left out other bits of network 
> hardware just to keep it simpler, still no go.
> 
> --
> Sent from the Delta quadrant using Borg technology!
> 
> Nux!
> www.nux.ro
> 
> ----- Original Message -----
>> From: "Paul Angus" <pa...@shapeblue.com>
>> To: "dev" <de...@cloudstack.apache.org>
>> Sent: Saturday, 20 January, 2018 08:40:01
>> Subject: RE: HA issues
> 
>> No problem,
>> 
>> To be honest host-ha was developed *because* vm-ha was not reliable 
>> under a number of conditions, including a host failure.
>> 
>> paul.angus@shapeblue.com
>> www.shapeblue.com
>> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK @shapeblue
>>  
>> 
>> 
>> 
>> -----Original Message-----
>> From: Nux! [mailto:nux@li.nux.ro]
>> Sent: 19 January 2018 14:26
>> To: dev <de...@cloudstack.apache.org>
>> Subject: Re: HA issues
>> 
>> Hi Paul,
>> 
>> Thanks for checking. My compute offering is HA enabled, of course.
>> Host HA is disabled as well as OOBM.
>> 
>> 
>> I'll do the tests again on Monday and report back.
>> 
>> --
>> Sent from the Delta quadrant using Borg technology!
>> 
>> Nux!
>> www.nux.ro
>> 
>> ----- Original Message -----
>>> From: "Paul Angus" <pa...@shapeblue.com>
>>> To: "dev" <de...@cloudstack.apache.org>
>>> Sent: Friday, 19 January, 2018 14:10:06
>>> Subject: RE: HA issues
>> 
>>> Hey Nux,
>>> 
>>> I've being testing out the host-ha feature against a couple of physical hosts.
>>> I've found that if the compute offering isn't ha enabled, then the vm isn't
>>> restarted on the original host when it is rebooted, or any other host.    If
>>> the vm is ha-enabled, then the vm was restarted on the original host 
>>> when host ha restarted the host.
>>> 
>>> Can you double check that the instance was an ha-enabled one?
>>> 
>>> OR
>>> maybe the timeouts for the host-ha are too long and the vm-ha 
>>> timed-out before hand ...?
>>> 
>>> 
>>> 
>>> Kind regards,
>>> 
>>> Paul Angus
>>> 
>>> paul.angus@shapeblue.com
>>> www.shapeblue.com
>>> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK @shapeblue
>>>  
>>> 
>>> 
>>> 
>>> -----Original Message-----
>>> From: Nux! [mailto:nux@li.nux.ro]
>>> Sent: 17 January 2018 09:12
>>> To: dev <de...@cloudstack.apache.org>
>>> Subject: Re: HA issues
>>> 
>>> Right, sorry for using the terms interchangeably, I see what you mean.
>>> 
>>> I'll do further testing then as VM HA was also not working in my setup.
>>> 
>>> I'll be back.
>>> 
>>> --
>>> Sent from the Delta quadrant using Borg technology!
>>> 
>>> Nux!
>>> www.nux.ro
>>> 
>>> ----- Original Message -----
>>>> From: "Rohit Yadav" <ro...@shapeblue.com>
>>>> To: "dev" <de...@cloudstack.apache.org>
>>>> Sent: Wednesday, 17 January, 2018 09:09:19
>>>> Subject: Re: HA issues
>>> 
>>>> Hi Lucian,
>>>> 
>>>> 
>>>> The "Host HA" feature is entirely different from VM HA, however, 
>>>> they may work in tandem, so please stop using the terms 
>>>> interchangeably as it may cause the community to believe a 
>>>> regression has been caused.
>>>> 
>>>> 
>>>> The "Host HA" feature currently ships with only "Host HA" provider 
>>>> for KVM that is strictly tied to out-of-band management (IPMI for 
>>>> fencing, i.e power off and recovery, i.e. reboot) and NFS (as primary storage).
>>>> (We also have a provider for simulator, but that's for 
>>>> coverage/testing purposes).
>>>> 
>>>> 
>>>> Therefore, "Host HA" for KVM (+nfs) currently works only when OOBM is enabled.
>>>> The frameowkr allows interested parties may write their own HA 
>>>> providers for a hypervisor that can use a different 
>>>> strategy/mechanism for fencing/recovery of hosts (including write a 
>>>> non-IPMI based OOBM
>>>> plugin) and host/disk activity checker that is non-NFS based.
>>>> 
>>>> 
>>>> The "Host HA" feature ships disabled by default and does not cause 
>>>> any interference with VM HA. However, when enabled and configured 
>>>> correctly, it is a known limitation that when it is unable to 
>>>> successfully perform recovery or fencing tasks it may not trigger 
>>>> VM HA. We can discuss how to handle such cases (thoughts?). "Host HA"
>>>> would try couple of times to recover and failing to do so, it would 
>>>> eventually trigger a host fencing task. If it's unable to fence a 
>>>> host, it will indefinitely attempt to fence the host (the host 
>>>> state will be stuck at fencing state in cloud.ha_config table for 
>>>> example) and alerts will be sent to admin who can do some manual 
>>>> intervention to handle such situations (if you've email/smtp 
>>>> enabled, you should see alert emails).
>>>> 
>>>> 
>>>> We can discuss how to improve and have a workaround for the case 
>>>> you've hit, thanks for sharing.
>>>> 
>>>> 
>>>> - Rohit
>>>> 
>>>> ________________________________
>>>> From: Nux! <nu...@li.nux.ro>
>>>> Sent: Tuesday, January 16, 2018 10:42:35 PM
>>>> To: dev
>>>> Subject: Re: HA issues
>>>> 
>>>> Ok, reinstalled and re-tested.
>>>> 
>>>> What I've learned:
>>>> 
>>>> - HA only works now if OOB is configured, the old way HA no longer 
>>>> applies - this can be good and bad, not everyone has IPMIs
>>>> 
>>>> - HA only works if IPMI is reachable. I've pulled the cord on a HV 
>>>> and HA failed to do its thing, leaving me with a HV down along with 
>>>> all the VMs running there. That's bad.
>>>> I've opened this ticket for it:
>>>> https://issues.apache.org/jira/browse/CLOUDSTACK-10234
>>>> 
>>>> Let me know if you need any extra info or stuff to test.
>>>> 
>>>> Regards,
>>>> Lucian
>>>> 
>>>> --
>>>> Sent from the Delta quadrant using Borg technology!
>>>> 
>>>> Nux!
>>>> www.nux.ro
>>>> 
>>>> 
>>>> rohit.yadav@shapeblue.com
>>>> www.shapeblue.com
>>>> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK @shapeblue
>>>>  
>>>> 
>>>> 
>>>> ----- Original Message -----
>>>>> From: "Nux!" <nu...@li.nux.ro>
>>>>> To: "dev" <de...@cloudstack.apache.org>
>>>>> Sent: Tuesday, 16 January, 2018 11:35:58
>>>>> Subject: Re: HA issues
>>>> 
>>>>> I'll reinstall my setup and try again, just to be sure I'm working 
>>>>> on a clean slate.
>>>>>
>>>>> --
>>>>> Sent from the Delta quadrant using Borg technology!
>>>>>
>>>>> Nux!
>>>>> www.nux.ro
>>>>>
>>>>> ----- Original Message -----
>>>>>> From: "Rohit Yadav" <ro...@shapeblue.com>
>>>>>> To: "dev" <de...@cloudstack.apache.org>
>>>>>> Sent: Tuesday, 16 January, 2018 11:29:51
>>>>>> Subject: Re: HA issues
>>>>>
>>>>>> Hi Lucian,
>>>>>>
>>>>>>
>>>>>> If you're talking about the new HostHA feature (with
>>>>>> KVM+nfs+ipmi), please refer to following docs:
>>>>>>
>>>>>> http://docs.cloudstack.apache.org/projects/cloudstack-administrat
>>>>>> i o n /en/latest/hosts.html#out-of-band-management
>>>>>>
>>>>>> https://cwiki.apache.org/confluence/display/CLOUDSTACK/Host+HA
>>>>>>
>>>>>>
>>>>>> We'll need to you look at logs perhaps create a JIRA ticket with 
>>>>>> the logs and details? If you saw ipmi based reboot, then host-ha 
>>>>>> indeed tried to recover i.e. reboot the host, once hostha has 
>>>>>> done its work it would schedule HA for VM as soon as the recovery 
>>>>>> operation succeeds (we've simulator and kvm based marvin tests 
>>>>>> for such scenarios).
>>>>>>
>>>>>>
>>>>>> Can you see it making attempt to schedule VM ha in logs, or any failure?
>>>>>>
>>>>>>
>>>>>> - Rohit
>>>>>>
>>>>>> <https://cloudstack.apache.org>
>>>>>>
>>>>>>
>>>>>>
>>>>>> ________________________________
>>>>>> From: Nux! <nu...@li.nux.ro>
>>>>>> Sent: Tuesday, January 16, 2018 12:47:56 AM
>>>>>> To: dev
>>>>>> Subject: [4.11] HA issues
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I see there's a new HA engine for KVM and IPMI support which is 
>>>>>> really nice, however it seems hit and miss.
>>>>>> I have created an instance with HA offering, kernel panicked one 
>>>>>> of the hypervisors - after a while the server was rebooted via 
>>>>>> IPMI probably, but the instance never moved to a running 
>>>>>> hypervisor and even after the original hypervisor came back it 
>>>>>> was still left in Stopped state.
>>>>>> Is there any extra things I need to set up to have proper HA?
>>>>>>
>>>>>> Regards,
>>>>>> Lucian
>>>>>>
>>>>>> --
>>>>>> Sent from the Delta quadrant using Borg technology!
>>>>>>
>>>>>> Nux!
>>>>>> www.nux.ro
>>>>>>
>>>>>> rohit.yadav@shapeblue.com
>>>>>> www.shapeblue.com<http://www.shapeblue.com>
>>>>>> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK
> > > > > > @shapeblue

Re: HA issues

Posted by Nux! <nu...@li.nux.ro>.
Thanks, looking forward to having HA in Cloudstack again! :-)

--
Sent from the Delta quadrant using Borg technology!

Nux!
www.nux.ro

----- Original Message -----
> From: "Sean Lair" <sl...@ippathways.com>
> To: "dev" <de...@cloudstack.apache.org>
> Sent: Thursday, 1 March, 2018 22:01:50
> Subject: RE: HA issues

> FYI Nux, I opened the following PR for the change we made in our environment to
> get VM HA to work.  I referenced your ticket!
> 
> https://github.com/apache/cloudstack/pull/2474
> 
> 
> -----Original Message-----
> From: Nux! [mailto:nux@li.nux.ro]
> Sent: Monday, January 22, 2018 8:15 AM
> To: dev <de...@cloudstack.apache.org>
> Subject: Re: HA issues
> 
> Hi,
> 
> Installed and reinstalled, VM HA just does not work for me.
> In addition, if the HV going AWOL is hosting the systemvms, then they also do
> not get restarted despite available HVs online.
> I've opened another ticket with logs:
> 
> https://issues.apache.org/jira/browse/CLOUDSTACK-10246
> 
> Happy to allow access to my rig if it helps.
> 
> I've disabled firewall and whatnot also left out other bits of network hardware
> just to keep it simpler, still no go.
> 
> --
> Sent from the Delta quadrant using Borg technology!
> 
> Nux!
> www.nux.ro
> 
> ----- Original Message -----
>> From: "Paul Angus" <pa...@shapeblue.com>
>> To: "dev" <de...@cloudstack.apache.org>
>> Sent: Saturday, 20 January, 2018 08:40:01
>> Subject: RE: HA issues
> 
>> No problem,
>> 
>> To be honest host-ha was developed *because* vm-ha was not reliable
>> under a number of conditions, including a host failure.
>> 
>> paul.angus@shapeblue.com
>> www.shapeblue.com
>> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK @shapeblue
>>  
>> 
>> 
>> 
>> -----Original Message-----
>> From: Nux! [mailto:nux@li.nux.ro]
>> Sent: 19 January 2018 14:26
>> To: dev <de...@cloudstack.apache.org>
>> Subject: Re: HA issues
>> 
>> Hi Paul,
>> 
>> Thanks for checking. My compute offering is HA enabled, of course.
>> Host HA is disabled as well as OOBM.
>> 
>> 
>> I'll do the tests again on Monday and report back.
>> 
>> --
>> Sent from the Delta quadrant using Borg technology!
>> 
>> Nux!
>> www.nux.ro
>> 
>> ----- Original Message -----
>>> From: "Paul Angus" <pa...@shapeblue.com>
>>> To: "dev" <de...@cloudstack.apache.org>
>>> Sent: Friday, 19 January, 2018 14:10:06
>>> Subject: RE: HA issues
>> 
>>> Hey Nux,
>>> 
>>> I've being testing out the host-ha feature against a couple of physical hosts.
>>> I've found that if the compute offering isn't ha enabled, then the vm isn't
>>> restarted on the original host when it is rebooted, or any other host.    If
>>> the vm is ha-enabled, then the vm was restarted on the original host
>>> when host ha restarted the host.
>>> 
>>> Can you double check that the instance was an ha-enabled one?
>>> 
>>> OR
>>> maybe the timeouts for the host-ha are too long and the vm-ha
>>> timed-out before hand ...?
>>> 
>>> 
>>> 
>>> Kind regards,
>>> 
>>> Paul Angus
>>> 
>>> paul.angus@shapeblue.com
>>> www.shapeblue.com
>>> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK @shapeblue
>>>  
>>> 
>>> 
>>> 
>>> -----Original Message-----
>>> From: Nux! [mailto:nux@li.nux.ro]
>>> Sent: 17 January 2018 09:12
>>> To: dev <de...@cloudstack.apache.org>
>>> Subject: Re: HA issues
>>> 
>>> Right, sorry for using the terms interchangeably, I see what you mean.
>>> 
>>> I'll do further testing then as VM HA was also not working in my setup.
>>> 
>>> I'll be back.
>>> 
>>> --
>>> Sent from the Delta quadrant using Borg technology!
>>> 
>>> Nux!
>>> www.nux.ro
>>> 
>>> ----- Original Message -----
>>>> From: "Rohit Yadav" <ro...@shapeblue.com>
>>>> To: "dev" <de...@cloudstack.apache.org>
>>>> Sent: Wednesday, 17 January, 2018 09:09:19
>>>> Subject: Re: HA issues
>>> 
>>>> Hi Lucian,
>>>> 
>>>> 
>>>> The "Host HA" feature is entirely different from VM HA, however,
>>>> they may work in tandem, so please stop using the terms
>>>> interchangeably as it may cause the community to believe a regression has been
>>>> caused.
>>>> 
>>>> 
>>>> The "Host HA" feature currently ships with only "Host HA" provider
>>>> for KVM that is strictly tied to out-of-band management (IPMI for
>>>> fencing, i.e power off and recovery, i.e. reboot) and NFS (as primary storage).
>>>> (We also have a provider for simulator, but that's for
>>>> coverage/testing purposes).
>>>> 
>>>> 
>>>> Therefore, "Host HA" for KVM (+nfs) currently works only when OOBM is enabled.
>>>> The frameowkr allows interested parties may write their own HA
>>>> providers for a hypervisor that can use a different
>>>> strategy/mechanism for fencing/recovery of hosts (including write a
>>>> non-IPMI based OOBM
>>>> plugin) and host/disk activity checker that is non-NFS based.
>>>> 
>>>> 
>>>> The "Host HA" feature ships disabled by default and does not cause
>>>> any interference with VM HA. However, when enabled and configured
>>>> correctly, it is a known limitation that when it is unable to
>>>> successfully perform recovery or fencing tasks it may not trigger VM
>>>> HA. We can discuss how to handle such cases (thoughts?). "Host HA"
>>>> would try couple of times to recover and failing to do so, it would
>>>> eventually trigger a host fencing task. If it's unable to fence a
>>>> host, it will indefinitely attempt to fence the host (the host state
>>>> will be stuck at fencing state in cloud.ha_config table for example)
>>>> and alerts will be sent to admin who can do some manual intervention
>>>> to handle such situations (if you've email/smtp enabled, you should
>>>> see alert emails).
>>>> 
>>>> 
>>>> We can discuss how to improve and have a workaround for the case
>>>> you've hit, thanks for sharing.
>>>> 
>>>> 
>>>> - Rohit
>>>> 
>>>> ________________________________
>>>> From: Nux! <nu...@li.nux.ro>
>>>> Sent: Tuesday, January 16, 2018 10:42:35 PM
>>>> To: dev
>>>> Subject: Re: HA issues
>>>> 
>>>> Ok, reinstalled and re-tested.
>>>> 
>>>> What I've learned:
>>>> 
>>>> - HA only works now if OOB is configured, the old way HA no longer
>>>> applies - this can be good and bad, not everyone has IPMIs
>>>> 
>>>> - HA only works if IPMI is reachable. I've pulled the cord on a HV
>>>> and HA failed to do its thing, leaving me with a HV down along with
>>>> all the VMs running there. That's bad.
>>>> I've opened this ticket for it:
>>>> https://issues.apache.org/jira/browse/CLOUDSTACK-10234
>>>> 
>>>> Let me know if you need any extra info or stuff to test.
>>>> 
>>>> Regards,
>>>> Lucian
>>>> 
>>>> --
>>>> Sent from the Delta quadrant using Borg technology!
>>>> 
>>>> Nux!
>>>> www.nux.ro
>>>> 
>>>> 
>>>> rohit.yadav@shapeblue.com
>>>> www.shapeblue.com
>>>> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK @shapeblue
>>>>  
>>>> 
>>>> 
>>>> ----- Original Message -----
>>>>> From: "Nux!" <nu...@li.nux.ro>
>>>>> To: "dev" <de...@cloudstack.apache.org>
>>>>> Sent: Tuesday, 16 January, 2018 11:35:58
>>>>> Subject: Re: HA issues
>>>> 
>>>>> I'll reinstall my setup and try again, just to be sure I'm working
>>>>> on a clean slate.
>>>>>
>>>>> --
>>>>> Sent from the Delta quadrant using Borg technology!
>>>>>
>>>>> Nux!
>>>>> www.nux.ro
>>>>>
>>>>> ----- Original Message -----
>>>>>> From: "Rohit Yadav" <ro...@shapeblue.com>
>>>>>> To: "dev" <de...@cloudstack.apache.org>
>>>>>> Sent: Tuesday, 16 January, 2018 11:29:51
>>>>>> Subject: Re: HA issues
>>>>>
>>>>>> Hi Lucian,
>>>>>>
>>>>>>
>>>>>> If you're talking about the new HostHA feature (with
>>>>>> KVM+nfs+ipmi), please refer to following docs:
>>>>>>
>>>>>> http://docs.cloudstack.apache.org/projects/cloudstack-administrati
>>>>>> o n /en/latest/hosts.html#out-of-band-management
>>>>>>
>>>>>> https://cwiki.apache.org/confluence/display/CLOUDSTACK/Host+HA
>>>>>>
>>>>>>
>>>>>> We'll need to you look at logs perhaps create a JIRA ticket with
>>>>>> the logs and details? If you saw ipmi based reboot, then host-ha
>>>>>> indeed tried to recover i.e. reboot the host, once hostha has done
>>>>>> its work it would schedule HA for VM as soon as the recovery
>>>>>> operation succeeds (we've simulator and kvm based marvin tests for
>>>>>> such scenarios).
>>>>>>
>>>>>>
>>>>>> Can you see it making attempt to schedule VM ha in logs, or any failure?
>>>>>>
>>>>>>
>>>>>> - Rohit
>>>>>>
>>>>>> <https://cloudstack.apache.org>
>>>>>>
>>>>>>
>>>>>>
>>>>>> ________________________________
>>>>>> From: Nux! <nu...@li.nux.ro>
>>>>>> Sent: Tuesday, January 16, 2018 12:47:56 AM
>>>>>> To: dev
>>>>>> Subject: [4.11] HA issues
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I see there's a new HA engine for KVM and IPMI support which is
>>>>>> really nice, however it seems hit and miss.
>>>>>> I have created an instance with HA offering, kernel panicked one
>>>>>> of the hypervisors - after a while the server was rebooted via
>>>>>> IPMI probably, but the instance never moved to a running
>>>>>> hypervisor and even after the original hypervisor came back it was still left in
>>>>>> Stopped state.
>>>>>> Is there any extra things I need to set up to have proper HA?
>>>>>>
>>>>>> Regards,
>>>>>> Lucian
>>>>>>
>>>>>> --
>>>>>> Sent from the Delta quadrant using Borg technology!
>>>>>>
>>>>>> Nux!
>>>>>> www.nux.ro
>>>>>>
>>>>>> rohit.yadav@shapeblue.com
>>>>>> www.shapeblue.com<http://www.shapeblue.com>
>>>>>> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK
> > > > > > @shapeblue

RE: HA issues

Posted by Sean Lair <sl...@ippathways.com>.
We've done a lot of work on VM HA (we are on 4.9.3) and have it working reliably.  We've also been able stop the problem of VMs getting started on two hosts during some HA events.  Since this is 4.9.3, we do not use IPMI for this functionality.  We have not testing how the addition of IPMI in 4.11 affect our patch.

We are running KVM w/ NFS storage.  If you like I can get you our patch for testing.  



-----Original Message-----
From: Nux! [mailto:nux@li.nux.ro] 
Sent: Monday, January 22, 2018 8:15 AM
To: dev <de...@cloudstack.apache.org>
Subject: Re: HA issues

Hi,

Installed and reinstalled, VM HA just does not work for me.
In addition, if the HV going AWOL is hosting the systemvms, then they also do not get restarted despite available HVs online.
I've opened another ticket with logs:

https://issues.apache.org/jira/browse/CLOUDSTACK-10246

Happy to allow access to my rig if it helps.

I've disabled firewall and whatnot also left out other bits of network hardware just to keep it simpler, still no go.

--
Sent from the Delta quadrant using Borg technology!

Nux!
www.nux.ro

----- Original Message -----
> From: "Paul Angus" <pa...@shapeblue.com>
> To: "dev" <de...@cloudstack.apache.org>
> Sent: Saturday, 20 January, 2018 08:40:01
> Subject: RE: HA issues

> No problem,
> 
> To be honest host-ha was developed *because* vm-ha was not reliable 
> under a number of conditions, including a host failure.
> 
> paul.angus@shapeblue.com
> www.shapeblue.com
> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK @shapeblue
>  
> 
> 
> 
> -----Original Message-----
> From: Nux! [mailto:nux@li.nux.ro]
> Sent: 19 January 2018 14:26
> To: dev <de...@cloudstack.apache.org>
> Subject: Re: HA issues
> 
> Hi Paul,
> 
> Thanks for checking. My compute offering is HA enabled, of course.
> Host HA is disabled as well as OOBM.
> 
> 
> I'll do the tests again on Monday and report back.
> 
> --
> Sent from the Delta quadrant using Borg technology!
> 
> Nux!
> www.nux.ro
> 
> ----- Original Message -----
>> From: "Paul Angus" <pa...@shapeblue.com>
>> To: "dev" <de...@cloudstack.apache.org>
>> Sent: Friday, 19 January, 2018 14:10:06
>> Subject: RE: HA issues
> 
>> Hey Nux,
>> 
>> I've being testing out the host-ha feature against a couple of physical hosts.
>> I've found that if the compute offering isn't ha enabled, then the vm isn't
>> restarted on the original host when it is rebooted, or any other host.    If
>> the vm is ha-enabled, then the vm was restarted on the original host 
>> when host ha restarted the host.
>> 
>> Can you double check that the instance was an ha-enabled one?
>> 
>> OR
>> maybe the timeouts for the host-ha are too long and the vm-ha 
>> timed-out before hand ...?
>> 
>> 
>> 
>> Kind regards,
>> 
>> Paul Angus
>> 
>> paul.angus@shapeblue.com
>> www.shapeblue.com
>> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK @shapeblue
>>  
>> 
>> 
>> 
>> -----Original Message-----
>> From: Nux! [mailto:nux@li.nux.ro]
>> Sent: 17 January 2018 09:12
>> To: dev <de...@cloudstack.apache.org>
>> Subject: Re: HA issues
>> 
>> Right, sorry for using the terms interchangeably, I see what you mean.
>> 
>> I'll do further testing then as VM HA was also not working in my setup.
>> 
>> I'll be back.
>> 
>> --
>> Sent from the Delta quadrant using Borg technology!
>> 
>> Nux!
>> www.nux.ro
>> 
>> ----- Original Message -----
>>> From: "Rohit Yadav" <ro...@shapeblue.com>
>>> To: "dev" <de...@cloudstack.apache.org>
>>> Sent: Wednesday, 17 January, 2018 09:09:19
>>> Subject: Re: HA issues
>> 
>>> Hi Lucian,
>>> 
>>> 
>>> The "Host HA" feature is entirely different from VM HA, however, 
>>> they may work in tandem, so please stop using the terms 
>>> interchangeably as it may cause the community to believe a regression has been caused.
>>> 
>>> 
>>> The "Host HA" feature currently ships with only "Host HA" provider 
>>> for KVM that is strictly tied to out-of-band management (IPMI for 
>>> fencing, i.e power off and recovery, i.e. reboot) and NFS (as primary storage).
>>> (We also have a provider for simulator, but that's for 
>>> coverage/testing purposes).
>>> 
>>> 
>>> Therefore, "Host HA" for KVM (+nfs) currently works only when OOBM is enabled.
>>> The frameowkr allows interested parties may write their own HA 
>>> providers for a hypervisor that can use a different 
>>> strategy/mechanism for fencing/recovery of hosts (including write a 
>>> non-IPMI based OOBM
>>> plugin) and host/disk activity checker that is non-NFS based.
>>> 
>>> 
>>> The "Host HA" feature ships disabled by default and does not cause 
>>> any interference with VM HA. However, when enabled and configured 
>>> correctly, it is a known limitation that when it is unable to 
>>> successfully perform recovery or fencing tasks it may not trigger VM 
>>> HA. We can discuss how to handle such cases (thoughts?). "Host HA"
>>> would try couple of times to recover and failing to do so, it would 
>>> eventually trigger a host fencing task. If it's unable to fence a 
>>> host, it will indefinitely attempt to fence the host (the host state 
>>> will be stuck at fencing state in cloud.ha_config table for example) 
>>> and alerts will be sent to admin who can do some manual intervention 
>>> to handle such situations (if you've email/smtp enabled, you should 
>>> see alert emails).
>>> 
>>> 
>>> We can discuss how to improve and have a workaround for the case 
>>> you've hit, thanks for sharing.
>>> 
>>> 
>>> - Rohit
>>> 
>>> ________________________________
>>> From: Nux! <nu...@li.nux.ro>
>>> Sent: Tuesday, January 16, 2018 10:42:35 PM
>>> To: dev
>>> Subject: Re: HA issues
>>> 
>>> Ok, reinstalled and re-tested.
>>> 
>>> What I've learned:
>>> 
>>> - HA only works now if OOB is configured, the old way HA no longer 
>>> applies - this can be good and bad, not everyone has IPMIs
>>> 
>>> - HA only works if IPMI is reachable. I've pulled the cord on a HV 
>>> and HA failed to do its thing, leaving me with a HV down along with 
>>> all the VMs running there. That's bad.
>>> I've opened this ticket for it:
>>> https://issues.apache.org/jira/browse/CLOUDSTACK-10234
>>> 
>>> Let me know if you need any extra info or stuff to test.
>>> 
>>> Regards,
>>> Lucian
>>> 
>>> --
>>> Sent from the Delta quadrant using Borg technology!
>>> 
>>> Nux!
>>> www.nux.ro
>>> 
>>> 
>>> rohit.yadav@shapeblue.com
>>> www.shapeblue.com
>>> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK @shapeblue
>>>  
>>> 
>>> 
>>> ----- Original Message -----
>>>> From: "Nux!" <nu...@li.nux.ro>
>>>> To: "dev" <de...@cloudstack.apache.org>
>>>> Sent: Tuesday, 16 January, 2018 11:35:58
>>>> Subject: Re: HA issues
>>> 
>>>> I'll reinstall my setup and try again, just to be sure I'm working 
>>>> on a clean slate.
>>>>
>>>> --
>>>> Sent from the Delta quadrant using Borg technology!
>>>>
>>>> Nux!
>>>> www.nux.ro
>>>>
>>>> ----- Original Message -----
>>>>> From: "Rohit Yadav" <ro...@shapeblue.com>
>>>>> To: "dev" <de...@cloudstack.apache.org>
>>>>> Sent: Tuesday, 16 January, 2018 11:29:51
>>>>> Subject: Re: HA issues
>>>>
>>>>> Hi Lucian,
>>>>>
>>>>>
>>>>> If you're talking about the new HostHA feature (with 
>>>>> KVM+nfs+ipmi), please refer to following docs:
>>>>>
>>>>> http://docs.cloudstack.apache.org/projects/cloudstack-administrati
>>>>> o n /en/latest/hosts.html#out-of-band-management
>>>>>
>>>>> https://cwiki.apache.org/confluence/display/CLOUDSTACK/Host+HA
>>>>>
>>>>>
>>>>> We'll need to you look at logs perhaps create a JIRA ticket with 
>>>>> the logs and details? If you saw ipmi based reboot, then host-ha 
>>>>> indeed tried to recover i.e. reboot the host, once hostha has done 
>>>>> its work it would schedule HA for VM as soon as the recovery 
>>>>> operation succeeds (we've simulator and kvm based marvin tests for 
>>>>> such scenarios).
>>>>>
>>>>>
>>>>> Can you see it making attempt to schedule VM ha in logs, or any failure?
>>>>>
>>>>>
>>>>> - Rohit
>>>>>
>>>>> <https://cloudstack.apache.org>
>>>>>
>>>>>
>>>>>
>>>>> ________________________________
>>>>> From: Nux! <nu...@li.nux.ro>
>>>>> Sent: Tuesday, January 16, 2018 12:47:56 AM
>>>>> To: dev
>>>>> Subject: [4.11] HA issues
>>>>>
>>>>> Hi,
>>>>>
>>>>> I see there's a new HA engine for KVM and IPMI support which is 
>>>>> really nice, however it seems hit and miss.
>>>>> I have created an instance with HA offering, kernel panicked one 
>>>>> of the hypervisors - after a while the server was rebooted via 
>>>>> IPMI probably, but the instance never moved to a running 
>>>>> hypervisor and even after the original hypervisor came back it was still left in Stopped state.
>>>>> Is there any extra things I need to set up to have proper HA?
>>>>>
>>>>> Regards,
>>>>> Lucian
>>>>>
>>>>> --
>>>>> Sent from the Delta quadrant using Borg technology!
>>>>>
>>>>> Nux!
>>>>> www.nux.ro
>>>>>
>>>>> rohit.yadav@shapeblue.com
>>>>> www.shapeblue.com<http://www.shapeblue.com>
>>>>> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK
> > > > > @shapeblue

RE: HA issues

Posted by Sean Lair <sl...@ippathways.com>.
FYI Nux, I opened the following PR for the change we made in our environment to get VM HA to work.  I referenced your ticket!

https://github.com/apache/cloudstack/pull/2474


-----Original Message-----
From: Nux! [mailto:nux@li.nux.ro] 
Sent: Monday, January 22, 2018 8:15 AM
To: dev <de...@cloudstack.apache.org>
Subject: Re: HA issues

Hi,

Installed and reinstalled, VM HA just does not work for me.
In addition, if the HV going AWOL is hosting the systemvms, then they also do not get restarted despite available HVs online.
I've opened another ticket with logs:

https://issues.apache.org/jira/browse/CLOUDSTACK-10246

Happy to allow access to my rig if it helps.

I've disabled firewall and whatnot also left out other bits of network hardware just to keep it simpler, still no go.

--
Sent from the Delta quadrant using Borg technology!

Nux!
www.nux.ro

----- Original Message -----
> From: "Paul Angus" <pa...@shapeblue.com>
> To: "dev" <de...@cloudstack.apache.org>
> Sent: Saturday, 20 January, 2018 08:40:01
> Subject: RE: HA issues

> No problem,
> 
> To be honest host-ha was developed *because* vm-ha was not reliable 
> under a number of conditions, including a host failure.
> 
> paul.angus@shapeblue.com
> www.shapeblue.com
> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK @shapeblue
>  
> 
> 
> 
> -----Original Message-----
> From: Nux! [mailto:nux@li.nux.ro]
> Sent: 19 January 2018 14:26
> To: dev <de...@cloudstack.apache.org>
> Subject: Re: HA issues
> 
> Hi Paul,
> 
> Thanks for checking. My compute offering is HA enabled, of course.
> Host HA is disabled as well as OOBM.
> 
> 
> I'll do the tests again on Monday and report back.
> 
> --
> Sent from the Delta quadrant using Borg technology!
> 
> Nux!
> www.nux.ro
> 
> ----- Original Message -----
>> From: "Paul Angus" <pa...@shapeblue.com>
>> To: "dev" <de...@cloudstack.apache.org>
>> Sent: Friday, 19 January, 2018 14:10:06
>> Subject: RE: HA issues
> 
>> Hey Nux,
>> 
>> I've being testing out the host-ha feature against a couple of physical hosts.
>> I've found that if the compute offering isn't ha enabled, then the vm isn't
>> restarted on the original host when it is rebooted, or any other host.    If
>> the vm is ha-enabled, then the vm was restarted on the original host 
>> when host ha restarted the host.
>> 
>> Can you double check that the instance was an ha-enabled one?
>> 
>> OR
>> maybe the timeouts for the host-ha are too long and the vm-ha 
>> timed-out before hand ...?
>> 
>> 
>> 
>> Kind regards,
>> 
>> Paul Angus
>> 
>> paul.angus@shapeblue.com
>> www.shapeblue.com
>> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK @shapeblue
>>  
>> 
>> 
>> 
>> -----Original Message-----
>> From: Nux! [mailto:nux@li.nux.ro]
>> Sent: 17 January 2018 09:12
>> To: dev <de...@cloudstack.apache.org>
>> Subject: Re: HA issues
>> 
>> Right, sorry for using the terms interchangeably, I see what you mean.
>> 
>> I'll do further testing then as VM HA was also not working in my setup.
>> 
>> I'll be back.
>> 
>> --
>> Sent from the Delta quadrant using Borg technology!
>> 
>> Nux!
>> www.nux.ro
>> 
>> ----- Original Message -----
>>> From: "Rohit Yadav" <ro...@shapeblue.com>
>>> To: "dev" <de...@cloudstack.apache.org>
>>> Sent: Wednesday, 17 January, 2018 09:09:19
>>> Subject: Re: HA issues
>> 
>>> Hi Lucian,
>>> 
>>> 
>>> The "Host HA" feature is entirely different from VM HA, however, 
>>> they may work in tandem, so please stop using the terms 
>>> interchangeably as it may cause the community to believe a regression has been caused.
>>> 
>>> 
>>> The "Host HA" feature currently ships with only "Host HA" provider 
>>> for KVM that is strictly tied to out-of-band management (IPMI for 
>>> fencing, i.e power off and recovery, i.e. reboot) and NFS (as primary storage).
>>> (We also have a provider for simulator, but that's for 
>>> coverage/testing purposes).
>>> 
>>> 
>>> Therefore, "Host HA" for KVM (+nfs) currently works only when OOBM is enabled.
>>> The frameowkr allows interested parties may write their own HA 
>>> providers for a hypervisor that can use a different 
>>> strategy/mechanism for fencing/recovery of hosts (including write a 
>>> non-IPMI based OOBM
>>> plugin) and host/disk activity checker that is non-NFS based.
>>> 
>>> 
>>> The "Host HA" feature ships disabled by default and does not cause 
>>> any interference with VM HA. However, when enabled and configured 
>>> correctly, it is a known limitation that when it is unable to 
>>> successfully perform recovery or fencing tasks it may not trigger VM 
>>> HA. We can discuss how to handle such cases (thoughts?). "Host HA"
>>> would try couple of times to recover and failing to do so, it would 
>>> eventually trigger a host fencing task. If it's unable to fence a 
>>> host, it will indefinitely attempt to fence the host (the host state 
>>> will be stuck at fencing state in cloud.ha_config table for example) 
>>> and alerts will be sent to admin who can do some manual intervention 
>>> to handle such situations (if you've email/smtp enabled, you should 
>>> see alert emails).
>>> 
>>> 
>>> We can discuss how to improve and have a workaround for the case 
>>> you've hit, thanks for sharing.
>>> 
>>> 
>>> - Rohit
>>> 
>>> ________________________________
>>> From: Nux! <nu...@li.nux.ro>
>>> Sent: Tuesday, January 16, 2018 10:42:35 PM
>>> To: dev
>>> Subject: Re: HA issues
>>> 
>>> Ok, reinstalled and re-tested.
>>> 
>>> What I've learned:
>>> 
>>> - HA only works now if OOB is configured, the old way HA no longer 
>>> applies - this can be good and bad, not everyone has IPMIs
>>> 
>>> - HA only works if IPMI is reachable. I've pulled the cord on a HV 
>>> and HA failed to do its thing, leaving me with a HV down along with 
>>> all the VMs running there. That's bad.
>>> I've opened this ticket for it:
>>> https://issues.apache.org/jira/browse/CLOUDSTACK-10234
>>> 
>>> Let me know if you need any extra info or stuff to test.
>>> 
>>> Regards,
>>> Lucian
>>> 
>>> --
>>> Sent from the Delta quadrant using Borg technology!
>>> 
>>> Nux!
>>> www.nux.ro
>>> 
>>> 
>>> rohit.yadav@shapeblue.com
>>> www.shapeblue.com
>>> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK @shapeblue
>>>  
>>> 
>>> 
>>> ----- Original Message -----
>>>> From: "Nux!" <nu...@li.nux.ro>
>>>> To: "dev" <de...@cloudstack.apache.org>
>>>> Sent: Tuesday, 16 January, 2018 11:35:58
>>>> Subject: Re: HA issues
>>> 
>>>> I'll reinstall my setup and try again, just to be sure I'm working 
>>>> on a clean slate.
>>>>
>>>> --
>>>> Sent from the Delta quadrant using Borg technology!
>>>>
>>>> Nux!
>>>> www.nux.ro
>>>>
>>>> ----- Original Message -----
>>>>> From: "Rohit Yadav" <ro...@shapeblue.com>
>>>>> To: "dev" <de...@cloudstack.apache.org>
>>>>> Sent: Tuesday, 16 January, 2018 11:29:51
>>>>> Subject: Re: HA issues
>>>>
>>>>> Hi Lucian,
>>>>>
>>>>>
>>>>> If you're talking about the new HostHA feature (with 
>>>>> KVM+nfs+ipmi), please refer to following docs:
>>>>>
>>>>> http://docs.cloudstack.apache.org/projects/cloudstack-administrati
>>>>> o n /en/latest/hosts.html#out-of-band-management
>>>>>
>>>>> https://cwiki.apache.org/confluence/display/CLOUDSTACK/Host+HA
>>>>>
>>>>>
>>>>> We'll need to you look at logs perhaps create a JIRA ticket with 
>>>>> the logs and details? If you saw ipmi based reboot, then host-ha 
>>>>> indeed tried to recover i.e. reboot the host, once hostha has done 
>>>>> its work it would schedule HA for VM as soon as the recovery 
>>>>> operation succeeds (we've simulator and kvm based marvin tests for 
>>>>> such scenarios).
>>>>>
>>>>>
>>>>> Can you see it making attempt to schedule VM ha in logs, or any failure?
>>>>>
>>>>>
>>>>> - Rohit
>>>>>
>>>>> <https://cloudstack.apache.org>
>>>>>
>>>>>
>>>>>
>>>>> ________________________________
>>>>> From: Nux! <nu...@li.nux.ro>
>>>>> Sent: Tuesday, January 16, 2018 12:47:56 AM
>>>>> To: dev
>>>>> Subject: [4.11] HA issues
>>>>>
>>>>> Hi,
>>>>>
>>>>> I see there's a new HA engine for KVM and IPMI support which is 
>>>>> really nice, however it seems hit and miss.
>>>>> I have created an instance with HA offering, kernel panicked one 
>>>>> of the hypervisors - after a while the server was rebooted via 
>>>>> IPMI probably, but the instance never moved to a running 
>>>>> hypervisor and even after the original hypervisor came back it was still left in Stopped state.
>>>>> Is there any extra things I need to set up to have proper HA?
>>>>>
>>>>> Regards,
>>>>> Lucian
>>>>>
>>>>> --
>>>>> Sent from the Delta quadrant using Borg technology!
>>>>>
>>>>> Nux!
>>>>> www.nux.ro
>>>>>
>>>>> rohit.yadav@shapeblue.com
>>>>> www.shapeblue.com<http://www.shapeblue.com>
>>>>> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK
> > > > > @shapeblue

Re: HA issues

Posted by Nux! <nu...@li.nux.ro>.
Hi,

Installed and reinstalled, VM HA just does not work for me.
In addition, if the HV going AWOL is hosting the systemvms, then they also do not get restarted despite available HVs online.
I've opened another ticket with logs:

https://issues.apache.org/jira/browse/CLOUDSTACK-10246

Happy to allow access to my rig if it helps.

I've disabled firewall and whatnot also left out other bits of network hardware just to keep it simpler, still no go.

--
Sent from the Delta quadrant using Borg technology!

Nux!
www.nux.ro

----- Original Message -----
> From: "Paul Angus" <pa...@shapeblue.com>
> To: "dev" <de...@cloudstack.apache.org>
> Sent: Saturday, 20 January, 2018 08:40:01
> Subject: RE: HA issues

> No problem,
> 
> To be honest host-ha was developed *because* vm-ha was not reliable under a
> number of conditions, including a host failure.
> 
> paul.angus@shapeblue.com
> www.shapeblue.com
> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK
> @shapeblue
>  
> 
> 
> 
> -----Original Message-----
> From: Nux! [mailto:nux@li.nux.ro]
> Sent: 19 January 2018 14:26
> To: dev <de...@cloudstack.apache.org>
> Subject: Re: HA issues
> 
> Hi Paul,
> 
> Thanks for checking. My compute offering is HA enabled, of course.
> Host HA is disabled as well as OOBM.
> 
> 
> I'll do the tests again on Monday and report back.
> 
> --
> Sent from the Delta quadrant using Borg technology!
> 
> Nux!
> www.nux.ro
> 
> ----- Original Message -----
>> From: "Paul Angus" <pa...@shapeblue.com>
>> To: "dev" <de...@cloudstack.apache.org>
>> Sent: Friday, 19 January, 2018 14:10:06
>> Subject: RE: HA issues
> 
>> Hey Nux,
>> 
>> I've being testing out the host-ha feature against a couple of physical hosts.
>> I've found that if the compute offering isn't ha enabled, then the vm isn't
>> restarted on the original host when it is rebooted, or any other host.    If
>> the vm is ha-enabled, then the vm was restarted on the original host
>> when host ha restarted the host.
>> 
>> Can you double check that the instance was an ha-enabled one?
>> 
>> OR
>> maybe the timeouts for the host-ha are too long and the vm-ha
>> timed-out before hand ...?
>> 
>> 
>> 
>> Kind regards,
>> 
>> Paul Angus
>> 
>> paul.angus@shapeblue.com
>> www.shapeblue.com
>> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK @shapeblue
>>  
>> 
>> 
>> 
>> -----Original Message-----
>> From: Nux! [mailto:nux@li.nux.ro]
>> Sent: 17 January 2018 09:12
>> To: dev <de...@cloudstack.apache.org>
>> Subject: Re: HA issues
>> 
>> Right, sorry for using the terms interchangeably, I see what you mean.
>> 
>> I'll do further testing then as VM HA was also not working in my setup.
>> 
>> I'll be back.
>> 
>> --
>> Sent from the Delta quadrant using Borg technology!
>> 
>> Nux!
>> www.nux.ro
>> 
>> ----- Original Message -----
>>> From: "Rohit Yadav" <ro...@shapeblue.com>
>>> To: "dev" <de...@cloudstack.apache.org>
>>> Sent: Wednesday, 17 January, 2018 09:09:19
>>> Subject: Re: HA issues
>> 
>>> Hi Lucian,
>>> 
>>> 
>>> The "Host HA" feature is entirely different from VM HA, however, they
>>> may work in tandem, so please stop using the terms interchangeably as
>>> it may cause the community to believe a regression has been caused.
>>> 
>>> 
>>> The "Host HA" feature currently ships with only "Host HA" provider
>>> for KVM that is strictly tied to out-of-band management (IPMI for
>>> fencing, i.e power off and recovery, i.e. reboot) and NFS (as primary storage).
>>> (We also have a provider for simulator, but that's for
>>> coverage/testing purposes).
>>> 
>>> 
>>> Therefore, "Host HA" for KVM (+nfs) currently works only when OOBM is enabled.
>>> The frameowkr allows interested parties may write their own HA
>>> providers for a hypervisor that can use a different
>>> strategy/mechanism for fencing/recovery of hosts (including write a
>>> non-IPMI based OOBM
>>> plugin) and host/disk activity checker that is non-NFS based.
>>> 
>>> 
>>> The "Host HA" feature ships disabled by default and does not cause
>>> any interference with VM HA. However, when enabled and configured
>>> correctly, it is a known limitation that when it is unable to
>>> successfully perform recovery or fencing tasks it may not trigger VM
>>> HA. We can discuss how to handle such cases (thoughts?). "Host HA"
>>> would try couple of times to recover and failing to do so, it would
>>> eventually trigger a host fencing task. If it's unable to fence a
>>> host, it will indefinitely attempt to fence the host (the host state
>>> will be stuck at fencing state in cloud.ha_config table for example)
>>> and alerts will be sent to admin who can do some manual intervention
>>> to handle such situations (if you've email/smtp enabled, you should see alert
>>> emails).
>>> 
>>> 
>>> We can discuss how to improve and have a workaround for the case
>>> you've hit, thanks for sharing.
>>> 
>>> 
>>> - Rohit
>>> 
>>> ________________________________
>>> From: Nux! <nu...@li.nux.ro>
>>> Sent: Tuesday, January 16, 2018 10:42:35 PM
>>> To: dev
>>> Subject: Re: HA issues
>>> 
>>> Ok, reinstalled and re-tested.
>>> 
>>> What I've learned:
>>> 
>>> - HA only works now if OOB is configured, the old way HA no longer
>>> applies - this can be good and bad, not everyone has IPMIs
>>> 
>>> - HA only works if IPMI is reachable. I've pulled the cord on a HV
>>> and HA failed to do its thing, leaving me with a HV down along with
>>> all the VMs running there. That's bad.
>>> I've opened this ticket for it:
>>> https://issues.apache.org/jira/browse/CLOUDSTACK-10234
>>> 
>>> Let me know if you need any extra info or stuff to test.
>>> 
>>> Regards,
>>> Lucian
>>> 
>>> --
>>> Sent from the Delta quadrant using Borg technology!
>>> 
>>> Nux!
>>> www.nux.ro
>>> 
>>> 
>>> rohit.yadav@shapeblue.com
>>> www.shapeblue.com
>>> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK @shapeblue
>>>  
>>> 
>>> 
>>> ----- Original Message -----
>>>> From: "Nux!" <nu...@li.nux.ro>
>>>> To: "dev" <de...@cloudstack.apache.org>
>>>> Sent: Tuesday, 16 January, 2018 11:35:58
>>>> Subject: Re: HA issues
>>> 
>>>> I'll reinstall my setup and try again, just to be sure I'm working
>>>> on a clean slate.
>>>>
>>>> --
>>>> Sent from the Delta quadrant using Borg technology!
>>>>
>>>> Nux!
>>>> www.nux.ro
>>>>
>>>> ----- Original Message -----
>>>>> From: "Rohit Yadav" <ro...@shapeblue.com>
>>>>> To: "dev" <de...@cloudstack.apache.org>
>>>>> Sent: Tuesday, 16 January, 2018 11:29:51
>>>>> Subject: Re: HA issues
>>>>
>>>>> Hi Lucian,
>>>>>
>>>>>
>>>>> If you're talking about the new HostHA feature (with KVM+nfs+ipmi),
>>>>> please refer to following docs:
>>>>>
>>>>> http://docs.cloudstack.apache.org/projects/cloudstack-administratio
>>>>> n /en/latest/hosts.html#out-of-band-management
>>>>>
>>>>> https://cwiki.apache.org/confluence/display/CLOUDSTACK/Host+HA
>>>>>
>>>>>
>>>>> We'll need to you look at logs perhaps create a JIRA ticket with
>>>>> the logs and details? If you saw ipmi based reboot, then host-ha
>>>>> indeed tried to recover i.e. reboot the host, once hostha has done
>>>>> its work it would schedule HA for VM as soon as the recovery
>>>>> operation succeeds (we've simulator and kvm based marvin tests for such
>>>>> scenarios).
>>>>>
>>>>>
>>>>> Can you see it making attempt to schedule VM ha in logs, or any failure?
>>>>>
>>>>>
>>>>> - Rohit
>>>>>
>>>>> <https://cloudstack.apache.org>
>>>>>
>>>>>
>>>>>
>>>>> ________________________________
>>>>> From: Nux! <nu...@li.nux.ro>
>>>>> Sent: Tuesday, January 16, 2018 12:47:56 AM
>>>>> To: dev
>>>>> Subject: [4.11] HA issues
>>>>>
>>>>> Hi,
>>>>>
>>>>> I see there's a new HA engine for KVM and IPMI support which is
>>>>> really nice, however it seems hit and miss.
>>>>> I have created an instance with HA offering, kernel panicked one of
>>>>> the hypervisors - after a while the server was rebooted via IPMI
>>>>> probably, but the instance never moved to a running hypervisor and
>>>>> even after the original hypervisor came back it was still left in Stopped state.
>>>>> Is there any extra things I need to set up to have proper HA?
>>>>>
>>>>> Regards,
>>>>> Lucian
>>>>>
>>>>> --
>>>>> Sent from the Delta quadrant using Borg technology!
>>>>>
>>>>> Nux!
>>>>> www.nux.ro
>>>>>
>>>>> rohit.yadav@shapeblue.com
>>>>> www.shapeblue.com<http://www.shapeblue.com>
>>>>> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK
> > > > > @shapeblue

RE: HA issues

Posted by Paul Angus <pa...@shapeblue.com>.
No problem,

To be honest host-ha was developed *because* vm-ha was not reliable under a number of conditions, including a host failure.

paul.angus@shapeblue.com 
www.shapeblue.com
53 Chandos Place, Covent Garden, London  WC2N 4HSUK
@shapeblue
  
 


-----Original Message-----
From: Nux! [mailto:nux@li.nux.ro] 
Sent: 19 January 2018 14:26
To: dev <de...@cloudstack.apache.org>
Subject: Re: HA issues

Hi Paul,

Thanks for checking. My compute offering is HA enabled, of course.
Host HA is disabled as well as OOBM.


I'll do the tests again on Monday and report back.

--
Sent from the Delta quadrant using Borg technology!

Nux!
www.nux.ro

----- Original Message -----
> From: "Paul Angus" <pa...@shapeblue.com>
> To: "dev" <de...@cloudstack.apache.org>
> Sent: Friday, 19 January, 2018 14:10:06
> Subject: RE: HA issues

> Hey Nux,
> 
> I've being testing out the host-ha feature against a couple of physical hosts.
> I've found that if the compute offering isn't ha enabled, then the vm isn't
> restarted on the original host when it is rebooted, or any other host.    If
> the vm is ha-enabled, then the vm was restarted on the original host 
> when host ha restarted the host.
> 
> Can you double check that the instance was an ha-enabled one?
> 
> OR
> maybe the timeouts for the host-ha are too long and the vm-ha 
> timed-out before hand ...?
> 
> 
> 
> Kind regards,
> 
> Paul Angus
> 
> paul.angus@shapeblue.com
> www.shapeblue.com
> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK @shapeblue
>  
> 
> 
> 
> -----Original Message-----
> From: Nux! [mailto:nux@li.nux.ro]
> Sent: 17 January 2018 09:12
> To: dev <de...@cloudstack.apache.org>
> Subject: Re: HA issues
> 
> Right, sorry for using the terms interchangeably, I see what you mean.
> 
> I'll do further testing then as VM HA was also not working in my setup.
> 
> I'll be back.
> 
> --
> Sent from the Delta quadrant using Borg technology!
> 
> Nux!
> www.nux.ro
> 
> ----- Original Message -----
>> From: "Rohit Yadav" <ro...@shapeblue.com>
>> To: "dev" <de...@cloudstack.apache.org>
>> Sent: Wednesday, 17 January, 2018 09:09:19
>> Subject: Re: HA issues
> 
>> Hi Lucian,
>> 
>> 
>> The "Host HA" feature is entirely different from VM HA, however, they 
>> may work in tandem, so please stop using the terms interchangeably as 
>> it may cause the community to believe a regression has been caused.
>> 
>> 
>> The "Host HA" feature currently ships with only "Host HA" provider 
>> for KVM that is strictly tied to out-of-band management (IPMI for 
>> fencing, i.e power off and recovery, i.e. reboot) and NFS (as primary storage).
>> (We also have a provider for simulator, but that's for 
>> coverage/testing purposes).
>> 
>> 
>> Therefore, "Host HA" for KVM (+nfs) currently works only when OOBM is enabled.
>> The frameowkr allows interested parties may write their own HA 
>> providers for a hypervisor that can use a different 
>> strategy/mechanism for fencing/recovery of hosts (including write a 
>> non-IPMI based OOBM
>> plugin) and host/disk activity checker that is non-NFS based.
>> 
>> 
>> The "Host HA" feature ships disabled by default and does not cause 
>> any interference with VM HA. However, when enabled and configured 
>> correctly, it is a known limitation that when it is unable to 
>> successfully perform recovery or fencing tasks it may not trigger VM 
>> HA. We can discuss how to handle such cases (thoughts?). "Host HA"
>> would try couple of times to recover and failing to do so, it would 
>> eventually trigger a host fencing task. If it's unable to fence a 
>> host, it will indefinitely attempt to fence the host (the host state 
>> will be stuck at fencing state in cloud.ha_config table for example) 
>> and alerts will be sent to admin who can do some manual intervention 
>> to handle such situations (if you've email/smtp enabled, you should see alert emails).
>> 
>> 
>> We can discuss how to improve and have a workaround for the case 
>> you've hit, thanks for sharing.
>> 
>> 
>> - Rohit
>> 
>> ________________________________
>> From: Nux! <nu...@li.nux.ro>
>> Sent: Tuesday, January 16, 2018 10:42:35 PM
>> To: dev
>> Subject: Re: HA issues
>> 
>> Ok, reinstalled and re-tested.
>> 
>> What I've learned:
>> 
>> - HA only works now if OOB is configured, the old way HA no longer 
>> applies - this can be good and bad, not everyone has IPMIs
>> 
>> - HA only works if IPMI is reachable. I've pulled the cord on a HV 
>> and HA failed to do its thing, leaving me with a HV down along with 
>> all the VMs running there. That's bad.
>> I've opened this ticket for it:
>> https://issues.apache.org/jira/browse/CLOUDSTACK-10234
>> 
>> Let me know if you need any extra info or stuff to test.
>> 
>> Regards,
>> Lucian
>> 
>> --
>> Sent from the Delta quadrant using Borg technology!
>> 
>> Nux!
>> www.nux.ro
>> 
>> 
>> rohit.yadav@shapeblue.com
>> www.shapeblue.com
>> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK @shapeblue
>>  
>> 
>> 
>> ----- Original Message -----
>>> From: "Nux!" <nu...@li.nux.ro>
>>> To: "dev" <de...@cloudstack.apache.org>
>>> Sent: Tuesday, 16 January, 2018 11:35:58
>>> Subject: Re: HA issues
>> 
>>> I'll reinstall my setup and try again, just to be sure I'm working 
>>> on a clean slate.
>>>
>>> --
>>> Sent from the Delta quadrant using Borg technology!
>>>
>>> Nux!
>>> www.nux.ro
>>>
>>> ----- Original Message -----
>>>> From: "Rohit Yadav" <ro...@shapeblue.com>
>>>> To: "dev" <de...@cloudstack.apache.org>
>>>> Sent: Tuesday, 16 January, 2018 11:29:51
>>>> Subject: Re: HA issues
>>>
>>>> Hi Lucian,
>>>>
>>>>
>>>> If you're talking about the new HostHA feature (with KVM+nfs+ipmi), 
>>>> please refer to following docs:
>>>>
>>>> http://docs.cloudstack.apache.org/projects/cloudstack-administratio
>>>> n /en/latest/hosts.html#out-of-band-management
>>>>
>>>> https://cwiki.apache.org/confluence/display/CLOUDSTACK/Host+HA
>>>>
>>>>
>>>> We'll need to you look at logs perhaps create a JIRA ticket with 
>>>> the logs and details? If you saw ipmi based reboot, then host-ha 
>>>> indeed tried to recover i.e. reboot the host, once hostha has done 
>>>> its work it would schedule HA for VM as soon as the recovery 
>>>> operation succeeds (we've simulator and kvm based marvin tests for such scenarios).
>>>>
>>>>
>>>> Can you see it making attempt to schedule VM ha in logs, or any failure?
>>>>
>>>>
>>>> - Rohit
>>>>
>>>> <https://cloudstack.apache.org>
>>>>
>>>>
>>>>
>>>> ________________________________
>>>> From: Nux! <nu...@li.nux.ro>
>>>> Sent: Tuesday, January 16, 2018 12:47:56 AM
>>>> To: dev
>>>> Subject: [4.11] HA issues
>>>>
>>>> Hi,
>>>>
>>>> I see there's a new HA engine for KVM and IPMI support which is 
>>>> really nice, however it seems hit and miss.
>>>> I have created an instance with HA offering, kernel panicked one of 
>>>> the hypervisors - after a while the server was rebooted via IPMI 
>>>> probably, but the instance never moved to a running hypervisor and 
>>>> even after the original hypervisor came back it was still left in Stopped state.
>>>> Is there any extra things I need to set up to have proper HA?
>>>>
>>>> Regards,
>>>> Lucian
>>>>
>>>> --
>>>> Sent from the Delta quadrant using Borg technology!
>>>>
>>>> Nux!
>>>> www.nux.ro
>>>>
>>>> rohit.yadav@shapeblue.com
>>>> www.shapeblue.com<http://www.shapeblue.com>
>>>> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK
> > > > @shapeblue

Re: HA issues

Posted by Nux! <nu...@li.nux.ro>.
Hi Paul,

Thanks for checking. My compute offering is HA enabled, of course.
Host HA is disabled as well as OOBM.


I'll do the tests again on Monday and report back.

--
Sent from the Delta quadrant using Borg technology!

Nux!
www.nux.ro

----- Original Message -----
> From: "Paul Angus" <pa...@shapeblue.com>
> To: "dev" <de...@cloudstack.apache.org>
> Sent: Friday, 19 January, 2018 14:10:06
> Subject: RE: HA issues

> Hey Nux,
> 
> I've being testing out the host-ha feature against a couple of physical hosts.
> I've found that if the compute offering isn't ha enabled, then the vm isn't
> restarted on the original host when it is rebooted, or any other host.    If
> the vm is ha-enabled, then the vm was restarted on the original host when host
> ha restarted the host.
> 
> Can you double check that the instance was an ha-enabled one?
> 
> OR
> maybe the timeouts for the host-ha are too long and the vm-ha timed-out before
> hand ...?
> 
> 
> 
> Kind regards,
> 
> Paul Angus
> 
> paul.angus@shapeblue.com
> www.shapeblue.com
> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK
> @shapeblue
>  
> 
> 
> 
> -----Original Message-----
> From: Nux! [mailto:nux@li.nux.ro]
> Sent: 17 January 2018 09:12
> To: dev <de...@cloudstack.apache.org>
> Subject: Re: HA issues
> 
> Right, sorry for using the terms interchangeably, I see what you mean.
> 
> I'll do further testing then as VM HA was also not working in my setup.
> 
> I'll be back.
> 
> --
> Sent from the Delta quadrant using Borg technology!
> 
> Nux!
> www.nux.ro
> 
> ----- Original Message -----
>> From: "Rohit Yadav" <ro...@shapeblue.com>
>> To: "dev" <de...@cloudstack.apache.org>
>> Sent: Wednesday, 17 January, 2018 09:09:19
>> Subject: Re: HA issues
> 
>> Hi Lucian,
>> 
>> 
>> The "Host HA" feature is entirely different from VM HA, however, they
>> may work in tandem, so please stop using the terms interchangeably as
>> it may cause the community to believe a regression has been caused.
>> 
>> 
>> The "Host HA" feature currently ships with only "Host HA" provider for
>> KVM that is strictly tied to out-of-band management (IPMI for fencing,
>> i.e power off and recovery, i.e. reboot) and NFS (as primary storage).
>> (We also have a provider for simulator, but that's for coverage/testing
>> purposes).
>> 
>> 
>> Therefore, "Host HA" for KVM (+nfs) currently works only when OOBM is enabled.
>> The frameowkr allows interested parties may write their own HA
>> providers for a hypervisor that can use a different strategy/mechanism
>> for fencing/recovery of hosts (including write a non-IPMI based OOBM
>> plugin) and host/disk activity checker that is non-NFS based.
>> 
>> 
>> The "Host HA" feature ships disabled by default and does not cause any
>> interference with VM HA. However, when enabled and configured
>> correctly, it is a known limitation that when it is unable to
>> successfully perform recovery or fencing tasks it may not trigger VM
>> HA. We can discuss how to handle such cases (thoughts?). "Host HA"
>> would try couple of times to recover and failing to do so, it would
>> eventually trigger a host fencing task. If it's unable to fence a
>> host, it will indefinitely attempt to fence the host (the host state
>> will be stuck at fencing state in cloud.ha_config table for example)
>> and alerts will be sent to admin who can do some manual intervention to handle
>> such situations (if you've email/smtp enabled, you should see alert emails).
>> 
>> 
>> We can discuss how to improve and have a workaround for the case
>> you've hit, thanks for sharing.
>> 
>> 
>> - Rohit
>> 
>> ________________________________
>> From: Nux! <nu...@li.nux.ro>
>> Sent: Tuesday, January 16, 2018 10:42:35 PM
>> To: dev
>> Subject: Re: HA issues
>> 
>> Ok, reinstalled and re-tested.
>> 
>> What I've learned:
>> 
>> - HA only works now if OOB is configured, the old way HA no longer
>> applies - this can be good and bad, not everyone has IPMIs
>> 
>> - HA only works if IPMI is reachable. I've pulled the cord on a HV and
>> HA failed to do its thing, leaving me with a HV down along with all
>> the VMs running there. That's bad.
>> I've opened this ticket for it:
>> https://issues.apache.org/jira/browse/CLOUDSTACK-10234
>> 
>> Let me know if you need any extra info or stuff to test.
>> 
>> Regards,
>> Lucian
>> 
>> --
>> Sent from the Delta quadrant using Borg technology!
>> 
>> Nux!
>> www.nux.ro
>> 
>> 
>> rohit.yadav@shapeblue.com
>> www.shapeblue.com
>> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK @shapeblue
>>  
>> 
>> 
>> ----- Original Message -----
>>> From: "Nux!" <nu...@li.nux.ro>
>>> To: "dev" <de...@cloudstack.apache.org>
>>> Sent: Tuesday, 16 January, 2018 11:35:58
>>> Subject: Re: HA issues
>> 
>>> I'll reinstall my setup and try again, just to be sure I'm working on
>>> a clean slate.
>>>
>>> --
>>> Sent from the Delta quadrant using Borg technology!
>>>
>>> Nux!
>>> www.nux.ro
>>>
>>> ----- Original Message -----
>>>> From: "Rohit Yadav" <ro...@shapeblue.com>
>>>> To: "dev" <de...@cloudstack.apache.org>
>>>> Sent: Tuesday, 16 January, 2018 11:29:51
>>>> Subject: Re: HA issues
>>>
>>>> Hi Lucian,
>>>>
>>>>
>>>> If you're talking about the new HostHA feature (with KVM+nfs+ipmi),
>>>> please refer to following docs:
>>>>
>>>> http://docs.cloudstack.apache.org/projects/cloudstack-administration
>>>> /en/latest/hosts.html#out-of-band-management
>>>>
>>>> https://cwiki.apache.org/confluence/display/CLOUDSTACK/Host+HA
>>>>
>>>>
>>>> We'll need to you look at logs perhaps create a JIRA ticket with the
>>>> logs and details? If you saw ipmi based reboot, then host-ha indeed
>>>> tried to recover i.e. reboot the host, once hostha has done its work
>>>> it would schedule HA for VM as soon as the recovery operation
>>>> succeeds (we've simulator and kvm based marvin tests for such scenarios).
>>>>
>>>>
>>>> Can you see it making attempt to schedule VM ha in logs, or any failure?
>>>>
>>>>
>>>> - Rohit
>>>>
>>>> <https://cloudstack.apache.org>
>>>>
>>>>
>>>>
>>>> ________________________________
>>>> From: Nux! <nu...@li.nux.ro>
>>>> Sent: Tuesday, January 16, 2018 12:47:56 AM
>>>> To: dev
>>>> Subject: [4.11] HA issues
>>>>
>>>> Hi,
>>>>
>>>> I see there's a new HA engine for KVM and IPMI support which is
>>>> really nice, however it seems hit and miss.
>>>> I have created an instance with HA offering, kernel panicked one of
>>>> the hypervisors - after a while the server was rebooted via IPMI
>>>> probably, but the instance never moved to a running hypervisor and
>>>> even after the original hypervisor came back it was still left in Stopped state.
>>>> Is there any extra things I need to set up to have proper HA?
>>>>
>>>> Regards,
>>>> Lucian
>>>>
>>>> --
>>>> Sent from the Delta quadrant using Borg technology!
>>>>
>>>> Nux!
>>>> www.nux.ro
>>>>
>>>> rohit.yadav@shapeblue.com
>>>> www.shapeblue.com<http://www.shapeblue.com>
>>>> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK
> > > > @shapeblue

RE: HA issues

Posted by Paul Angus <pa...@shapeblue.com>.
Hey Nux,

I've being testing out the host-ha feature against a couple of physical hosts.  I've found that if the compute offering isn't ha enabled, then the vm isn't restarted on the original host when it is rebooted, or any other host.    If the vm is ha-enabled, then the vm was restarted on the original host when host ha restarted the host.

Can you double check that the instance was an ha-enabled one?

OR
maybe the timeouts for the host-ha are too long and the vm-ha timed-out before hand ...?



Kind regards,

Paul Angus

paul.angus@shapeblue.com 
www.shapeblue.com
53 Chandos Place, Covent Garden, London  WC2N 4HSUK
@shapeblue
  
 


-----Original Message-----
From: Nux! [mailto:nux@li.nux.ro] 
Sent: 17 January 2018 09:12
To: dev <de...@cloudstack.apache.org>
Subject: Re: HA issues

Right, sorry for using the terms interchangeably, I see what you mean.

I'll do further testing then as VM HA was also not working in my setup.

I'll be back.

--
Sent from the Delta quadrant using Borg technology!

Nux!
www.nux.ro

----- Original Message -----
> From: "Rohit Yadav" <ro...@shapeblue.com>
> To: "dev" <de...@cloudstack.apache.org>
> Sent: Wednesday, 17 January, 2018 09:09:19
> Subject: Re: HA issues

> Hi Lucian,
> 
> 
> The "Host HA" feature is entirely different from VM HA, however, they 
> may work in tandem, so please stop using the terms interchangeably as 
> it may cause the community to believe a regression has been caused.
> 
> 
> The "Host HA" feature currently ships with only "Host HA" provider for 
> KVM that is strictly tied to out-of-band management (IPMI for fencing, 
> i.e power off and recovery, i.e. reboot) and NFS (as primary storage). 
> (We also have a provider for simulator, but that's for coverage/testing purposes).
> 
> 
> Therefore, "Host HA" for KVM (+nfs) currently works only when OOBM is enabled.
> The frameowkr allows interested parties may write their own HA 
> providers for a hypervisor that can use a different strategy/mechanism 
> for fencing/recovery of hosts (including write a non-IPMI based OOBM 
> plugin) and host/disk activity checker that is non-NFS based.
> 
> 
> The "Host HA" feature ships disabled by default and does not cause any 
> interference with VM HA. However, when enabled and configured 
> correctly, it is a known limitation that when it is unable to 
> successfully perform recovery or fencing tasks it may not trigger VM 
> HA. We can discuss how to handle such cases (thoughts?). "Host HA" 
> would try couple of times to recover and failing to do so, it would 
> eventually trigger a host fencing task. If it's unable to fence a 
> host, it will indefinitely attempt to fence the host (the host state 
> will be stuck at fencing state in cloud.ha_config table for example) 
> and alerts will be sent to admin who can do some manual intervention to handle such situations (if you've email/smtp enabled, you should see alert emails).
> 
> 
> We can discuss how to improve and have a workaround for the case 
> you've hit, thanks for sharing.
> 
> 
> - Rohit
> 
> ________________________________
> From: Nux! <nu...@li.nux.ro>
> Sent: Tuesday, January 16, 2018 10:42:35 PM
> To: dev
> Subject: Re: HA issues
> 
> Ok, reinstalled and re-tested.
> 
> What I've learned:
> 
> - HA only works now if OOB is configured, the old way HA no longer 
> applies - this can be good and bad, not everyone has IPMIs
> 
> - HA only works if IPMI is reachable. I've pulled the cord on a HV and 
> HA failed to do its thing, leaving me with a HV down along with all 
> the VMs running there. That's bad.
> I've opened this ticket for it:
> https://issues.apache.org/jira/browse/CLOUDSTACK-10234
> 
> Let me know if you need any extra info or stuff to test.
> 
> Regards,
> Lucian
> 
> --
> Sent from the Delta quadrant using Borg technology!
> 
> Nux!
> www.nux.ro
> 
> 
> rohit.yadav@shapeblue.com
> www.shapeblue.com
> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK @shapeblue
>  
> 
> 
> ----- Original Message -----
>> From: "Nux!" <nu...@li.nux.ro>
>> To: "dev" <de...@cloudstack.apache.org>
>> Sent: Tuesday, 16 January, 2018 11:35:58
>> Subject: Re: HA issues
> 
>> I'll reinstall my setup and try again, just to be sure I'm working on 
>> a clean slate.
>>
>> --
>> Sent from the Delta quadrant using Borg technology!
>>
>> Nux!
>> www.nux.ro
>>
>> ----- Original Message -----
>>> From: "Rohit Yadav" <ro...@shapeblue.com>
>>> To: "dev" <de...@cloudstack.apache.org>
>>> Sent: Tuesday, 16 January, 2018 11:29:51
>>> Subject: Re: HA issues
>>
>>> Hi Lucian,
>>>
>>>
>>> If you're talking about the new HostHA feature (with KVM+nfs+ipmi), 
>>> please refer to following docs:
>>>
>>> http://docs.cloudstack.apache.org/projects/cloudstack-administration
>>> /en/latest/hosts.html#out-of-band-management
>>>
>>> https://cwiki.apache.org/confluence/display/CLOUDSTACK/Host+HA
>>>
>>>
>>> We'll need to you look at logs perhaps create a JIRA ticket with the 
>>> logs and details? If you saw ipmi based reboot, then host-ha indeed 
>>> tried to recover i.e. reboot the host, once hostha has done its work 
>>> it would schedule HA for VM as soon as the recovery operation 
>>> succeeds (we've simulator and kvm based marvin tests for such scenarios).
>>>
>>>
>>> Can you see it making attempt to schedule VM ha in logs, or any failure?
>>>
>>>
>>> - Rohit
>>>
>>> <https://cloudstack.apache.org>
>>>
>>>
>>>
>>> ________________________________
>>> From: Nux! <nu...@li.nux.ro>
>>> Sent: Tuesday, January 16, 2018 12:47:56 AM
>>> To: dev
>>> Subject: [4.11] HA issues
>>>
>>> Hi,
>>>
>>> I see there's a new HA engine for KVM and IPMI support which is 
>>> really nice, however it seems hit and miss.
>>> I have created an instance with HA offering, kernel panicked one of 
>>> the hypervisors - after a while the server was rebooted via IPMI 
>>> probably, but the instance never moved to a running hypervisor and 
>>> even after the original hypervisor came back it was still left in Stopped state.
>>> Is there any extra things I need to set up to have proper HA?
>>>
>>> Regards,
>>> Lucian
>>>
>>> --
>>> Sent from the Delta quadrant using Borg technology!
>>>
>>> Nux!
>>> www.nux.ro
>>>
>>> rohit.yadav@shapeblue.com
>>> www.shapeblue.com<http://www.shapeblue.com>
>>> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK
> > > @shapeblue

Re: HA issues

Posted by Nux! <nu...@li.nux.ro>.
Right, sorry for using the terms interchangeably, I see what you mean.

I'll do further testing then as VM HA was also not working in my setup.

I'll be back.

--
Sent from the Delta quadrant using Borg technology!

Nux!
www.nux.ro

----- Original Message -----
> From: "Rohit Yadav" <ro...@shapeblue.com>
> To: "dev" <de...@cloudstack.apache.org>
> Sent: Wednesday, 17 January, 2018 09:09:19
> Subject: Re: HA issues

> Hi Lucian,
> 
> 
> The "Host HA" feature is entirely different from VM HA, however, they may work
> in tandem, so please stop using the terms interchangeably as it may cause the
> community to believe a regression has been caused.
> 
> 
> The "Host HA" feature currently ships with only "Host HA" provider for KVM that
> is strictly tied to out-of-band management (IPMI for fencing, i.e power off and
> recovery, i.e. reboot) and NFS (as primary storage). (We also have a provider
> for simulator, but that's for coverage/testing purposes).
> 
> 
> Therefore, "Host HA" for KVM (+nfs) currently works only when OOBM is enabled.
> The frameowkr allows interested parties may write their own HA providers for a
> hypervisor that can use a different strategy/mechanism for fencing/recovery of
> hosts (including write a non-IPMI based OOBM plugin) and host/disk activity
> checker that is non-NFS based.
> 
> 
> The "Host HA" feature ships disabled by default and does not cause any
> interference with VM HA. However, when enabled and configured correctly, it is
> a known limitation that when it is unable to successfully perform recovery or
> fencing tasks it may not trigger VM HA. We can discuss how to handle such cases
> (thoughts?). "Host HA" would try couple of times to recover and failing to do
> so, it would eventually trigger a host fencing task. If it's unable to fence a
> host, it will indefinitely attempt to fence the host (the host state will be
> stuck at fencing state in cloud.ha_config table for example) and alerts will be
> sent to admin who can do some manual intervention to handle such situations (if
> you've email/smtp enabled, you should see alert emails).
> 
> 
> We can discuss how to improve and have a workaround for the case you've hit,
> thanks for sharing.
> 
> 
> - Rohit
> 
> ________________________________
> From: Nux! <nu...@li.nux.ro>
> Sent: Tuesday, January 16, 2018 10:42:35 PM
> To: dev
> Subject: Re: HA issues
> 
> Ok, reinstalled and re-tested.
> 
> What I've learned:
> 
> - HA only works now if OOB is configured, the old way HA no longer applies -
> this can be good and bad, not everyone has IPMIs
> 
> - HA only works if IPMI is reachable. I've pulled the cord on a HV and HA failed
> to do its thing, leaving me with a HV down along with all the VMs running
> there. That's bad.
> I've opened this ticket for it:
> https://issues.apache.org/jira/browse/CLOUDSTACK-10234
> 
> Let me know if you need any extra info or stuff to test.
> 
> Regards,
> Lucian
> 
> --
> Sent from the Delta quadrant using Borg technology!
> 
> Nux!
> www.nux.ro
> 
> 
> rohit.yadav@shapeblue.com
> www.shapeblue.com
> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK
> @shapeblue
>  
> 
> 
> ----- Original Message -----
>> From: "Nux!" <nu...@li.nux.ro>
>> To: "dev" <de...@cloudstack.apache.org>
>> Sent: Tuesday, 16 January, 2018 11:35:58
>> Subject: Re: HA issues
> 
>> I'll reinstall my setup and try again, just to be sure I'm working on a clean
>> slate.
>>
>> --
>> Sent from the Delta quadrant using Borg technology!
>>
>> Nux!
>> www.nux.ro
>>
>> ----- Original Message -----
>>> From: "Rohit Yadav" <ro...@shapeblue.com>
>>> To: "dev" <de...@cloudstack.apache.org>
>>> Sent: Tuesday, 16 January, 2018 11:29:51
>>> Subject: Re: HA issues
>>
>>> Hi Lucian,
>>>
>>>
>>> If you're talking about the new HostHA feature (with KVM+nfs+ipmi), please refer
>>> to following docs:
>>>
>>> http://docs.cloudstack.apache.org/projects/cloudstack-administration/en/latest/hosts.html#out-of-band-management
>>>
>>> https://cwiki.apache.org/confluence/display/CLOUDSTACK/Host+HA
>>>
>>>
>>> We'll need to you look at logs perhaps create a JIRA ticket with the logs and
>>> details? If you saw ipmi based reboot, then host-ha indeed tried to recover
>>> i.e. reboot the host, once hostha has done its work it would schedule HA for VM
>>> as soon as the recovery operation succeeds (we've simulator and kvm based
>>> marvin tests for such scenarios).
>>>
>>>
>>> Can you see it making attempt to schedule VM ha in logs, or any failure?
>>>
>>>
>>> - Rohit
>>>
>>> <https://cloudstack.apache.org>
>>>
>>>
>>>
>>> ________________________________
>>> From: Nux! <nu...@li.nux.ro>
>>> Sent: Tuesday, January 16, 2018 12:47:56 AM
>>> To: dev
>>> Subject: [4.11] HA issues
>>>
>>> Hi,
>>>
>>> I see there's a new HA engine for KVM and IPMI support which is really nice,
>>> however it seems hit and miss.
>>> I have created an instance with HA offering, kernel panicked one of the
>>> hypervisors - after a while the server was rebooted via IPMI probably, but the
>>> instance never moved to a running hypervisor and even after the original
>>> hypervisor came back it was still left in Stopped state.
>>> Is there any extra things I need to set up to have proper HA?
>>>
>>> Regards,
>>> Lucian
>>>
>>> --
>>> Sent from the Delta quadrant using Borg technology!
>>>
>>> Nux!
>>> www.nux.ro
>>>
>>> rohit.yadav@shapeblue.com
>>> www.shapeblue.com<http://www.shapeblue.com>
>>> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK
> > > @shapeblue

Re: HA issues

Posted by Andrija Panic <an...@gmail.com>.
HI Again Simon,

thanks for these, we also had something commited (actually the whole RBD
snap deletion logic on CEPH side, which was initially missing):
https://github.com/apache/cloudstack/pull/1230/commits and some stuff were
also handled here.

But what we have here, is, afaik a new case, where customer try to delete
large volume on CEPH (4TB in our case, or a bit smaller - happened a few
times), then this takes time (whatever reason...) - this is during the
actual delete command sent from MGMT to AGENT, so not "lazy delete" with
later purge thread) - the deletion process itself timeout after 30minutes
(1800sec - I guess this is the default "wait" global parameter) and after
this libvirt just hanges (kill -9 is the only way to restart libvirt)

i.e: delete volume sent 14

2018-02-14 15:20:53,032 INFO  [kvm.storage.LibvirtStorageAdaptor]
(agentRequest-Handler-5:null) Trying to fetch storage pool
8457c284-cf5d-3979-b82e-32ea5efeb97b from libvirt
2018-02-14 15:20:53,032 DEBUG [kvm.resource.LibvirtConnection]
(agentRequest-Handler-5:null) Looking for libvirtd connection at:
qemu:///system
2018-02-14 15:20:53,041 DEBUG [kvm.storage.LibvirtStorageAdaptor]
(agentRequest-Handler-5:null) Succesfully refreshed pool
8457c284-cf5d-3979-b82e-32ea5efeb97b Capacity: 235312757125120 Used:
44773027414768 Available: 99561505730560
2018-02-14 15:20:53,190 INFO  [kvm.storage.LibvirtStorageAdaptor]
(agentRequest-Handler-5:null) Attempting to remove volume
84c12d6f-7536-429a-8994-1b860446b672 from pool
8457c284-cf5d-3979-b82e-32ea5efeb97b
2018-02-14 15:20:53,190 INFO  [kvm.storage.LibvirtStorageAdaptor]
(agentRequest-Handler-5:null) Unprotecting and Removing RBD snapshots of
image cold-storage/84c12d6f-7536-429a-8994-1b860446b672 prior to removing
the image
2018-02-14 15:20:53,202 DEBUG [kvm.storage.LibvirtStorageAdaptor]
(agentRequest-Handler-5:null) Succesfully connected to Ceph cluster at
mon.xxxxyyyy.local:6789
2018-02-14 15:20:53,216 DEBUG [kvm.storage.LibvirtStorageAdaptor]
(agentRequest-Handler-5:null) Fetching list of snapshots of RBD image
cold-storage/84c12d6f-7536-429a-8994-1b860446b672
2018-02-14 15:20:53,224 DEBUG [kvm.storage.LibvirtStorageAdaptor]
(agentRequest-Handler-5:null) Succesfully unprotected and removed any
snapshots of cold-storage/84c12d6f-7536-429a-8994-1b860446b672 Continuing
to remove the RBD image
2018-02-14 15:20:53,228 DEBUG [kvm.storage.LibvirtStorageAdaptor]
(agentRequest-Handler-5:null) Succesfully closed rbd image and destroyed io
context.
2018-02-14 15:20:53,229 DEBUG [kvm.storage.LibvirtStorageAdaptor]
(agentRequest-Handler-5:null) Instructing libvirt to remove volume
84c12d6f-7536-429a-8994-1b860446b672 from pool
8457c284-cf5d-3979-b82e-32ea5efeb97b

then 30 minutes later, timeout.

2018-02-14 15:50:53,030 WARN  [c.c.a.m.AgentAttache]
(catalina-exec-4:ctx-468724bd ctx-b9984210) (logid:d23624d1) Seq
16-3001086201689154455: *Timed out on Seq 16-3001086201689154455*:  { Cmd ,
MgmtId: 90520740254323, via: 16(eq4-c2-2), Ver: v1, Flags: 100011,
[{"org.apache.cloudstack.storage.command.DeleteCommand":{"data":{"org.apache.cloudstack.storage.to.VolumeObjectTO":{"uuid":"84c12d6f-7536-429a-8994-1b860446b672","volumeType":"DATADISK","dataStore":{"org.apache.cloudstack.storage.to.PrimaryDataStoreTO":{"uuid":"8457c284-cf5d-3979-b82e-32ea5efeb97b","id":1,"poolType":"RBD","host":"mon.xxxxyyyy.local","path":"cold-storage","port":6789,"url":"RBD://mon.xxxxyyyy.local/cold-storage/?ROLE=Primary&STOREUUID=8457c284-cf5d-3979-b82e-32ea5efeb97b"}},"name":"PRDRMSSQL01-DATA-DR","size":1073741824000,"path":"84c12d6f-7536-429a-8994-1b860446b672","volumeId":13889,"accountId":722,"format":"RAW","provisioningType":"THIN","id":13889,"hypervisorType":"KVM"}},"wait":0}}]
}

and then 3minute later

We get agent disconnected (virsh stuck, even virsh list don't work).
Nothing special in libvirt logs...

After this the volume still exist on CEPH, but I beleive later is again
removed via purge thread in ACS (I dont remember manually deleting them) -
which is very interesting actaully - why it does (seems to do) immediate
volume deletion, when later it's again removed (by purge thread I assume).

CHeers



On 19 February 2018 at 12:55, Simon Weller <sw...@ena.com.invalid> wrote:

> Also these -
>
> https://github.com/myENA/cloudstack/pull/20/commits/
> 1948ce5d24b87433ae9e8f4faebdfc20b56b751a
>
>
> https://github.com/myENA/cloudstack/pull/12/commits
>
>
>
>
>
> ________________________________
> From: Andrija Panic <an...@gmail.com>
> Sent: Monday, February 19, 2018 5:23 AM
> To: dev
> Subject: Re: HA issues
>
> Hi Simon,
>
> a big thank you for this, will have our devs check this!
>
> Thanks!
>
> On 19 February 2018 at 09:02, Simon Weller <sw...@ena.com.invalid>
> wrote:
>
> > Andrija,
> >
> >
> > We pushed quite a few PRs on the exception and lockup issues related to
> > Ceph in the agent.
> >
> >
> > We have a PR for the deletion issue. See if you have it pulled into your
> > release - https://github.com/myENA/cloudstack/pull/9
> [https://avatars1.githubusercontent.com/u/1444686?s=400&v=4]<https://
> github.com/myENA/cloudstack/pull/9>
>
> context cleanup by leprechau · Pull Request #9 · myENA/cloudstack<https://
> github.com/myENA/cloudstack/pull/9>
> github.com
> cleanup rbd image and rados context even if exceptions are thrown in
> deletePhysicalDisk routine
>
>
>
> >
> >
> > - Si
> >
> >
> >
> >
> > ________________________________
> > From: Andrija Panic <an...@gmail.com>
> > Sent: Saturday, February 17, 2018 1:49 PM
> > To: dev
> > Subject: Re: HA issues
> >
> > Hi Sean,
> >
> > (we have 2 threads interleaving on the libvirt lockd..) - so, did you
> > manage to understand what can cause the Agent Disconnect in most cases,
> for
> > you specifically? Is there any software (CloudStack) root cause
> > (disregarding i.e. networking issues etc)
> >
> > Just our examples, which you should probably not have:
> >
> > We had CEPH cluster running (with ACS), and there any exception in librbd
> > would crash JVM and the agent, but this has been fixed mostly -
> > Now get i.e. agent disconnect when ACS try to delete volume on CEPH (and
> > for some reason not succeed withing 30 minutes, volume deletion fails) -
> > then libvirt get's completety stuck (virsh list even dont work)...so
> agent
> > get's disconnect eventually.
> >
> > It would be good to get rid of agent disconnections in general, obviously
> > :) so that is why I'm asking (you are on NFS, so would like to see your
> > experience here).
> >
> > Thanks
> >
> > On 16 February 2018 at 21:52, Sean Lair <sl...@ippathways.com> wrote:
> >
> > > We were in the same situation as Nux.
> > >
> > > In our test environment we hit the issue with VMs not getting fenced
> and
> > > coming up on two hosts because of VM HA.   However, we updated some of
> > the
> > > logic for VM HA and turned on libvirtd's locking mechanism.  Now we are
> > > working great w/o IPMI.  The locking stops the VMs from starting
> > elsewhere,
> > > and everything recovers very nicely when the host starts responding
> > again.
> > >
> > > We are on 4.9.3 and haven't started testing with 4.11 yet, but it may
> > work
> > > along-side IPMI just fine - it would just have affect the fencing.
> > > However, we *currently* prefer how we are doing it now, because if the
> > > agent stops responding, but the host is still up, the VMs continue
> > running
> > > and no actual downtime is incurred.  Even when VM HA attempts to power
> on
> > > the VMs on another host, it just fails the power-up and the VMs
> continue
> > to
> > > run on the "agent disconnected" host. The host goes into alarm state
> and
> > > our NOC can look into what is wrong the agent on the host.  If IPMI was
> > > enabled, it sounds like it would power off the host (fence) and force
> > > downtime for us even if the VMs were actually running OK - and just the
> > > agent is unreachable.
> > >
> > > I plan on submitting our updates via a pull request at some point.
> But I
> > > can also send the updated code to anyone that wants to do some testing
> > > before then.
> > >
> > > -----Original Message-----
> > > From: Marcus [mailto:shadowsor@gmail.com]
> > > Sent: Friday, February 16, 2018 11:27 AM
> > > To: dev@cloudstack.apache.org
> > > Subject: Re: HA issues
> > >
> > > From your other emails it sounds as though you do not have IPMI
> > > configured, nor host HA enabled, correct? In this case, the correct
> thing
> > > to do is nothing. If CloudStack cannot guarantee the VM state (as is
> the
> > > case with an unreachable hypervisor), it should do nothing, for fear of
> > > causing a split brain and corrupting the VM disk (VM running on two
> > hosts).
> > >
> > > Clustering and fencing is a tricky proposition. When CloudStack (or any
> > > other cluster manager) is not configured to or cannot guarantee state
> > then
> > > things will simply lock up, in this case your HA VM on your broken
> > > hypervisor will not run elsewhere. This has been the case for a long
> time
> > > with CloudStack, HA would only start a VM after the original hypervisor
> > > agent came back and reported no VM is running.
> > >
> > > The new feature, from what I gather, simply adds the possibility of
> > > CloudStack being able to reach out and shut down the hypervisor to
> > > guarantee state. At that point it can start the VM elsewhere. If
> > something
> > > fails in that process (IPMI unreachable, for example, or bad
> > credentials),
> > > you're still going to be stuck with a VM not coming back.
> > >
> > > It's the nature of the thing. I'd be wary of any HA solution that does
> > not
> > > reach out and guarantee state via host or storage fencing before
> > starting a
> > > VM elsewhere, as it will be making assumptions. Its entirely possible a
> > VM
> > > might be unreachable or unable to access it storage for a short while,
> a
> > > new instance of the VM is started elsewhere, and the original VM comes
> > back.
> > >
> > > On Wed, Jan 17, 2018 at 9:02 AM Nux! <nu...@li.nux.ro> wrote:
> > >
> > > > Hi Rohit,
> > > >
> > > > I've reinstalled and tested. Still no go with VM HA.
> > > >
> > > > What I did was to kernel panic that particular HV ("echo c >
> > > > /proc/sysrq-trigger" <- this is a proper way to simulate a crash).
> > > > What happened next is the HV got marked as "Alert", the VM on it was
> > > > all the time marked as "Running" and it was not migrated to another
> HV.
> > > > Once the panicked HV has booted back the VM reboots and becomes
> > > available.
> > > >
> > > > I'm running on CentOS 7 mgmt + HVs and NFS primary and secondary
> > storage.
> > > > The VM has HA enabled service offering.
> > > > Host HA or OOBM configuration was not touched.
> > > >
> > > > Full log http://tmp.nux.ro/W3s-management-server.log
> > > >
> > > > --
> > > > Sent from the Delta quadrant using Borg technology!
> > > >
> > > > Nux!
> > > > www.nux.ro
> > > >
> > > > ----- Original Message -----
> > > > > From: "Rohit Yadav" <ro...@shapeblue.com>
> > > > > To: "dev" <de...@cloudstack.apache.org>
> > > > > Sent: Wednesday, 17 January, 2018 12:13:33
> > > > > Subject: Re: HA issues
> > > >
> > > > > I performed VM HA sanity checks and was not able to reproduce any
> > > > regression
> > > > > against two KVM CentOS7 hosts in a cluster.
> > > > >
> > > > >
> > > > > Without the "Host HA" feature, I deployed few HA-enabled VMs on a
> > > > > KVM
> > > > host2 and
> > > > > killed it (powered off). After few minutes of CloudStack attempting
> > > > > to
> > > > find why
> > > > > the host (kvm agent) timed out, CloudStack kicked investigators,
> > > > > that eventually led KVM fencers to work and VM HA job kicked to
> > > > > start those
> > > > few VMs
> > > > > on host1 and the KVM host2 was put to "Down" state.
> > > > >
> > > > >
> > > > > - Rohit
> > > > >
> > > > > <https://cloudstack.apache.org>
> > > > >
> > > > >
> > > > >
> > > > > ________________________________
> > > > >
> > > > > rohit.yadav@shapeblue.com
> > > > > www.shapeblue.com<http://www.shapeblue.com>
> > > > > 53 Chandos Place, Covent Garden, London  WC2N 4HSUK @shapeblue
> > > > >
> > > > >
> > > > >
> > > > > From: Rohit Yadav
> > > > > Sent: Wednesday, January 17, 2018 2:39:19 PM
> > > > > To: dev
> > > > > Subject: Re: HA issues
> > > > >
> > > > >
> > > > > Hi Lucian,
> > > > >
> > > > >
> > > > > The "Host HA" feature is entirely different from VM HA, however,
> they
> > > > may work
> > > > > in tandem, so please stop using the terms interchangeably as it may
> > > > cause the
> > > > > community to believe a regression has been caused.
> > > > >
> > > > >
> > > > > The "Host HA" feature currently ships with only "Host HA" provider
> > for
> > > > KVM that
> > > > > is strictly tied to out-of-band management (IPMI for fencing, i.e
> > power
> > > > off and
> > > > > recovery, i.e. reboot) and NFS (as primary storage). (We also have
> a
> > > > provider
> > > > > for simulator, but that's for coverage/testing purposes).
> > > > >
> > > > >
> > > > > Therefore, "Host HA" for KVM (+nfs) currently works only when OOBM
> is
> > > > enabled.
> > > > > The frameowkr allows interested parties may write their own HA
> > > providers
> > > > for a
> > > > > hypervisor that can use a different strategy/mechanism for
> > > > fencing/recovery of
> > > > > hosts (including write a non-IPMI based OOBM plugin) and host/disk
> > > > activity
> > > > > checker that is non-NFS based.
> > > > >
> > > > >
> > > > > The "Host HA" feature ships disabled by default and does not cause
> > any
> > > > > interference with VM HA. However, when enabled and configured
> > > correctly,
> > > > it is
> > > > > a known limitation that when it is unable to successfully perform
> > > > recovery or
> > > > > fencing tasks it may not trigger VM HA. We can discuss how to
> handle
> > > > such cases
> > > > > (thoughts?). "Host HA" would try couple of times to recover and
> > failing
> > > > to do
> > > > > so, it would eventually trigger a host fencing task. If it's unable
> > to
> > > > fence a
> > > > > host, it will indefinitely attempt to fence the host (the host
> state
> > > > will be
> > > > > stuck at fencing state in cloud.ha_config table for example) and
> > alerts
> > > > will be
> > > > > sent to admin who can do some manual intervention to handle such
> > > > situations (if
> > > > > you've email/smtp enabled, you should see alert emails).
> > > > >
> > > > >
> > > > > We can discuss how to improve and have a workaround for the case
> > you've
> > > > hit,
> > > > > thanks for sharing.
> > > > >
> > > > >
> > > > > - Rohit
> > > > >
> > > > > ________________________________
> > > > > From: Nux! <nu...@li.nux.ro>
> > > > > Sent: Tuesday, January 16, 2018 10:42:35 PM
> > > > > To: dev
> > > > > Subject: Re: HA issues
> > > > >
> > > > > Ok, reinstalled and re-tested.
> > > > >
> > > > > What I've learned:
> > > > >
> > > > > - HA only works now if OOB is configured, the old way HA no longer
> > > > applies -
> > > > > this can be good and bad, not everyone has IPMIs
> > > > >
> > > > > - HA only works if IPMI is reachable. I've pulled the cord on a HV
> > and
> > > > HA failed
> > > > > to do its thing, leaving me with a HV down along with all the VMs
> > > running
> > > > > there. That's bad.
> > > > > I've opened this ticket for it:
> > > > > https://issues.apache.org/jira/browse/CLOUDSTACK-10234
> > > > >
> > > > > Let me know if you need any extra info or stuff to test.
> > > > >
> > > > > Regards,
> > > > > Lucian
> > > > >
> > > > > --
> > > > > Sent from the Delta quadrant using Borg technology!
> > > > >
> > > > > Nux!
> > > > > www.nux.ro
> > > > >
> > > > > ----- Original Message -----
> > > > >> From: "Nux!" <nu...@li.nux.ro>
> > > > >> To: "dev" <de...@cloudstack.apache.org>
> > > > >> Sent: Tuesday, 16 January, 2018 11:35:58
> > > > >> Subject: Re: HA issues
> > > > >
> > > > >> I'll reinstall my setup and try again, just to be sure I'm working
> > on
> > > a
> > > > clean
> > > > >> slate.
> > > > >>
> > > > >> --
> > > > >> Sent from the Delta quadrant using Borg technology!
> > > > >>
> > > > >> Nux!
> > > > >> www.nux.ro
> > > > >>
> > > > >> ----- Original Message -----
> > > > >>> From: "Rohit Yadav" <ro...@shapeblue.com>
> > > > >>> To: "dev" <de...@cloudstack.apache.org>
> > > > >>> Sent: Tuesday, 16 January, 2018 11:29:51
> > > > >>> Subject: Re: HA issues
> > > > >>
> > > > >>> Hi Lucian,
> > > > >>>
> > > > >>>
> > > > >>> If you're talking about the new HostHA feature (with
> KVM+nfs+ipmi),
> > > > please refer
> > > > >>> to following docs:
> > > > >>>
> > > > >>>
> > > > http://docs.cloudstack.apache.org/projects/cloudstack-
> > > administration/en/latest/hosts.html#out-of-band-management
> > > > >>>
> > > > >>> https://cwiki.apache.org/confluence/display/CLOUDSTACK/Host+HA
> > > > >>>
> > > > >>>
> > > > >>> We'll need to you look at logs perhaps create a JIRA ticket with
> > the
> > > > logs and
> > > > >>> details? If you saw ipmi based reboot, then host-ha indeed tried
> to
> > > > recover
> > > > >>> i.e. reboot the host, once hostha has done its work it would
> > schedule
> > > > HA for VM
> > > > >>> as soon as the recovery operation succeeds (we've simulator and
> kvm
> > > > based
> > > > >>> marvin tests for such scenarios).
> > > > >>>
> > > > >>>
> > > > >>> Can you see it making attempt to schedule VM ha in logs, or any
> > > > failure?
> > > > >>>
> > > > >>>
> > > > >>> - Rohit
> > > > >>>
> > > > >>> <https://cloudstack.apache.org>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>> ________________________________
> > > > >>> From: Nux! <nu...@li.nux.ro>
> > > > >>> Sent: Tuesday, January 16, 2018 12:47:56 AM
> > > > >>> To: dev
> > > > >>> Subject: [4.11] HA issues
> > > > >>>
> > > > >>> Hi,
> > > > >>>
> > > > >>> I see there's a new HA engine for KVM and IPMI support which is
> > > really
> > > > nice,
> > > > >>> however it seems hit and miss.
> > > > >>> I have created an instance with HA offering, kernel panicked one
> of
> > > the
> > > > >>> hypervisors - after a while the server was rebooted via IPMI
> > > probably,
> > > > but the
> > > > >>> instance never moved to a running hypervisor and even after the o
> > > > <https://maps.google.com/?q=to+a+running+hypervisor+and+
> > > even+after+the+o&entry=gmail&source=g>
> > > > riginal
> > > > >>> hypervisor came back it was still left in Stopped state.
> > > > >>> Is there any extra things I need to set up to have proper HA?
> > > > >>>
> > > > >>> Regards,
> > > > >>> Lucian
> > > > >>>
> > > > >>> --
> > > > >>> Sent from the Delta quadrant using Borg technology!
> > > > >>>
> > > > >>> Nux!
> > > > >>> www.nux.ro
> > > > >>>
> > > > >>> rohit.yadav@shapeblue.com
> > > > >>> www.shapeblue.com<http://www.shapeblue.com>
> > > > >>> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK
> > > > > > > @shapeblue
> > > >
> > >
> >
> >
> >
> > --
> >
> > Andrija Panić
> >
>
>
>
> --
>
> Andrija Panić
>



-- 

Andrija Panić

Re: HA issues

Posted by Simon Weller <sw...@ena.com.INVALID>.
Also these -

https://github.com/myENA/cloudstack/pull/20/commits/1948ce5d24b87433ae9e8f4faebdfc20b56b751a


https://github.com/myENA/cloudstack/pull/12/commits





________________________________
From: Andrija Panic <an...@gmail.com>
Sent: Monday, February 19, 2018 5:23 AM
To: dev
Subject: Re: HA issues

Hi Simon,

a big thank you for this, will have our devs check this!

Thanks!

On 19 February 2018 at 09:02, Simon Weller <sw...@ena.com.invalid> wrote:

> Andrija,
>
>
> We pushed quite a few PRs on the exception and lockup issues related to
> Ceph in the agent.
>
>
> We have a PR for the deletion issue. See if you have it pulled into your
> release - https://github.com/myENA/cloudstack/pull/9
[https://avatars1.githubusercontent.com/u/1444686?s=400&v=4]<https://github.com/myENA/cloudstack/pull/9>

context cleanup by leprechau · Pull Request #9 · myENA/cloudstack<https://github.com/myENA/cloudstack/pull/9>
github.com
cleanup rbd image and rados context even if exceptions are thrown in deletePhysicalDisk routine



>
>
> - Si
>
>
>
>
> ________________________________
> From: Andrija Panic <an...@gmail.com>
> Sent: Saturday, February 17, 2018 1:49 PM
> To: dev
> Subject: Re: HA issues
>
> Hi Sean,
>
> (we have 2 threads interleaving on the libvirt lockd..) - so, did you
> manage to understand what can cause the Agent Disconnect in most cases, for
> you specifically? Is there any software (CloudStack) root cause
> (disregarding i.e. networking issues etc)
>
> Just our examples, which you should probably not have:
>
> We had CEPH cluster running (with ACS), and there any exception in librbd
> would crash JVM and the agent, but this has been fixed mostly -
> Now get i.e. agent disconnect when ACS try to delete volume on CEPH (and
> for some reason not succeed withing 30 minutes, volume deletion fails) -
> then libvirt get's completety stuck (virsh list even dont work)...so  agent
> get's disconnect eventually.
>
> It would be good to get rid of agent disconnections in general, obviously
> :) so that is why I'm asking (you are on NFS, so would like to see your
> experience here).
>
> Thanks
>
> On 16 February 2018 at 21:52, Sean Lair <sl...@ippathways.com> wrote:
>
> > We were in the same situation as Nux.
> >
> > In our test environment we hit the issue with VMs not getting fenced and
> > coming up on two hosts because of VM HA.   However, we updated some of
> the
> > logic for VM HA and turned on libvirtd's locking mechanism.  Now we are
> > working great w/o IPMI.  The locking stops the VMs from starting
> elsewhere,
> > and everything recovers very nicely when the host starts responding
> again.
> >
> > We are on 4.9.3 and haven't started testing with 4.11 yet, but it may
> work
> > along-side IPMI just fine - it would just have affect the fencing.
> > However, we *currently* prefer how we are doing it now, because if the
> > agent stops responding, but the host is still up, the VMs continue
> running
> > and no actual downtime is incurred.  Even when VM HA attempts to power on
> > the VMs on another host, it just fails the power-up and the VMs continue
> to
> > run on the "agent disconnected" host. The host goes into alarm state and
> > our NOC can look into what is wrong the agent on the host.  If IPMI was
> > enabled, it sounds like it would power off the host (fence) and force
> > downtime for us even if the VMs were actually running OK - and just the
> > agent is unreachable.
> >
> > I plan on submitting our updates via a pull request at some point.  But I
> > can also send the updated code to anyone that wants to do some testing
> > before then.
> >
> > -----Original Message-----
> > From: Marcus [mailto:shadowsor@gmail.com]
> > Sent: Friday, February 16, 2018 11:27 AM
> > To: dev@cloudstack.apache.org
> > Subject: Re: HA issues
> >
> > From your other emails it sounds as though you do not have IPMI
> > configured, nor host HA enabled, correct? In this case, the correct thing
> > to do is nothing. If CloudStack cannot guarantee the VM state (as is the
> > case with an unreachable hypervisor), it should do nothing, for fear of
> > causing a split brain and corrupting the VM disk (VM running on two
> hosts).
> >
> > Clustering and fencing is a tricky proposition. When CloudStack (or any
> > other cluster manager) is not configured to or cannot guarantee state
> then
> > things will simply lock up, in this case your HA VM on your broken
> > hypervisor will not run elsewhere. This has been the case for a long time
> > with CloudStack, HA would only start a VM after the original hypervisor
> > agent came back and reported no VM is running.
> >
> > The new feature, from what I gather, simply adds the possibility of
> > CloudStack being able to reach out and shut down the hypervisor to
> > guarantee state. At that point it can start the VM elsewhere. If
> something
> > fails in that process (IPMI unreachable, for example, or bad
> credentials),
> > you're still going to be stuck with a VM not coming back.
> >
> > It's the nature of the thing. I'd be wary of any HA solution that does
> not
> > reach out and guarantee state via host or storage fencing before
> starting a
> > VM elsewhere, as it will be making assumptions. Its entirely possible a
> VM
> > might be unreachable or unable to access it storage for a short while, a
> > new instance of the VM is started elsewhere, and the original VM comes
> back.
> >
> > On Wed, Jan 17, 2018 at 9:02 AM Nux! <nu...@li.nux.ro> wrote:
> >
> > > Hi Rohit,
> > >
> > > I've reinstalled and tested. Still no go with VM HA.
> > >
> > > What I did was to kernel panic that particular HV ("echo c >
> > > /proc/sysrq-trigger" <- this is a proper way to simulate a crash).
> > > What happened next is the HV got marked as "Alert", the VM on it was
> > > all the time marked as "Running" and it was not migrated to another HV.
> > > Once the panicked HV has booted back the VM reboots and becomes
> > available.
> > >
> > > I'm running on CentOS 7 mgmt + HVs and NFS primary and secondary
> storage.
> > > The VM has HA enabled service offering.
> > > Host HA or OOBM configuration was not touched.
> > >
> > > Full log http://tmp.nux.ro/W3s-management-server.log
> > >
> > > --
> > > Sent from the Delta quadrant using Borg technology!
> > >
> > > Nux!
> > > www.nux.ro
> > >
> > > ----- Original Message -----
> > > > From: "Rohit Yadav" <ro...@shapeblue.com>
> > > > To: "dev" <de...@cloudstack.apache.org>
> > > > Sent: Wednesday, 17 January, 2018 12:13:33
> > > > Subject: Re: HA issues
> > >
> > > > I performed VM HA sanity checks and was not able to reproduce any
> > > regression
> > > > against two KVM CentOS7 hosts in a cluster.
> > > >
> > > >
> > > > Without the "Host HA" feature, I deployed few HA-enabled VMs on a
> > > > KVM
> > > host2 and
> > > > killed it (powered off). After few minutes of CloudStack attempting
> > > > to
> > > find why
> > > > the host (kvm agent) timed out, CloudStack kicked investigators,
> > > > that eventually led KVM fencers to work and VM HA job kicked to
> > > > start those
> > > few VMs
> > > > on host1 and the KVM host2 was put to "Down" state.
> > > >
> > > >
> > > > - Rohit
> > > >
> > > > <https://cloudstack.apache.org>
> > > >
> > > >
> > > >
> > > > ________________________________
> > > >
> > > > rohit.yadav@shapeblue.com
> > > > www.shapeblue.com<http://www.shapeblue.com>
> > > > 53 Chandos Place, Covent Garden, London  WC2N 4HSUK @shapeblue
> > > >
> > > >
> > > >
> > > > From: Rohit Yadav
> > > > Sent: Wednesday, January 17, 2018 2:39:19 PM
> > > > To: dev
> > > > Subject: Re: HA issues
> > > >
> > > >
> > > > Hi Lucian,
> > > >
> > > >
> > > > The "Host HA" feature is entirely different from VM HA, however, they
> > > may work
> > > > in tandem, so please stop using the terms interchangeably as it may
> > > cause the
> > > > community to believe a regression has been caused.
> > > >
> > > >
> > > > The "Host HA" feature currently ships with only "Host HA" provider
> for
> > > KVM that
> > > > is strictly tied to out-of-band management (IPMI for fencing, i.e
> power
> > > off and
> > > > recovery, i.e. reboot) and NFS (as primary storage). (We also have a
> > > provider
> > > > for simulator, but that's for coverage/testing purposes).
> > > >
> > > >
> > > > Therefore, "Host HA" for KVM (+nfs) currently works only when OOBM is
> > > enabled.
> > > > The frameowkr allows interested parties may write their own HA
> > providers
> > > for a
> > > > hypervisor that can use a different strategy/mechanism for
> > > fencing/recovery of
> > > > hosts (including write a non-IPMI based OOBM plugin) and host/disk
> > > activity
> > > > checker that is non-NFS based.
> > > >
> > > >
> > > > The "Host HA" feature ships disabled by default and does not cause
> any
> > > > interference with VM HA. However, when enabled and configured
> > correctly,
> > > it is
> > > > a known limitation that when it is unable to successfully perform
> > > recovery or
> > > > fencing tasks it may not trigger VM HA. We can discuss how to handle
> > > such cases
> > > > (thoughts?). "Host HA" would try couple of times to recover and
> failing
> > > to do
> > > > so, it would eventually trigger a host fencing task. If it's unable
> to
> > > fence a
> > > > host, it will indefinitely attempt to fence the host (the host state
> > > will be
> > > > stuck at fencing state in cloud.ha_config table for example) and
> alerts
> > > will be
> > > > sent to admin who can do some manual intervention to handle such
> > > situations (if
> > > > you've email/smtp enabled, you should see alert emails).
> > > >
> > > >
> > > > We can discuss how to improve and have a workaround for the case
> you've
> > > hit,
> > > > thanks for sharing.
> > > >
> > > >
> > > > - Rohit
> > > >
> > > > ________________________________
> > > > From: Nux! <nu...@li.nux.ro>
> > > > Sent: Tuesday, January 16, 2018 10:42:35 PM
> > > > To: dev
> > > > Subject: Re: HA issues
> > > >
> > > > Ok, reinstalled and re-tested.
> > > >
> > > > What I've learned:
> > > >
> > > > - HA only works now if OOB is configured, the old way HA no longer
> > > applies -
> > > > this can be good and bad, not everyone has IPMIs
> > > >
> > > > - HA only works if IPMI is reachable. I've pulled the cord on a HV
> and
> > > HA failed
> > > > to do its thing, leaving me with a HV down along with all the VMs
> > running
> > > > there. That's bad.
> > > > I've opened this ticket for it:
> > > > https://issues.apache.org/jira/browse/CLOUDSTACK-10234
> > > >
> > > > Let me know if you need any extra info or stuff to test.
> > > >
> > > > Regards,
> > > > Lucian
> > > >
> > > > --
> > > > Sent from the Delta quadrant using Borg technology!
> > > >
> > > > Nux!
> > > > www.nux.ro
> > > >
> > > > ----- Original Message -----
> > > >> From: "Nux!" <nu...@li.nux.ro>
> > > >> To: "dev" <de...@cloudstack.apache.org>
> > > >> Sent: Tuesday, 16 January, 2018 11:35:58
> > > >> Subject: Re: HA issues
> > > >
> > > >> I'll reinstall my setup and try again, just to be sure I'm working
> on
> > a
> > > clean
> > > >> slate.
> > > >>
> > > >> --
> > > >> Sent from the Delta quadrant using Borg technology!
> > > >>
> > > >> Nux!
> > > >> www.nux.ro
> > > >>
> > > >> ----- Original Message -----
> > > >>> From: "Rohit Yadav" <ro...@shapeblue.com>
> > > >>> To: "dev" <de...@cloudstack.apache.org>
> > > >>> Sent: Tuesday, 16 January, 2018 11:29:51
> > > >>> Subject: Re: HA issues
> > > >>
> > > >>> Hi Lucian,
> > > >>>
> > > >>>
> > > >>> If you're talking about the new HostHA feature (with KVM+nfs+ipmi),
> > > please refer
> > > >>> to following docs:
> > > >>>
> > > >>>
> > > http://docs.cloudstack.apache.org/projects/cloudstack-
> > administration/en/latest/hosts.html#out-of-band-management
> > > >>>
> > > >>> https://cwiki.apache.org/confluence/display/CLOUDSTACK/Host+HA
> > > >>>
> > > >>>
> > > >>> We'll need to you look at logs perhaps create a JIRA ticket with
> the
> > > logs and
> > > >>> details? If you saw ipmi based reboot, then host-ha indeed tried to
> > > recover
> > > >>> i.e. reboot the host, once hostha has done its work it would
> schedule
> > > HA for VM
> > > >>> as soon as the recovery operation succeeds (we've simulator and kvm
> > > based
> > > >>> marvin tests for such scenarios).
> > > >>>
> > > >>>
> > > >>> Can you see it making attempt to schedule VM ha in logs, or any
> > > failure?
> > > >>>
> > > >>>
> > > >>> - Rohit
> > > >>>
> > > >>> <https://cloudstack.apache.org>
> > > >>>
> > > >>>
> > > >>>
> > > >>> ________________________________
> > > >>> From: Nux! <nu...@li.nux.ro>
> > > >>> Sent: Tuesday, January 16, 2018 12:47:56 AM
> > > >>> To: dev
> > > >>> Subject: [4.11] HA issues
> > > >>>
> > > >>> Hi,
> > > >>>
> > > >>> I see there's a new HA engine for KVM and IPMI support which is
> > really
> > > nice,
> > > >>> however it seems hit and miss.
> > > >>> I have created an instance with HA offering, kernel panicked one of
> > the
> > > >>> hypervisors - after a while the server was rebooted via IPMI
> > probably,
> > > but the
> > > >>> instance never moved to a running hypervisor and even after the o
> > > <https://maps.google.com/?q=to+a+running+hypervisor+and+
> > even+after+the+o&entry=gmail&source=g>
> > > riginal
> > > >>> hypervisor came back it was still left in Stopped state.
> > > >>> Is there any extra things I need to set up to have proper HA?
> > > >>>
> > > >>> Regards,
> > > >>> Lucian
> > > >>>
> > > >>> --
> > > >>> Sent from the Delta quadrant using Borg technology!
> > > >>>
> > > >>> Nux!
> > > >>> www.nux.ro
> > > >>>
> > > >>> rohit.yadav@shapeblue.com
> > > >>> www.shapeblue.com<http://www.shapeblue.com>
> > > >>> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK
> > > > > > @shapeblue
> > >
> >
>
>
>
> --
>
> Andrija Panić
>



--

Andrija Panić

Re: HA issues

Posted by Andrija Panic <an...@gmail.com>.
Hi Simon,

a big thank you for this, will have our devs check this!

Thanks!

On 19 February 2018 at 09:02, Simon Weller <sw...@ena.com.invalid> wrote:

> Andrija,
>
>
> We pushed quite a few PRs on the exception and lockup issues related to
> Ceph in the agent.
>
>
> We have a PR for the deletion issue. See if you have it pulled into your
> release - https://github.com/myENA/cloudstack/pull/9
>
>
> - Si
>
>
>
>
> ________________________________
> From: Andrija Panic <an...@gmail.com>
> Sent: Saturday, February 17, 2018 1:49 PM
> To: dev
> Subject: Re: HA issues
>
> Hi Sean,
>
> (we have 2 threads interleaving on the libvirt lockd..) - so, did you
> manage to understand what can cause the Agent Disconnect in most cases, for
> you specifically? Is there any software (CloudStack) root cause
> (disregarding i.e. networking issues etc)
>
> Just our examples, which you should probably not have:
>
> We had CEPH cluster running (with ACS), and there any exception in librbd
> would crash JVM and the agent, but this has been fixed mostly -
> Now get i.e. agent disconnect when ACS try to delete volume on CEPH (and
> for some reason not succeed withing 30 minutes, volume deletion fails) -
> then libvirt get's completety stuck (virsh list even dont work)...so  agent
> get's disconnect eventually.
>
> It would be good to get rid of agent disconnections in general, obviously
> :) so that is why I'm asking (you are on NFS, so would like to see your
> experience here).
>
> Thanks
>
> On 16 February 2018 at 21:52, Sean Lair <sl...@ippathways.com> wrote:
>
> > We were in the same situation as Nux.
> >
> > In our test environment we hit the issue with VMs not getting fenced and
> > coming up on two hosts because of VM HA.   However, we updated some of
> the
> > logic for VM HA and turned on libvirtd's locking mechanism.  Now we are
> > working great w/o IPMI.  The locking stops the VMs from starting
> elsewhere,
> > and everything recovers very nicely when the host starts responding
> again.
> >
> > We are on 4.9.3 and haven't started testing with 4.11 yet, but it may
> work
> > along-side IPMI just fine - it would just have affect the fencing.
> > However, we *currently* prefer how we are doing it now, because if the
> > agent stops responding, but the host is still up, the VMs continue
> running
> > and no actual downtime is incurred.  Even when VM HA attempts to power on
> > the VMs on another host, it just fails the power-up and the VMs continue
> to
> > run on the "agent disconnected" host. The host goes into alarm state and
> > our NOC can look into what is wrong the agent on the host.  If IPMI was
> > enabled, it sounds like it would power off the host (fence) and force
> > downtime for us even if the VMs were actually running OK - and just the
> > agent is unreachable.
> >
> > I plan on submitting our updates via a pull request at some point.  But I
> > can also send the updated code to anyone that wants to do some testing
> > before then.
> >
> > -----Original Message-----
> > From: Marcus [mailto:shadowsor@gmail.com]
> > Sent: Friday, February 16, 2018 11:27 AM
> > To: dev@cloudstack.apache.org
> > Subject: Re: HA issues
> >
> > From your other emails it sounds as though you do not have IPMI
> > configured, nor host HA enabled, correct? In this case, the correct thing
> > to do is nothing. If CloudStack cannot guarantee the VM state (as is the
> > case with an unreachable hypervisor), it should do nothing, for fear of
> > causing a split brain and corrupting the VM disk (VM running on two
> hosts).
> >
> > Clustering and fencing is a tricky proposition. When CloudStack (or any
> > other cluster manager) is not configured to or cannot guarantee state
> then
> > things will simply lock up, in this case your HA VM on your broken
> > hypervisor will not run elsewhere. This has been the case for a long time
> > with CloudStack, HA would only start a VM after the original hypervisor
> > agent came back and reported no VM is running.
> >
> > The new feature, from what I gather, simply adds the possibility of
> > CloudStack being able to reach out and shut down the hypervisor to
> > guarantee state. At that point it can start the VM elsewhere. If
> something
> > fails in that process (IPMI unreachable, for example, or bad
> credentials),
> > you're still going to be stuck with a VM not coming back.
> >
> > It's the nature of the thing. I'd be wary of any HA solution that does
> not
> > reach out and guarantee state via host or storage fencing before
> starting a
> > VM elsewhere, as it will be making assumptions. Its entirely possible a
> VM
> > might be unreachable or unable to access it storage for a short while, a
> > new instance of the VM is started elsewhere, and the original VM comes
> back.
> >
> > On Wed, Jan 17, 2018 at 9:02 AM Nux! <nu...@li.nux.ro> wrote:
> >
> > > Hi Rohit,
> > >
> > > I've reinstalled and tested. Still no go with VM HA.
> > >
> > > What I did was to kernel panic that particular HV ("echo c >
> > > /proc/sysrq-trigger" <- this is a proper way to simulate a crash).
> > > What happened next is the HV got marked as "Alert", the VM on it was
> > > all the time marked as "Running" and it was not migrated to another HV.
> > > Once the panicked HV has booted back the VM reboots and becomes
> > available.
> > >
> > > I'm running on CentOS 7 mgmt + HVs and NFS primary and secondary
> storage.
> > > The VM has HA enabled service offering.
> > > Host HA or OOBM configuration was not touched.
> > >
> > > Full log http://tmp.nux.ro/W3s-management-server.log
> > >
> > > --
> > > Sent from the Delta quadrant using Borg technology!
> > >
> > > Nux!
> > > www.nux.ro
> > >
> > > ----- Original Message -----
> > > > From: "Rohit Yadav" <ro...@shapeblue.com>
> > > > To: "dev" <de...@cloudstack.apache.org>
> > > > Sent: Wednesday, 17 January, 2018 12:13:33
> > > > Subject: Re: HA issues
> > >
> > > > I performed VM HA sanity checks and was not able to reproduce any
> > > regression
> > > > against two KVM CentOS7 hosts in a cluster.
> > > >
> > > >
> > > > Without the "Host HA" feature, I deployed few HA-enabled VMs on a
> > > > KVM
> > > host2 and
> > > > killed it (powered off). After few minutes of CloudStack attempting
> > > > to
> > > find why
> > > > the host (kvm agent) timed out, CloudStack kicked investigators,
> > > > that eventually led KVM fencers to work and VM HA job kicked to
> > > > start those
> > > few VMs
> > > > on host1 and the KVM host2 was put to "Down" state.
> > > >
> > > >
> > > > - Rohit
> > > >
> > > > <https://cloudstack.apache.org>
> > > >
> > > >
> > > >
> > > > ________________________________
> > > >
> > > > rohit.yadav@shapeblue.com
> > > > www.shapeblue.com<http://www.shapeblue.com>
> > > > 53 Chandos Place, Covent Garden, London  WC2N 4HSUK @shapeblue
> > > >
> > > >
> > > >
> > > > From: Rohit Yadav
> > > > Sent: Wednesday, January 17, 2018 2:39:19 PM
> > > > To: dev
> > > > Subject: Re: HA issues
> > > >
> > > >
> > > > Hi Lucian,
> > > >
> > > >
> > > > The "Host HA" feature is entirely different from VM HA, however, they
> > > may work
> > > > in tandem, so please stop using the terms interchangeably as it may
> > > cause the
> > > > community to believe a regression has been caused.
> > > >
> > > >
> > > > The "Host HA" feature currently ships with only "Host HA" provider
> for
> > > KVM that
> > > > is strictly tied to out-of-band management (IPMI for fencing, i.e
> power
> > > off and
> > > > recovery, i.e. reboot) and NFS (as primary storage). (We also have a
> > > provider
> > > > for simulator, but that's for coverage/testing purposes).
> > > >
> > > >
> > > > Therefore, "Host HA" for KVM (+nfs) currently works only when OOBM is
> > > enabled.
> > > > The frameowkr allows interested parties may write their own HA
> > providers
> > > for a
> > > > hypervisor that can use a different strategy/mechanism for
> > > fencing/recovery of
> > > > hosts (including write a non-IPMI based OOBM plugin) and host/disk
> > > activity
> > > > checker that is non-NFS based.
> > > >
> > > >
> > > > The "Host HA" feature ships disabled by default and does not cause
> any
> > > > interference with VM HA. However, when enabled and configured
> > correctly,
> > > it is
> > > > a known limitation that when it is unable to successfully perform
> > > recovery or
> > > > fencing tasks it may not trigger VM HA. We can discuss how to handle
> > > such cases
> > > > (thoughts?). "Host HA" would try couple of times to recover and
> failing
> > > to do
> > > > so, it would eventually trigger a host fencing task. If it's unable
> to
> > > fence a
> > > > host, it will indefinitely attempt to fence the host (the host state
> > > will be
> > > > stuck at fencing state in cloud.ha_config table for example) and
> alerts
> > > will be
> > > > sent to admin who can do some manual intervention to handle such
> > > situations (if
> > > > you've email/smtp enabled, you should see alert emails).
> > > >
> > > >
> > > > We can discuss how to improve and have a workaround for the case
> you've
> > > hit,
> > > > thanks for sharing.
> > > >
> > > >
> > > > - Rohit
> > > >
> > > > ________________________________
> > > > From: Nux! <nu...@li.nux.ro>
> > > > Sent: Tuesday, January 16, 2018 10:42:35 PM
> > > > To: dev
> > > > Subject: Re: HA issues
> > > >
> > > > Ok, reinstalled and re-tested.
> > > >
> > > > What I've learned:
> > > >
> > > > - HA only works now if OOB is configured, the old way HA no longer
> > > applies -
> > > > this can be good and bad, not everyone has IPMIs
> > > >
> > > > - HA only works if IPMI is reachable. I've pulled the cord on a HV
> and
> > > HA failed
> > > > to do its thing, leaving me with a HV down along with all the VMs
> > running
> > > > there. That's bad.
> > > > I've opened this ticket for it:
> > > > https://issues.apache.org/jira/browse/CLOUDSTACK-10234
> > > >
> > > > Let me know if you need any extra info or stuff to test.
> > > >
> > > > Regards,
> > > > Lucian
> > > >
> > > > --
> > > > Sent from the Delta quadrant using Borg technology!
> > > >
> > > > Nux!
> > > > www.nux.ro
> > > >
> > > > ----- Original Message -----
> > > >> From: "Nux!" <nu...@li.nux.ro>
> > > >> To: "dev" <de...@cloudstack.apache.org>
> > > >> Sent: Tuesday, 16 January, 2018 11:35:58
> > > >> Subject: Re: HA issues
> > > >
> > > >> I'll reinstall my setup and try again, just to be sure I'm working
> on
> > a
> > > clean
> > > >> slate.
> > > >>
> > > >> --
> > > >> Sent from the Delta quadrant using Borg technology!
> > > >>
> > > >> Nux!
> > > >> www.nux.ro
> > > >>
> > > >> ----- Original Message -----
> > > >>> From: "Rohit Yadav" <ro...@shapeblue.com>
> > > >>> To: "dev" <de...@cloudstack.apache.org>
> > > >>> Sent: Tuesday, 16 January, 2018 11:29:51
> > > >>> Subject: Re: HA issues
> > > >>
> > > >>> Hi Lucian,
> > > >>>
> > > >>>
> > > >>> If you're talking about the new HostHA feature (with KVM+nfs+ipmi),
> > > please refer
> > > >>> to following docs:
> > > >>>
> > > >>>
> > > http://docs.cloudstack.apache.org/projects/cloudstack-
> > administration/en/latest/hosts.html#out-of-band-management
> > > >>>
> > > >>> https://cwiki.apache.org/confluence/display/CLOUDSTACK/Host+HA
> > > >>>
> > > >>>
> > > >>> We'll need to you look at logs perhaps create a JIRA ticket with
> the
> > > logs and
> > > >>> details? If you saw ipmi based reboot, then host-ha indeed tried to
> > > recover
> > > >>> i.e. reboot the host, once hostha has done its work it would
> schedule
> > > HA for VM
> > > >>> as soon as the recovery operation succeeds (we've simulator and kvm
> > > based
> > > >>> marvin tests for such scenarios).
> > > >>>
> > > >>>
> > > >>> Can you see it making attempt to schedule VM ha in logs, or any
> > > failure?
> > > >>>
> > > >>>
> > > >>> - Rohit
> > > >>>
> > > >>> <https://cloudstack.apache.org>
> > > >>>
> > > >>>
> > > >>>
> > > >>> ________________________________
> > > >>> From: Nux! <nu...@li.nux.ro>
> > > >>> Sent: Tuesday, January 16, 2018 12:47:56 AM
> > > >>> To: dev
> > > >>> Subject: [4.11] HA issues
> > > >>>
> > > >>> Hi,
> > > >>>
> > > >>> I see there's a new HA engine for KVM and IPMI support which is
> > really
> > > nice,
> > > >>> however it seems hit and miss.
> > > >>> I have created an instance with HA offering, kernel panicked one of
> > the
> > > >>> hypervisors - after a while the server was rebooted via IPMI
> > probably,
> > > but the
> > > >>> instance never moved to a running hypervisor and even after the o
> > > <https://maps.google.com/?q=to+a+running+hypervisor+and+
> > even+after+the+o&entry=gmail&source=g>
> > > riginal
> > > >>> hypervisor came back it was still left in Stopped state.
> > > >>> Is there any extra things I need to set up to have proper HA?
> > > >>>
> > > >>> Regards,
> > > >>> Lucian
> > > >>>
> > > >>> --
> > > >>> Sent from the Delta quadrant using Borg technology!
> > > >>>
> > > >>> Nux!
> > > >>> www.nux.ro
> > > >>>
> > > >>> rohit.yadav@shapeblue.com
> > > >>> www.shapeblue.com<http://www.shapeblue.com>
> > > >>> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK
> > > > > > @shapeblue
> > >
> >
>
>
>
> --
>
> Andrija Panić
>



-- 

Andrija Panić

Re: HA issues

Posted by Simon Weller <sw...@ena.com.INVALID>.
Andrija,


We pushed quite a few PRs on the exception and lockup issues related to Ceph in the agent.


We have a PR for the deletion issue. See if you have it pulled into your release - https://github.com/myENA/cloudstack/pull/9


- Si




________________________________
From: Andrija Panic <an...@gmail.com>
Sent: Saturday, February 17, 2018 1:49 PM
To: dev
Subject: Re: HA issues

Hi Sean,

(we have 2 threads interleaving on the libvirt lockd..) - so, did you
manage to understand what can cause the Agent Disconnect in most cases, for
you specifically? Is there any software (CloudStack) root cause
(disregarding i.e. networking issues etc)

Just our examples, which you should probably not have:

We had CEPH cluster running (with ACS), and there any exception in librbd
would crash JVM and the agent, but this has been fixed mostly -
Now get i.e. agent disconnect when ACS try to delete volume on CEPH (and
for some reason not succeed withing 30 minutes, volume deletion fails) -
then libvirt get's completety stuck (virsh list even dont work)...so  agent
get's disconnect eventually.

It would be good to get rid of agent disconnections in general, obviously
:) so that is why I'm asking (you are on NFS, so would like to see your
experience here).

Thanks

On 16 February 2018 at 21:52, Sean Lair <sl...@ippathways.com> wrote:

> We were in the same situation as Nux.
>
> In our test environment we hit the issue with VMs not getting fenced and
> coming up on two hosts because of VM HA.   However, we updated some of the
> logic for VM HA and turned on libvirtd's locking mechanism.  Now we are
> working great w/o IPMI.  The locking stops the VMs from starting elsewhere,
> and everything recovers very nicely when the host starts responding again.
>
> We are on 4.9.3 and haven't started testing with 4.11 yet, but it may work
> along-side IPMI just fine - it would just have affect the fencing.
> However, we *currently* prefer how we are doing it now, because if the
> agent stops responding, but the host is still up, the VMs continue running
> and no actual downtime is incurred.  Even when VM HA attempts to power on
> the VMs on another host, it just fails the power-up and the VMs continue to
> run on the "agent disconnected" host. The host goes into alarm state and
> our NOC can look into what is wrong the agent on the host.  If IPMI was
> enabled, it sounds like it would power off the host (fence) and force
> downtime for us even if the VMs were actually running OK - and just the
> agent is unreachable.
>
> I plan on submitting our updates via a pull request at some point.  But I
> can also send the updated code to anyone that wants to do some testing
> before then.
>
> -----Original Message-----
> From: Marcus [mailto:shadowsor@gmail.com]
> Sent: Friday, February 16, 2018 11:27 AM
> To: dev@cloudstack.apache.org
> Subject: Re: HA issues
>
> From your other emails it sounds as though you do not have IPMI
> configured, nor host HA enabled, correct? In this case, the correct thing
> to do is nothing. If CloudStack cannot guarantee the VM state (as is the
> case with an unreachable hypervisor), it should do nothing, for fear of
> causing a split brain and corrupting the VM disk (VM running on two hosts).
>
> Clustering and fencing is a tricky proposition. When CloudStack (or any
> other cluster manager) is not configured to or cannot guarantee state then
> things will simply lock up, in this case your HA VM on your broken
> hypervisor will not run elsewhere. This has been the case for a long time
> with CloudStack, HA would only start a VM after the original hypervisor
> agent came back and reported no VM is running.
>
> The new feature, from what I gather, simply adds the possibility of
> CloudStack being able to reach out and shut down the hypervisor to
> guarantee state. At that point it can start the VM elsewhere. If something
> fails in that process (IPMI unreachable, for example, or bad credentials),
> you're still going to be stuck with a VM not coming back.
>
> It's the nature of the thing. I'd be wary of any HA solution that does not
> reach out and guarantee state via host or storage fencing before starting a
> VM elsewhere, as it will be making assumptions. Its entirely possible a VM
> might be unreachable or unable to access it storage for a short while, a
> new instance of the VM is started elsewhere, and the original VM comes back.
>
> On Wed, Jan 17, 2018 at 9:02 AM Nux! <nu...@li.nux.ro> wrote:
>
> > Hi Rohit,
> >
> > I've reinstalled and tested. Still no go with VM HA.
> >
> > What I did was to kernel panic that particular HV ("echo c >
> > /proc/sysrq-trigger" <- this is a proper way to simulate a crash).
> > What happened next is the HV got marked as "Alert", the VM on it was
> > all the time marked as "Running" and it was not migrated to another HV.
> > Once the panicked HV has booted back the VM reboots and becomes
> available.
> >
> > I'm running on CentOS 7 mgmt + HVs and NFS primary and secondary storage.
> > The VM has HA enabled service offering.
> > Host HA or OOBM configuration was not touched.
> >
> > Full log http://tmp.nux.ro/W3s-management-server.log
> >
> > --
> > Sent from the Delta quadrant using Borg technology!
> >
> > Nux!
> > www.nux.ro
> >
> > ----- Original Message -----
> > > From: "Rohit Yadav" <ro...@shapeblue.com>
> > > To: "dev" <de...@cloudstack.apache.org>
> > > Sent: Wednesday, 17 January, 2018 12:13:33
> > > Subject: Re: HA issues
> >
> > > I performed VM HA sanity checks and was not able to reproduce any
> > regression
> > > against two KVM CentOS7 hosts in a cluster.
> > >
> > >
> > > Without the "Host HA" feature, I deployed few HA-enabled VMs on a
> > > KVM
> > host2 and
> > > killed it (powered off). After few minutes of CloudStack attempting
> > > to
> > find why
> > > the host (kvm agent) timed out, CloudStack kicked investigators,
> > > that eventually led KVM fencers to work and VM HA job kicked to
> > > start those
> > few VMs
> > > on host1 and the KVM host2 was put to "Down" state.
> > >
> > >
> > > - Rohit
> > >
> > > <https://cloudstack.apache.org>
> > >
> > >
> > >
> > > ________________________________
> > >
> > > rohit.yadav@shapeblue.com
> > > www.shapeblue.com<http://www.shapeblue.com>
> > > 53 Chandos Place, Covent Garden, London  WC2N 4HSUK @shapeblue
> > >
> > >
> > >
> > > From: Rohit Yadav
> > > Sent: Wednesday, January 17, 2018 2:39:19 PM
> > > To: dev
> > > Subject: Re: HA issues
> > >
> > >
> > > Hi Lucian,
> > >
> > >
> > > The "Host HA" feature is entirely different from VM HA, however, they
> > may work
> > > in tandem, so please stop using the terms interchangeably as it may
> > cause the
> > > community to believe a regression has been caused.
> > >
> > >
> > > The "Host HA" feature currently ships with only "Host HA" provider for
> > KVM that
> > > is strictly tied to out-of-band management (IPMI for fencing, i.e power
> > off and
> > > recovery, i.e. reboot) and NFS (as primary storage). (We also have a
> > provider
> > > for simulator, but that's for coverage/testing purposes).
> > >
> > >
> > > Therefore, "Host HA" for KVM (+nfs) currently works only when OOBM is
> > enabled.
> > > The frameowkr allows interested parties may write their own HA
> providers
> > for a
> > > hypervisor that can use a different strategy/mechanism for
> > fencing/recovery of
> > > hosts (including write a non-IPMI based OOBM plugin) and host/disk
> > activity
> > > checker that is non-NFS based.
> > >
> > >
> > > The "Host HA" feature ships disabled by default and does not cause any
> > > interference with VM HA. However, when enabled and configured
> correctly,
> > it is
> > > a known limitation that when it is unable to successfully perform
> > recovery or
> > > fencing tasks it may not trigger VM HA. We can discuss how to handle
> > such cases
> > > (thoughts?). "Host HA" would try couple of times to recover and failing
> > to do
> > > so, it would eventually trigger a host fencing task. If it's unable to
> > fence a
> > > host, it will indefinitely attempt to fence the host (the host state
> > will be
> > > stuck at fencing state in cloud.ha_config table for example) and alerts
> > will be
> > > sent to admin who can do some manual intervention to handle such
> > situations (if
> > > you've email/smtp enabled, you should see alert emails).
> > >
> > >
> > > We can discuss how to improve and have a workaround for the case you've
> > hit,
> > > thanks for sharing.
> > >
> > >
> > > - Rohit
> > >
> > > ________________________________
> > > From: Nux! <nu...@li.nux.ro>
> > > Sent: Tuesday, January 16, 2018 10:42:35 PM
> > > To: dev
> > > Subject: Re: HA issues
> > >
> > > Ok, reinstalled and re-tested.
> > >
> > > What I've learned:
> > >
> > > - HA only works now if OOB is configured, the old way HA no longer
> > applies -
> > > this can be good and bad, not everyone has IPMIs
> > >
> > > - HA only works if IPMI is reachable. I've pulled the cord on a HV and
> > HA failed
> > > to do its thing, leaving me with a HV down along with all the VMs
> running
> > > there. That's bad.
> > > I've opened this ticket for it:
> > > https://issues.apache.org/jira/browse/CLOUDSTACK-10234
> > >
> > > Let me know if you need any extra info or stuff to test.
> > >
> > > Regards,
> > > Lucian
> > >
> > > --
> > > Sent from the Delta quadrant using Borg technology!
> > >
> > > Nux!
> > > www.nux.ro
> > >
> > > ----- Original Message -----
> > >> From: "Nux!" <nu...@li.nux.ro>
> > >> To: "dev" <de...@cloudstack.apache.org>
> > >> Sent: Tuesday, 16 January, 2018 11:35:58
> > >> Subject: Re: HA issues
> > >
> > >> I'll reinstall my setup and try again, just to be sure I'm working on
> a
> > clean
> > >> slate.
> > >>
> > >> --
> > >> Sent from the Delta quadrant using Borg technology!
> > >>
> > >> Nux!
> > >> www.nux.ro
> > >>
> > >> ----- Original Message -----
> > >>> From: "Rohit Yadav" <ro...@shapeblue.com>
> > >>> To: "dev" <de...@cloudstack.apache.org>
> > >>> Sent: Tuesday, 16 January, 2018 11:29:51
> > >>> Subject: Re: HA issues
> > >>
> > >>> Hi Lucian,
> > >>>
> > >>>
> > >>> If you're talking about the new HostHA feature (with KVM+nfs+ipmi),
> > please refer
> > >>> to following docs:
> > >>>
> > >>>
> > http://docs.cloudstack.apache.org/projects/cloudstack-
> administration/en/latest/hosts.html#out-of-band-management
> > >>>
> > >>> https://cwiki.apache.org/confluence/display/CLOUDSTACK/Host+HA
> > >>>
> > >>>
> > >>> We'll need to you look at logs perhaps create a JIRA ticket with the
> > logs and
> > >>> details? If you saw ipmi based reboot, then host-ha indeed tried to
> > recover
> > >>> i.e. reboot the host, once hostha has done its work it would schedule
> > HA for VM
> > >>> as soon as the recovery operation succeeds (we've simulator and kvm
> > based
> > >>> marvin tests for such scenarios).
> > >>>
> > >>>
> > >>> Can you see it making attempt to schedule VM ha in logs, or any
> > failure?
> > >>>
> > >>>
> > >>> - Rohit
> > >>>
> > >>> <https://cloudstack.apache.org>
> > >>>
> > >>>
> > >>>
> > >>> ________________________________
> > >>> From: Nux! <nu...@li.nux.ro>
> > >>> Sent: Tuesday, January 16, 2018 12:47:56 AM
> > >>> To: dev
> > >>> Subject: [4.11] HA issues
> > >>>
> > >>> Hi,
> > >>>
> > >>> I see there's a new HA engine for KVM and IPMI support which is
> really
> > nice,
> > >>> however it seems hit and miss.
> > >>> I have created an instance with HA offering, kernel panicked one of
> the
> > >>> hypervisors - after a while the server was rebooted via IPMI
> probably,
> > but the
> > >>> instance never moved to a running hypervisor and even after the o
> > <https://maps.google.com/?q=to+a+running+hypervisor+and+
> even+after+the+o&entry=gmail&source=g>
> > riginal
> > >>> hypervisor came back it was still left in Stopped state.
> > >>> Is there any extra things I need to set up to have proper HA?
> > >>>
> > >>> Regards,
> > >>> Lucian
> > >>>
> > >>> --
> > >>> Sent from the Delta quadrant using Borg technology!
> > >>>
> > >>> Nux!
> > >>> www.nux.ro
> > >>>
> > >>> rohit.yadav@shapeblue.com
> > >>> www.shapeblue.com<http://www.shapeblue.com>
> > >>> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK
> > > > > @shapeblue
> >
>



--

Andrija Panić

RE: HA issues

Posted by Sean Lair <sl...@ippathways.com>.
Thanks so much for the info - we'll look at that line also!

I'll let you know when we create a PR for our changes - in case you want to review them for your environment

-----Original Message-----
From: Andrija Panic [mailto:andrija.panic@gmail.com] 
Sent: Tuesday, February 20, 2018 5:16 PM
To: dev <de...@cloudstack.apache.org>
Subject: Re: HA issues

That is good to hear ( no NFS issues causing Agent Disconnect).

I assume you are using "normal" NFS solution with proper HA and no ZFS (kernel panic etc), but anyway be aware of this one

https://github.com/apache/cloudstack/blob/e532b574ddb186a117da638fb6059356fe7c266c/scripts/vm/hypervisor/kvm/kvmheartbeat.sh#L161



we used to comment this line, because we did have some issues with communication link, and this commented line saved our a$$ few times :)

CHeers

On 20 February 2018 at 20:50, Sean Lair <sl...@ippathways.com> wrote:

> Hi Andrija
>
> We are currently running XenServer in production.  We are working on 
> moving to KVM and have it deployed in a development environment.
>
> The team is putting CloudStack + KVM through its paces and that is 
> when it was discovered how broken VM HA is in 4.9.3.  Initially our 
> patches fixed VM HA, but just caused VMs to get started on two hosts 
> during failure testing.  The libvirt lockd has solved that issue thus far.
>
> Short answer to you question is :-), we were not having problems with 
> Agent Disconnects in a production environment.  It was our testing/QA 
> that revealed the issues.  Our NFS has been stable so far, no issues 
> with the agent crashing/stopping that wasn't initiated by the team's testing.
>
> Thanks
> Sean
>
>
> -----Original Message-----
> From: Andrija Panic [mailto:andrija.panic@gmail.com]
> Sent: Saturday, February 17, 2018 1:49 PM
> To: dev <de...@cloudstack.apache.org>
> Subject: Re: HA issues
>
> Hi Sean,
>
> (we have 2 threads interleaving on the libvirt lockd..) - so, did you 
> manage to understand what can cause the Agent Disconnect in most 
> cases, for you specifically? Is there any software (CloudStack) root 
> cause (disregarding i.e. networking issues etc)
>
> Just our examples, which you should probably not have:
>
> We had CEPH cluster running (with ACS), and there any exception in 
> librbd would crash JVM and the agent, but this has been fixed mostly - 
> Now get i.e. agent disconnect when ACS try to delete volume on CEPH 
> (and for some reason not succeed withing 30 minutes, volume deletion 
> fails) - then libvirt get's completety stuck (virsh list even dont 
> work)...so  agent get's disconnect eventually.
>
> It would be good to get rid of agent disconnections in general, 
> obviously
> :) so that is why I'm asking (you are on NFS, so would like to see 
> your experience here).
>
> Thanks
>
> On 16 February 2018 at 21:52, Sean Lair <sl...@ippathways.com> wrote:
>
> > We were in the same situation as Nux.
> >
> > In our test environment we hit the issue with VMs not getting fenced and
> > coming up on two hosts because of VM HA.   However, we updated some of
> the
> > logic for VM HA and turned on libvirtd's locking mechanism.  Now we 
> > are working great w/o IPMI.  The locking stops the VMs from starting 
> > elsewhere, and everything recovers very nicely when the host starts
> responding again.
> >
> > We are on 4.9.3 and haven't started testing with 4.11 yet, but it 
> > may work along-side IPMI just fine - it would just have affect the fencing.
> > However, we *currently* prefer how we are doing it now, because if 
> > the agent stops responding, but the host is still up, the VMs 
> > continue running and no actual downtime is incurred.  Even when VM 
> > HA attempts to power on the VMs on another host, it just fails the 
> > power-up and the VMs continue to run on the "agent disconnected" 
> > host. The host goes into alarm state and our NOC can look into what 
> > is wrong the agent on the host.  If IPMI was enabled, it sounds like 
> > it would power off the host (fence) and force downtime for us even 
> > if the VMs were actually running OK - and just the agent is unreachable.
> >
> > I plan on submitting our updates via a pull request at some point.
> > But I can also send the updated code to anyone that wants to do some 
> > testing before then.
> >
> > -----Original Message-----
> > From: Marcus [mailto:shadowsor@gmail.com]
> > Sent: Friday, February 16, 2018 11:27 AM
> > To: dev@cloudstack.apache.org
> > Subject: Re: HA issues
> >
> > From your other emails it sounds as though you do not have IPMI 
> > configured, nor host HA enabled, correct? In this case, the correct 
> > thing to do is nothing. If CloudStack cannot guarantee the VM state 
> > (as is the case with an unreachable hypervisor), it should do 
> > nothing, for fear of causing a split brain and corrupting the VM 
> > disk (VM running
> on two hosts).
> >
> > Clustering and fencing is a tricky proposition. When CloudStack (or 
> > any other cluster manager) is not configured to or cannot guarantee 
> > state then things will simply lock up, in this case your HA VM on 
> > your broken hypervisor will not run elsewhere. This has been the 
> > case for a long time with CloudStack, HA would only start a VM after 
> > the original hypervisor agent came back and reported no VM is running.
> >
> > The new feature, from what I gather, simply adds the possibility of 
> > CloudStack being able to reach out and shut down the hypervisor to 
> > guarantee state. At that point it can start the VM elsewhere. If 
> > something fails in that process (IPMI unreachable, for example, or 
> > bad credentials), you're still going to be stuck with a VM not coming back.
> >
> > It's the nature of the thing. I'd be wary of any HA solution that 
> > does not reach out and guarantee state via host or storage fencing 
> > before starting a VM elsewhere, as it will be making assumptions. 
> > Its entirely possible a VM might be unreachable or unable to access 
> > it storage for a short while, a new instance of the VM is started
> elsewhere, and the original VM comes back.
> >
> > On Wed, Jan 17, 2018 at 9:02 AM Nux! <nu...@li.nux.ro> wrote:
> >
> > > Hi Rohit,
> > >
> > > I've reinstalled and tested. Still no go with VM HA.
> > >
> > > What I did was to kernel panic that particular HV ("echo c > 
> > > /proc/sysrq-trigger" <- this is a proper way to simulate a crash).
> > > What happened next is the HV got marked as "Alert", the VM on it 
> > > was all the time marked as "Running" and it was not migrated to another HV.
> > > Once the panicked HV has booted back the VM reboots and becomes
> > available.
> > >
> > > I'm running on CentOS 7 mgmt + HVs and NFS primary and secondary
> storage.
> > > The VM has HA enabled service offering.
> > > Host HA or OOBM configuration was not touched.
> > >
> > > Full log http://tmp.nux.ro/W3s-management-server.log
> > >
> > > --
> > > Sent from the Delta quadrant using Borg technology!
> > >
> > > Nux!
> > > www.nux.ro
> > >
> > > ----- Original Message -----
> > > > From: "Rohit Yadav" <ro...@shapeblue.com>
> > > > To: "dev" <de...@cloudstack.apache.org>
> > > > Sent: Wednesday, 17 January, 2018 12:13:33
> > > > Subject: Re: HA issues
> > >
> > > > I performed VM HA sanity checks and was not able to reproduce 
> > > > any
> > > regression
> > > > against two KVM CentOS7 hosts in a cluster.
> > > >
> > > >
> > > > Without the "Host HA" feature, I deployed few HA-enabled VMs on 
> > > > a KVM
> > > host2 and
> > > > killed it (powered off). After few minutes of CloudStack 
> > > > attempting to
> > > find why
> > > > the host (kvm agent) timed out, CloudStack kicked investigators, 
> > > > that eventually led KVM fencers to work and VM HA job kicked to 
> > > > start those
> > > few VMs
> > > > on host1 and the KVM host2 was put to "Down" state.
> > > >
> > > >
> > > > - Rohit
> > > >
> > > > <https://cloudstack.apache.org>
> > > >
> > > >
> > > >
> > > > ________________________________
> > > >
> > > > rohit.yadav@shapeblue.com
> > > > www.shapeblue.com
> > > > 53 Chandos Place, Covent Garden, London  WC2N 4HSUK @shapeblue
> > > >
> > > >
> > > >
> > > > From: Rohit Yadav
> > > > Sent: Wednesday, January 17, 2018 2:39:19 PM
> > > > To: dev
> > > > Subject: Re: HA issues
> > > >
> > > >
> > > > Hi Lucian,
> > > >
> > > >
> > > > The "Host HA" feature is entirely different from VM HA, however, 
> > > > they
> > > may work
> > > > in tandem, so please stop using the terms interchangeably as it 
> > > > may
> > > cause the
> > > > community to believe a regression has been caused.
> > > >
> > > >
> > > > The "Host HA" feature currently ships with only "Host HA" 
> > > > provider for
> > > KVM that
> > > > is strictly tied to out-of-band management (IPMI for fencing, 
> > > > i.e power
> > > off and
> > > > recovery, i.e. reboot) and NFS (as primary storage). (We also 
> > > > have a
> > > provider
> > > > for simulator, but that's for coverage/testing purposes).
> > > >
> > > >
> > > > Therefore, "Host HA" for KVM (+nfs) currently works only when 
> > > > OOBM is
> > > enabled.
> > > > The frameowkr allows interested parties may write their own HA
> > providers
> > > for a
> > > > hypervisor that can use a different strategy/mechanism for
> > > fencing/recovery of
> > > > hosts (including write a non-IPMI based OOBM plugin) and 
> > > > host/disk
> > > activity
> > > > checker that is non-NFS based.
> > > >
> > > >
> > > > The "Host HA" feature ships disabled by default and does not 
> > > > cause any interference with VM HA. However, when enabled and 
> > > > configured
> > correctly,
> > > it is
> > > > a known limitation that when it is unable to successfully 
> > > > perform
> > > recovery or
> > > > fencing tasks it may not trigger VM HA. We can discuss how to 
> > > > handle
> > > such cases
> > > > (thoughts?). "Host HA" would try couple of times to recover and 
> > > > failing
> > > to do
> > > > so, it would eventually trigger a host fencing task. If it's 
> > > > unable to
> > > fence a
> > > > host, it will indefinitely attempt to fence the host (the host 
> > > > state
> > > will be
> > > > stuck at fencing state in cloud.ha_config table for example) and 
> > > > alerts
> > > will be
> > > > sent to admin who can do some manual intervention to handle such
> > > situations (if
> > > > you've email/smtp enabled, you should see alert emails).
> > > >
> > > >
> > > > We can discuss how to improve and have a workaround for the case 
> > > > you've
> > > hit,
> > > > thanks for sharing.
> > > >
> > > >
> > > > - Rohit
> > > >
> > > > ________________________________
> > > > From: Nux! <nu...@li.nux.ro>
> > > > Sent: Tuesday, January 16, 2018 10:42:35 PM
> > > > To: dev
> > > > Subject: Re: HA issues
> > > >
> > > > Ok, reinstalled and re-tested.
> > > >
> > > > What I've learned:
> > > >
> > > > - HA only works now if OOB is configured, the old way HA no 
> > > > longer
> > > applies -
> > > > this can be good and bad, not everyone has IPMIs
> > > >
> > > > - HA only works if IPMI is reachable. I've pulled the cord on a 
> > > > HV and
> > > HA failed
> > > > to do its thing, leaving me with a HV down along with all the 
> > > > VMs
> > running
> > > > there. That's bad.
> > > > I've opened this ticket for it:
> > > > https://issues.apache.org/jira/browse/CLOUDSTACK-10234
> > > >
> > > > Let me know if you need any extra info or stuff to test.
> > > >
> > > > Regards,
> > > > Lucian
> > > >
> > > > --
> > > > Sent from the Delta quadrant using Borg technology!
> > > >
> > > > Nux!
> > > > www.nux.ro
> > > >
> > > > ----- Original Message -----
> > > >> From: "Nux!" <nu...@li.nux.ro>
> > > >> To: "dev" <de...@cloudstack.apache.org>
> > > >> Sent: Tuesday, 16 January, 2018 11:35:58
> > > >> Subject: Re: HA issues
> > > >
> > > >> I'll reinstall my setup and try again, just to be sure I'm 
> > > >> working on
> > a
> > > clean
> > > >> slate.
> > > >>
> > > >> --
> > > >> Sent from the Delta quadrant using Borg technology!
> > > >>
> > > >> Nux!
> > > >> www.nux.ro
> > > >>
> > > >> ----- Original Message -----
> > > >>> From: "Rohit Yadav" <ro...@shapeblue.com>
> > > >>> To: "dev" <de...@cloudstack.apache.org>
> > > >>> Sent: Tuesday, 16 January, 2018 11:29:51
> > > >>> Subject: Re: HA issues
> > > >>
> > > >>> Hi Lucian,
> > > >>>
> > > >>>
> > > >>> If you're talking about the new HostHA feature (with
> > > >>> KVM+nfs+ipmi),
> > > please refer
> > > >>> to following docs:
> > > >>>
> > > >>>
> > > http://docs.cloudstack.apache.org/projects/cloudstack-
> > administration/en/latest/hosts.html#out-of-band-management
> > > >>>
> > > >>> https://cwiki.apache.org/confluence/display/CLOUDSTACK/Host+HA
> > > >>>
> > > >>>
> > > >>> We'll need to you look at logs perhaps create a JIRA ticket 
> > > >>> with the
> > > logs and
> > > >>> details? If you saw ipmi based reboot, then host-ha indeed 
> > > >>> tried to
> > > recover
> > > >>> i.e. reboot the host, once hostha has done its work it would 
> > > >>> schedule
> > > HA for VM
> > > >>> as soon as the recovery operation succeeds (we've simulator 
> > > >>> and kvm
> > > based
> > > >>> marvin tests for such scenarios).
> > > >>>
> > > >>>
> > > >>> Can you see it making attempt to schedule VM ha in logs, or 
> > > >>> any
> > > failure?
> > > >>>
> > > >>>
> > > >>> - Rohit
> > > >>>
> > > >>> <https://cloudstack.apache.org>
> > > >>>
> > > >>>
> > > >>>
> > > >>> ________________________________
> > > >>> From: Nux! <nu...@li.nux.ro>
> > > >>> Sent: Tuesday, January 16, 2018 12:47:56 AM
> > > >>> To: dev
> > > >>> Subject: [4.11] HA issues
> > > >>>
> > > >>> Hi,
> > > >>>
> > > >>> I see there's a new HA engine for KVM and IPMI support which 
> > > >>> is
> > really
> > > nice,
> > > >>> however it seems hit and miss.
> > > >>> I have created an instance with HA offering, kernel panicked 
> > > >>> one of
> > the
> > > >>> hypervisors - after a while the server was rebooted via IPMI
> > probably,
> > > but the
> > > >>> instance never moved to a running hypervisor and even after 
> > > >>> the o
> > > <https://maps.google.com/?q=to+a+running+hypervisor+and+
> > even+after+the+o&entry=gmail&source=g>
> > > riginal
> > > >>> hypervisor came back it was still left in Stopped state.
> > > >>> Is there any extra things I need to set up to have proper HA?
> > > >>>
> > > >>> Regards,
> > > >>> Lucian
> > > >>>
> > > >>> --
> > > >>> Sent from the Delta quadrant using Borg technology!
> > > >>>
> > > >>> Nux!
> > > >>> www.nux.ro
> > > >>>
> > > >>> rohit.yadav@shapeblue.com
> > > >>> www.shapeblue.com<http://www.shapeblue.com>
> > > >>> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK
> > > > > > @shapeblue
> > >
> >
>
>
>
> --
>
> Andrija Panić
>



-- 

Andrija Panić

RE: HA issues

Posted by Sean Lair <sl...@ippathways.com>.
Based on your note we made the following change:

https://github.com/apache/cloudstack/pull/2472

It adds a sleep between retries and then stops the cloudstack-agent if it still can't write the heartbeat file after the retries...  At least this way an alert is raised instead of a hard reboot.  Also, it allows HA to kick-in and handle correctly.


-----Original Message-----
From: Andrija Panic [mailto:andrija.panic@gmail.com] 
Sent: Tuesday, February 20, 2018 5:16 PM
To: dev <de...@cloudstack.apache.org>
Subject: Re: HA issues

That is good to hear ( no NFS issues causing Agent Disconnect).

I assume you are using "normal" NFS solution with proper HA and no ZFS (kernel panic etc), but anyway be aware of this one

https://github.com/apache/cloudstack/blob/e532b574ddb186a117da638fb6059356fe7c266c/scripts/vm/hypervisor/kvm/kvmheartbeat.sh#L161



we used to comment this line, because we did have some issues with communication link, and this commented line saved our a$$ few times :)

CHeers

On 20 February 2018 at 20:50, Sean Lair <sl...@ippathways.com> wrote:

> Hi Andrija
>
> We are currently running XenServer in production.  We are working on 
> moving to KVM and have it deployed in a development environment.
>
> The team is putting CloudStack + KVM through its paces and that is 
> when it was discovered how broken VM HA is in 4.9.3.  Initially our 
> patches fixed VM HA, but just caused VMs to get started on two hosts 
> during failure testing.  The libvirt lockd has solved that issue thus far.
>
> Short answer to you question is :-), we were not having problems with 
> Agent Disconnects in a production environment.  It was our testing/QA 
> that revealed the issues.  Our NFS has been stable so far, no issues 
> with the agent crashing/stopping that wasn't initiated by the team's testing.
>
> Thanks
> Sean
>
>
> -----Original Message-----
> From: Andrija Panic [mailto:andrija.panic@gmail.com]
> Sent: Saturday, February 17, 2018 1:49 PM
> To: dev <de...@cloudstack.apache.org>
> Subject: Re: HA issues
>
> Hi Sean,
>
> (we have 2 threads interleaving on the libvirt lockd..) - so, did you 
> manage to understand what can cause the Agent Disconnect in most 
> cases, for you specifically? Is there any software (CloudStack) root 
> cause (disregarding i.e. networking issues etc)
>
> Just our examples, which you should probably not have:
>
> We had CEPH cluster running (with ACS), and there any exception in 
> librbd would crash JVM and the agent, but this has been fixed mostly - 
> Now get i.e. agent disconnect when ACS try to delete volume on CEPH 
> (and for some reason not succeed withing 30 minutes, volume deletion 
> fails) - then libvirt get's completety stuck (virsh list even dont 
> work)...so  agent get's disconnect eventually.
>
> It would be good to get rid of agent disconnections in general, 
> obviously
> :) so that is why I'm asking (you are on NFS, so would like to see 
> your experience here).
>
> Thanks
>
> On 16 February 2018 at 21:52, Sean Lair <sl...@ippathways.com> wrote:
>
> > We were in the same situation as Nux.
> >
> > In our test environment we hit the issue with VMs not getting fenced and
> > coming up on two hosts because of VM HA.   However, we updated some of
> the
> > logic for VM HA and turned on libvirtd's locking mechanism.  Now we 
> > are working great w/o IPMI.  The locking stops the VMs from starting 
> > elsewhere, and everything recovers very nicely when the host starts
> responding again.
> >
> > We are on 4.9.3 and haven't started testing with 4.11 yet, but it 
> > may work along-side IPMI just fine - it would just have affect the fencing.
> > However, we *currently* prefer how we are doing it now, because if 
> > the agent stops responding, but the host is still up, the VMs 
> > continue running and no actual downtime is incurred.  Even when VM 
> > HA attempts to power on the VMs on another host, it just fails the 
> > power-up and the VMs continue to run on the "agent disconnected" 
> > host. The host goes into alarm state and our NOC can look into what 
> > is wrong the agent on the host.  If IPMI was enabled, it sounds like 
> > it would power off the host (fence) and force downtime for us even 
> > if the VMs were actually running OK - and just the agent is unreachable.
> >
> > I plan on submitting our updates via a pull request at some point.
> > But I can also send the updated code to anyone that wants to do some 
> > testing before then.
> >
> > -----Original Message-----
> > From: Marcus [mailto:shadowsor@gmail.com]
> > Sent: Friday, February 16, 2018 11:27 AM
> > To: dev@cloudstack.apache.org
> > Subject: Re: HA issues
> >
> > From your other emails it sounds as though you do not have IPMI 
> > configured, nor host HA enabled, correct? In this case, the correct 
> > thing to do is nothing. If CloudStack cannot guarantee the VM state 
> > (as is the case with an unreachable hypervisor), it should do 
> > nothing, for fear of causing a split brain and corrupting the VM 
> > disk (VM running
> on two hosts).
> >
> > Clustering and fencing is a tricky proposition. When CloudStack (or 
> > any other cluster manager) is not configured to or cannot guarantee 
> > state then things will simply lock up, in this case your HA VM on 
> > your broken hypervisor will not run elsewhere. This has been the 
> > case for a long time with CloudStack, HA would only start a VM after 
> > the original hypervisor agent came back and reported no VM is running.
> >
> > The new feature, from what I gather, simply adds the possibility of 
> > CloudStack being able to reach out and shut down the hypervisor to 
> > guarantee state. At that point it can start the VM elsewhere. If 
> > something fails in that process (IPMI unreachable, for example, or 
> > bad credentials), you're still going to be stuck with a VM not coming back.
> >
> > It's the nature of the thing. I'd be wary of any HA solution that 
> > does not reach out and guarantee state via host or storage fencing 
> > before starting a VM elsewhere, as it will be making assumptions. 
> > Its entirely possible a VM might be unreachable or unable to access 
> > it storage for a short while, a new instance of the VM is started
> elsewhere, and the original VM comes back.
> >
> > On Wed, Jan 17, 2018 at 9:02 AM Nux! <nu...@li.nux.ro> wrote:
> >
> > > Hi Rohit,
> > >
> > > I've reinstalled and tested. Still no go with VM HA.
> > >
> > > What I did was to kernel panic that particular HV ("echo c > 
> > > /proc/sysrq-trigger" <- this is a proper way to simulate a crash).
> > > What happened next is the HV got marked as "Alert", the VM on it 
> > > was all the time marked as "Running" and it was not migrated to another HV.
> > > Once the panicked HV has booted back the VM reboots and becomes
> > available.
> > >
> > > I'm running on CentOS 7 mgmt + HVs and NFS primary and secondary
> storage.
> > > The VM has HA enabled service offering.
> > > Host HA or OOBM configuration was not touched.
> > >
> > > Full log http://tmp.nux.ro/W3s-management-server.log
> > >
> > > --
> > > Sent from the Delta quadrant using Borg technology!
> > >
> > > Nux!
> > > www.nux.ro
> > >
> > > ----- Original Message -----
> > > > From: "Rohit Yadav" <ro...@shapeblue.com>
> > > > To: "dev" <de...@cloudstack.apache.org>
> > > > Sent: Wednesday, 17 January, 2018 12:13:33
> > > > Subject: Re: HA issues
> > >
> > > > I performed VM HA sanity checks and was not able to reproduce 
> > > > any
> > > regression
> > > > against two KVM CentOS7 hosts in a cluster.
> > > >
> > > >
> > > > Without the "Host HA" feature, I deployed few HA-enabled VMs on 
> > > > a KVM
> > > host2 and
> > > > killed it (powered off). After few minutes of CloudStack 
> > > > attempting to
> > > find why
> > > > the host (kvm agent) timed out, CloudStack kicked investigators, 
> > > > that eventually led KVM fencers to work and VM HA job kicked to 
> > > > start those
> > > few VMs
> > > > on host1 and the KVM host2 was put to "Down" state.
> > > >
> > > >
> > > > - Rohit
> > > >
> > > > <https://cloudstack.apache.org>
> > > >
> > > >
> > > >
> > > > ________________________________
> > > >
> > > > rohit.yadav@shapeblue.com
> > > > www.shapeblue.com
> > > > 53 Chandos Place, Covent Garden, London  WC2N 4HSUK @shapeblue
> > > >
> > > >
> > > >
> > > > From: Rohit Yadav
> > > > Sent: Wednesday, January 17, 2018 2:39:19 PM
> > > > To: dev
> > > > Subject: Re: HA issues
> > > >
> > > >
> > > > Hi Lucian,
> > > >
> > > >
> > > > The "Host HA" feature is entirely different from VM HA, however, 
> > > > they
> > > may work
> > > > in tandem, so please stop using the terms interchangeably as it 
> > > > may
> > > cause the
> > > > community to believe a regression has been caused.
> > > >
> > > >
> > > > The "Host HA" feature currently ships with only "Host HA" 
> > > > provider for
> > > KVM that
> > > > is strictly tied to out-of-band management (IPMI for fencing, 
> > > > i.e power
> > > off and
> > > > recovery, i.e. reboot) and NFS (as primary storage). (We also 
> > > > have a
> > > provider
> > > > for simulator, but that's for coverage/testing purposes).
> > > >
> > > >
> > > > Therefore, "Host HA" for KVM (+nfs) currently works only when 
> > > > OOBM is
> > > enabled.
> > > > The frameowkr allows interested parties may write their own HA
> > providers
> > > for a
> > > > hypervisor that can use a different strategy/mechanism for
> > > fencing/recovery of
> > > > hosts (including write a non-IPMI based OOBM plugin) and 
> > > > host/disk
> > > activity
> > > > checker that is non-NFS based.
> > > >
> > > >
> > > > The "Host HA" feature ships disabled by default and does not 
> > > > cause any interference with VM HA. However, when enabled and 
> > > > configured
> > correctly,
> > > it is
> > > > a known limitation that when it is unable to successfully 
> > > > perform
> > > recovery or
> > > > fencing tasks it may not trigger VM HA. We can discuss how to 
> > > > handle
> > > such cases
> > > > (thoughts?). "Host HA" would try couple of times to recover and 
> > > > failing
> > > to do
> > > > so, it would eventually trigger a host fencing task. If it's 
> > > > unable to
> > > fence a
> > > > host, it will indefinitely attempt to fence the host (the host 
> > > > state
> > > will be
> > > > stuck at fencing state in cloud.ha_config table for example) and 
> > > > alerts
> > > will be
> > > > sent to admin who can do some manual intervention to handle such
> > > situations (if
> > > > you've email/smtp enabled, you should see alert emails).
> > > >
> > > >
> > > > We can discuss how to improve and have a workaround for the case 
> > > > you've
> > > hit,
> > > > thanks for sharing.
> > > >
> > > >
> > > > - Rohit
> > > >
> > > > ________________________________
> > > > From: Nux! <nu...@li.nux.ro>
> > > > Sent: Tuesday, January 16, 2018 10:42:35 PM
> > > > To: dev
> > > > Subject: Re: HA issues
> > > >
> > > > Ok, reinstalled and re-tested.
> > > >
> > > > What I've learned:
> > > >
> > > > - HA only works now if OOB is configured, the old way HA no 
> > > > longer
> > > applies -
> > > > this can be good and bad, not everyone has IPMIs
> > > >
> > > > - HA only works if IPMI is reachable. I've pulled the cord on a 
> > > > HV and
> > > HA failed
> > > > to do its thing, leaving me with a HV down along with all the 
> > > > VMs
> > running
> > > > there. That's bad.
> > > > I've opened this ticket for it:
> > > > https://issues.apache.org/jira/browse/CLOUDSTACK-10234
> > > >
> > > > Let me know if you need any extra info or stuff to test.
> > > >
> > > > Regards,
> > > > Lucian
> > > >
> > > > --
> > > > Sent from the Delta quadrant using Borg technology!
> > > >
> > > > Nux!
> > > > www.nux.ro
> > > >
> > > > ----- Original Message -----
> > > >> From: "Nux!" <nu...@li.nux.ro>
> > > >> To: "dev" <de...@cloudstack.apache.org>
> > > >> Sent: Tuesday, 16 January, 2018 11:35:58
> > > >> Subject: Re: HA issues
> > > >
> > > >> I'll reinstall my setup and try again, just to be sure I'm 
> > > >> working on
> > a
> > > clean
> > > >> slate.
> > > >>
> > > >> --
> > > >> Sent from the Delta quadrant using Borg technology!
> > > >>
> > > >> Nux!
> > > >> www.nux.ro
> > > >>
> > > >> ----- Original Message -----
> > > >>> From: "Rohit Yadav" <ro...@shapeblue.com>
> > > >>> To: "dev" <de...@cloudstack.apache.org>
> > > >>> Sent: Tuesday, 16 January, 2018 11:29:51
> > > >>> Subject: Re: HA issues
> > > >>
> > > >>> Hi Lucian,
> > > >>>
> > > >>>
> > > >>> If you're talking about the new HostHA feature (with
> > > >>> KVM+nfs+ipmi),
> > > please refer
> > > >>> to following docs:
> > > >>>
> > > >>>
> > > http://docs.cloudstack.apache.org/projects/cloudstack-
> > administration/en/latest/hosts.html#out-of-band-management
> > > >>>
> > > >>> https://cwiki.apache.org/confluence/display/CLOUDSTACK/Host+HA
> > > >>>
> > > >>>
> > > >>> We'll need to you look at logs perhaps create a JIRA ticket 
> > > >>> with the
> > > logs and
> > > >>> details? If you saw ipmi based reboot, then host-ha indeed 
> > > >>> tried to
> > > recover
> > > >>> i.e. reboot the host, once hostha has done its work it would 
> > > >>> schedule
> > > HA for VM
> > > >>> as soon as the recovery operation succeeds (we've simulator 
> > > >>> and kvm
> > > based
> > > >>> marvin tests for such scenarios).
> > > >>>
> > > >>>
> > > >>> Can you see it making attempt to schedule VM ha in logs, or 
> > > >>> any
> > > failure?
> > > >>>
> > > >>>
> > > >>> - Rohit
> > > >>>
> > > >>> <https://cloudstack.apache.org>
> > > >>>
> > > >>>
> > > >>>
> > > >>> ________________________________
> > > >>> From: Nux! <nu...@li.nux.ro>
> > > >>> Sent: Tuesday, January 16, 2018 12:47:56 AM
> > > >>> To: dev
> > > >>> Subject: [4.11] HA issues
> > > >>>
> > > >>> Hi,
> > > >>>
> > > >>> I see there's a new HA engine for KVM and IPMI support which 
> > > >>> is
> > really
> > > nice,
> > > >>> however it seems hit and miss.
> > > >>> I have created an instance with HA offering, kernel panicked 
> > > >>> one of
> > the
> > > >>> hypervisors - after a while the server was rebooted via IPMI
> > probably,
> > > but the
> > > >>> instance never moved to a running hypervisor and even after 
> > > >>> the o
> > > <https://maps.google.com/?q=to+a+running+hypervisor+and+
> > even+after+the+o&entry=gmail&source=g>
> > > riginal
> > > >>> hypervisor came back it was still left in Stopped state.
> > > >>> Is there any extra things I need to set up to have proper HA?
> > > >>>
> > > >>> Regards,
> > > >>> Lucian
> > > >>>
> > > >>> --
> > > >>> Sent from the Delta quadrant using Borg technology!
> > > >>>
> > > >>> Nux!
> > > >>> www.nux.ro
> > > >>>
> > > >>> rohit.yadav@shapeblue.com
> > > >>> www.shapeblue.com<http://www.shapeblue.com>
> > > >>> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK
> > > > > > @shapeblue
> > >
> >
>
>
>
> --
>
> Andrija Panić
>



-- 

Andrija Panić

Re: HA issues

Posted by Andrija Panic <an...@gmail.com>.
That is good to hear ( no NFS issues causing Agent Disconnect).

I assume you are using "normal" NFS solution with proper HA and no ZFS
(kernel panic etc), but anyway be aware of this one

https://github.com/apache/cloudstack/blob/e532b574ddb186a117da638fb6059356fe7c266c/scripts/vm/hypervisor/kvm/kvmheartbeat.sh#L161



we used to comment this line, because we did have some issues with
communication link, and this commented line saved our a$$ few times :)

CHeers

On 20 February 2018 at 20:50, Sean Lair <sl...@ippathways.com> wrote:

> Hi Andrija
>
> We are currently running XenServer in production.  We are working on
> moving to KVM and have it deployed in a development environment.
>
> The team is putting CloudStack + KVM through its paces and that is when it
> was discovered how broken VM HA is in 4.9.3.  Initially our patches fixed
> VM HA, but just caused VMs to get started on two hosts during failure
> testing.  The libvirt lockd has solved that issue thus far.
>
> Short answer to you question is :-), we were not having problems with
> Agent Disconnects in a production environment.  It was our testing/QA that
> revealed the issues.  Our NFS has been stable so far, no issues with the
> agent crashing/stopping that wasn't initiated by the team's testing.
>
> Thanks
> Sean
>
>
> -----Original Message-----
> From: Andrija Panic [mailto:andrija.panic@gmail.com]
> Sent: Saturday, February 17, 2018 1:49 PM
> To: dev <de...@cloudstack.apache.org>
> Subject: Re: HA issues
>
> Hi Sean,
>
> (we have 2 threads interleaving on the libvirt lockd..) - so, did you
> manage to understand what can cause the Agent Disconnect in most cases, for
> you specifically? Is there any software (CloudStack) root cause
> (disregarding i.e. networking issues etc)
>
> Just our examples, which you should probably not have:
>
> We had CEPH cluster running (with ACS), and there any exception in librbd
> would crash JVM and the agent, but this has been fixed mostly - Now get
> i.e. agent disconnect when ACS try to delete volume on CEPH (and for some
> reason not succeed withing 30 minutes, volume deletion fails) - then
> libvirt get's completety stuck (virsh list even dont work)...so  agent
> get's disconnect eventually.
>
> It would be good to get rid of agent disconnections in general, obviously
> :) so that is why I'm asking (you are on NFS, so would like to see your
> experience here).
>
> Thanks
>
> On 16 February 2018 at 21:52, Sean Lair <sl...@ippathways.com> wrote:
>
> > We were in the same situation as Nux.
> >
> > In our test environment we hit the issue with VMs not getting fenced and
> > coming up on two hosts because of VM HA.   However, we updated some of
> the
> > logic for VM HA and turned on libvirtd's locking mechanism.  Now we
> > are working great w/o IPMI.  The locking stops the VMs from starting
> > elsewhere, and everything recovers very nicely when the host starts
> responding again.
> >
> > We are on 4.9.3 and haven't started testing with 4.11 yet, but it may
> > work along-side IPMI just fine - it would just have affect the fencing.
> > However, we *currently* prefer how we are doing it now, because if the
> > agent stops responding, but the host is still up, the VMs continue
> > running and no actual downtime is incurred.  Even when VM HA attempts
> > to power on the VMs on another host, it just fails the power-up and
> > the VMs continue to run on the "agent disconnected" host. The host
> > goes into alarm state and our NOC can look into what is wrong the
> > agent on the host.  If IPMI was enabled, it sounds like it would power
> > off the host (fence) and force downtime for us even if the VMs were
> > actually running OK - and just the agent is unreachable.
> >
> > I plan on submitting our updates via a pull request at some point.
> > But I can also send the updated code to anyone that wants to do some
> > testing before then.
> >
> > -----Original Message-----
> > From: Marcus [mailto:shadowsor@gmail.com]
> > Sent: Friday, February 16, 2018 11:27 AM
> > To: dev@cloudstack.apache.org
> > Subject: Re: HA issues
> >
> > From your other emails it sounds as though you do not have IPMI
> > configured, nor host HA enabled, correct? In this case, the correct
> > thing to do is nothing. If CloudStack cannot guarantee the VM state
> > (as is the case with an unreachable hypervisor), it should do nothing,
> > for fear of causing a split brain and corrupting the VM disk (VM running
> on two hosts).
> >
> > Clustering and fencing is a tricky proposition. When CloudStack (or
> > any other cluster manager) is not configured to or cannot guarantee
> > state then things will simply lock up, in this case your HA VM on your
> > broken hypervisor will not run elsewhere. This has been the case for a
> > long time with CloudStack, HA would only start a VM after the original
> > hypervisor agent came back and reported no VM is running.
> >
> > The new feature, from what I gather, simply adds the possibility of
> > CloudStack being able to reach out and shut down the hypervisor to
> > guarantee state. At that point it can start the VM elsewhere. If
> > something fails in that process (IPMI unreachable, for example, or bad
> > credentials), you're still going to be stuck with a VM not coming back.
> >
> > It's the nature of the thing. I'd be wary of any HA solution that does
> > not reach out and guarantee state via host or storage fencing before
> > starting a VM elsewhere, as it will be making assumptions. Its
> > entirely possible a VM might be unreachable or unable to access it
> > storage for a short while, a new instance of the VM is started
> elsewhere, and the original VM comes back.
> >
> > On Wed, Jan 17, 2018 at 9:02 AM Nux! <nu...@li.nux.ro> wrote:
> >
> > > Hi Rohit,
> > >
> > > I've reinstalled and tested. Still no go with VM HA.
> > >
> > > What I did was to kernel panic that particular HV ("echo c >
> > > /proc/sysrq-trigger" <- this is a proper way to simulate a crash).
> > > What happened next is the HV got marked as "Alert", the VM on it was
> > > all the time marked as "Running" and it was not migrated to another HV.
> > > Once the panicked HV has booted back the VM reboots and becomes
> > available.
> > >
> > > I'm running on CentOS 7 mgmt + HVs and NFS primary and secondary
> storage.
> > > The VM has HA enabled service offering.
> > > Host HA or OOBM configuration was not touched.
> > >
> > > Full log http://tmp.nux.ro/W3s-management-server.log
> > >
> > > --
> > > Sent from the Delta quadrant using Borg technology!
> > >
> > > Nux!
> > > www.nux.ro
> > >
> > > ----- Original Message -----
> > > > From: "Rohit Yadav" <ro...@shapeblue.com>
> > > > To: "dev" <de...@cloudstack.apache.org>
> > > > Sent: Wednesday, 17 January, 2018 12:13:33
> > > > Subject: Re: HA issues
> > >
> > > > I performed VM HA sanity checks and was not able to reproduce any
> > > regression
> > > > against two KVM CentOS7 hosts in a cluster.
> > > >
> > > >
> > > > Without the "Host HA" feature, I deployed few HA-enabled VMs on a
> > > > KVM
> > > host2 and
> > > > killed it (powered off). After few minutes of CloudStack
> > > > attempting to
> > > find why
> > > > the host (kvm agent) timed out, CloudStack kicked investigators,
> > > > that eventually led KVM fencers to work and VM HA job kicked to
> > > > start those
> > > few VMs
> > > > on host1 and the KVM host2 was put to "Down" state.
> > > >
> > > >
> > > > - Rohit
> > > >
> > > > <https://cloudstack.apache.org>
> > > >
> > > >
> > > >
> > > > ________________________________
> > > >
> > > > rohit.yadav@shapeblue.com
> > > > www.shapeblue.com
> > > > 53 Chandos Place, Covent Garden, London  WC2N 4HSUK @shapeblue
> > > >
> > > >
> > > >
> > > > From: Rohit Yadav
> > > > Sent: Wednesday, January 17, 2018 2:39:19 PM
> > > > To: dev
> > > > Subject: Re: HA issues
> > > >
> > > >
> > > > Hi Lucian,
> > > >
> > > >
> > > > The "Host HA" feature is entirely different from VM HA, however,
> > > > they
> > > may work
> > > > in tandem, so please stop using the terms interchangeably as it
> > > > may
> > > cause the
> > > > community to believe a regression has been caused.
> > > >
> > > >
> > > > The "Host HA" feature currently ships with only "Host HA" provider
> > > > for
> > > KVM that
> > > > is strictly tied to out-of-band management (IPMI for fencing, i.e
> > > > power
> > > off and
> > > > recovery, i.e. reboot) and NFS (as primary storage). (We also have
> > > > a
> > > provider
> > > > for simulator, but that's for coverage/testing purposes).
> > > >
> > > >
> > > > Therefore, "Host HA" for KVM (+nfs) currently works only when OOBM
> > > > is
> > > enabled.
> > > > The frameowkr allows interested parties may write their own HA
> > providers
> > > for a
> > > > hypervisor that can use a different strategy/mechanism for
> > > fencing/recovery of
> > > > hosts (including write a non-IPMI based OOBM plugin) and host/disk
> > > activity
> > > > checker that is non-NFS based.
> > > >
> > > >
> > > > The "Host HA" feature ships disabled by default and does not cause
> > > > any interference with VM HA. However, when enabled and configured
> > correctly,
> > > it is
> > > > a known limitation that when it is unable to successfully perform
> > > recovery or
> > > > fencing tasks it may not trigger VM HA. We can discuss how to
> > > > handle
> > > such cases
> > > > (thoughts?). "Host HA" would try couple of times to recover and
> > > > failing
> > > to do
> > > > so, it would eventually trigger a host fencing task. If it's
> > > > unable to
> > > fence a
> > > > host, it will indefinitely attempt to fence the host (the host
> > > > state
> > > will be
> > > > stuck at fencing state in cloud.ha_config table for example) and
> > > > alerts
> > > will be
> > > > sent to admin who can do some manual intervention to handle such
> > > situations (if
> > > > you've email/smtp enabled, you should see alert emails).
> > > >
> > > >
> > > > We can discuss how to improve and have a workaround for the case
> > > > you've
> > > hit,
> > > > thanks for sharing.
> > > >
> > > >
> > > > - Rohit
> > > >
> > > > ________________________________
> > > > From: Nux! <nu...@li.nux.ro>
> > > > Sent: Tuesday, January 16, 2018 10:42:35 PM
> > > > To: dev
> > > > Subject: Re: HA issues
> > > >
> > > > Ok, reinstalled and re-tested.
> > > >
> > > > What I've learned:
> > > >
> > > > - HA only works now if OOB is configured, the old way HA no longer
> > > applies -
> > > > this can be good and bad, not everyone has IPMIs
> > > >
> > > > - HA only works if IPMI is reachable. I've pulled the cord on a HV
> > > > and
> > > HA failed
> > > > to do its thing, leaving me with a HV down along with all the VMs
> > running
> > > > there. That's bad.
> > > > I've opened this ticket for it:
> > > > https://issues.apache.org/jira/browse/CLOUDSTACK-10234
> > > >
> > > > Let me know if you need any extra info or stuff to test.
> > > >
> > > > Regards,
> > > > Lucian
> > > >
> > > > --
> > > > Sent from the Delta quadrant using Borg technology!
> > > >
> > > > Nux!
> > > > www.nux.ro
> > > >
> > > > ----- Original Message -----
> > > >> From: "Nux!" <nu...@li.nux.ro>
> > > >> To: "dev" <de...@cloudstack.apache.org>
> > > >> Sent: Tuesday, 16 January, 2018 11:35:58
> > > >> Subject: Re: HA issues
> > > >
> > > >> I'll reinstall my setup and try again, just to be sure I'm
> > > >> working on
> > a
> > > clean
> > > >> slate.
> > > >>
> > > >> --
> > > >> Sent from the Delta quadrant using Borg technology!
> > > >>
> > > >> Nux!
> > > >> www.nux.ro
> > > >>
> > > >> ----- Original Message -----
> > > >>> From: "Rohit Yadav" <ro...@shapeblue.com>
> > > >>> To: "dev" <de...@cloudstack.apache.org>
> > > >>> Sent: Tuesday, 16 January, 2018 11:29:51
> > > >>> Subject: Re: HA issues
> > > >>
> > > >>> Hi Lucian,
> > > >>>
> > > >>>
> > > >>> If you're talking about the new HostHA feature (with
> > > >>> KVM+nfs+ipmi),
> > > please refer
> > > >>> to following docs:
> > > >>>
> > > >>>
> > > http://docs.cloudstack.apache.org/projects/cloudstack-
> > administration/en/latest/hosts.html#out-of-band-management
> > > >>>
> > > >>> https://cwiki.apache.org/confluence/display/CLOUDSTACK/Host+HA
> > > >>>
> > > >>>
> > > >>> We'll need to you look at logs perhaps create a JIRA ticket with
> > > >>> the
> > > logs and
> > > >>> details? If you saw ipmi based reboot, then host-ha indeed tried
> > > >>> to
> > > recover
> > > >>> i.e. reboot the host, once hostha has done its work it would
> > > >>> schedule
> > > HA for VM
> > > >>> as soon as the recovery operation succeeds (we've simulator and
> > > >>> kvm
> > > based
> > > >>> marvin tests for such scenarios).
> > > >>>
> > > >>>
> > > >>> Can you see it making attempt to schedule VM ha in logs, or any
> > > failure?
> > > >>>
> > > >>>
> > > >>> - Rohit
> > > >>>
> > > >>> <https://cloudstack.apache.org>
> > > >>>
> > > >>>
> > > >>>
> > > >>> ________________________________
> > > >>> From: Nux! <nu...@li.nux.ro>
> > > >>> Sent: Tuesday, January 16, 2018 12:47:56 AM
> > > >>> To: dev
> > > >>> Subject: [4.11] HA issues
> > > >>>
> > > >>> Hi,
> > > >>>
> > > >>> I see there's a new HA engine for KVM and IPMI support which is
> > really
> > > nice,
> > > >>> however it seems hit and miss.
> > > >>> I have created an instance with HA offering, kernel panicked one
> > > >>> of
> > the
> > > >>> hypervisors - after a while the server was rebooted via IPMI
> > probably,
> > > but the
> > > >>> instance never moved to a running hypervisor and even after the
> > > >>> o
> > > <https://maps.google.com/?q=to+a+running+hypervisor+and+
> > even+after+the+o&entry=gmail&source=g>
> > > riginal
> > > >>> hypervisor came back it was still left in Stopped state.
> > > >>> Is there any extra things I need to set up to have proper HA?
> > > >>>
> > > >>> Regards,
> > > >>> Lucian
> > > >>>
> > > >>> --
> > > >>> Sent from the Delta quadrant using Borg technology!
> > > >>>
> > > >>> Nux!
> > > >>> www.nux.ro
> > > >>>
> > > >>> rohit.yadav@shapeblue.com
> > > >>> www.shapeblue.com<http://www.shapeblue.com>
> > > >>> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK
> > > > > > @shapeblue
> > >
> >
>
>
>
> --
>
> Andrija Panić
>



-- 

Andrija Panić

RE: HA issues

Posted by Sean Lair <sl...@ippathways.com>.
Hi Andrija

We are currently running XenServer in production.  We are working on moving to KVM and have it deployed in a development environment.

The team is putting CloudStack + KVM through its paces and that is when it was discovered how broken VM HA is in 4.9.3.  Initially our patches fixed VM HA, but just caused VMs to get started on two hosts during failure testing.  The libvirt lockd has solved that issue thus far.

Short answer to you question is :-), we were not having problems with Agent Disconnects in a production environment.  It was our testing/QA that revealed the issues.  Our NFS has been stable so far, no issues with the agent crashing/stopping that wasn't initiated by the team's testing.

Thanks
Sean


-----Original Message-----
From: Andrija Panic [mailto:andrija.panic@gmail.com] 
Sent: Saturday, February 17, 2018 1:49 PM
To: dev <de...@cloudstack.apache.org>
Subject: Re: HA issues

Hi Sean,

(we have 2 threads interleaving on the libvirt lockd..) - so, did you manage to understand what can cause the Agent Disconnect in most cases, for you specifically? Is there any software (CloudStack) root cause (disregarding i.e. networking issues etc)

Just our examples, which you should probably not have:

We had CEPH cluster running (with ACS), and there any exception in librbd would crash JVM and the agent, but this has been fixed mostly - Now get i.e. agent disconnect when ACS try to delete volume on CEPH (and for some reason not succeed withing 30 minutes, volume deletion fails) - then libvirt get's completety stuck (virsh list even dont work)...so  agent get's disconnect eventually.

It would be good to get rid of agent disconnections in general, obviously
:) so that is why I'm asking (you are on NFS, so would like to see your experience here).

Thanks

On 16 February 2018 at 21:52, Sean Lair <sl...@ippathways.com> wrote:

> We were in the same situation as Nux.
>
> In our test environment we hit the issue with VMs not getting fenced and
> coming up on two hosts because of VM HA.   However, we updated some of the
> logic for VM HA and turned on libvirtd's locking mechanism.  Now we 
> are working great w/o IPMI.  The locking stops the VMs from starting 
> elsewhere, and everything recovers very nicely when the host starts responding again.
>
> We are on 4.9.3 and haven't started testing with 4.11 yet, but it may 
> work along-side IPMI just fine - it would just have affect the fencing.
> However, we *currently* prefer how we are doing it now, because if the 
> agent stops responding, but the host is still up, the VMs continue 
> running and no actual downtime is incurred.  Even when VM HA attempts 
> to power on the VMs on another host, it just fails the power-up and 
> the VMs continue to run on the "agent disconnected" host. The host 
> goes into alarm state and our NOC can look into what is wrong the 
> agent on the host.  If IPMI was enabled, it sounds like it would power 
> off the host (fence) and force downtime for us even if the VMs were 
> actually running OK - and just the agent is unreachable.
>
> I plan on submitting our updates via a pull request at some point.  
> But I can also send the updated code to anyone that wants to do some 
> testing before then.
>
> -----Original Message-----
> From: Marcus [mailto:shadowsor@gmail.com]
> Sent: Friday, February 16, 2018 11:27 AM
> To: dev@cloudstack.apache.org
> Subject: Re: HA issues
>
> From your other emails it sounds as though you do not have IPMI 
> configured, nor host HA enabled, correct? In this case, the correct 
> thing to do is nothing. If CloudStack cannot guarantee the VM state 
> (as is the case with an unreachable hypervisor), it should do nothing, 
> for fear of causing a split brain and corrupting the VM disk (VM running on two hosts).
>
> Clustering and fencing is a tricky proposition. When CloudStack (or 
> any other cluster manager) is not configured to or cannot guarantee 
> state then things will simply lock up, in this case your HA VM on your 
> broken hypervisor will not run elsewhere. This has been the case for a 
> long time with CloudStack, HA would only start a VM after the original 
> hypervisor agent came back and reported no VM is running.
>
> The new feature, from what I gather, simply adds the possibility of 
> CloudStack being able to reach out and shut down the hypervisor to 
> guarantee state. At that point it can start the VM elsewhere. If 
> something fails in that process (IPMI unreachable, for example, or bad 
> credentials), you're still going to be stuck with a VM not coming back.
>
> It's the nature of the thing. I'd be wary of any HA solution that does 
> not reach out and guarantee state via host or storage fencing before 
> starting a VM elsewhere, as it will be making assumptions. Its 
> entirely possible a VM might be unreachable or unable to access it 
> storage for a short while, a new instance of the VM is started elsewhere, and the original VM comes back.
>
> On Wed, Jan 17, 2018 at 9:02 AM Nux! <nu...@li.nux.ro> wrote:
>
> > Hi Rohit,
> >
> > I've reinstalled and tested. Still no go with VM HA.
> >
> > What I did was to kernel panic that particular HV ("echo c > 
> > /proc/sysrq-trigger" <- this is a proper way to simulate a crash).
> > What happened next is the HV got marked as "Alert", the VM on it was 
> > all the time marked as "Running" and it was not migrated to another HV.
> > Once the panicked HV has booted back the VM reboots and becomes
> available.
> >
> > I'm running on CentOS 7 mgmt + HVs and NFS primary and secondary storage.
> > The VM has HA enabled service offering.
> > Host HA or OOBM configuration was not touched.
> >
> > Full log http://tmp.nux.ro/W3s-management-server.log
> >
> > --
> > Sent from the Delta quadrant using Borg technology!
> >
> > Nux!
> > www.nux.ro
> >
> > ----- Original Message -----
> > > From: "Rohit Yadav" <ro...@shapeblue.com>
> > > To: "dev" <de...@cloudstack.apache.org>
> > > Sent: Wednesday, 17 January, 2018 12:13:33
> > > Subject: Re: HA issues
> >
> > > I performed VM HA sanity checks and was not able to reproduce any
> > regression
> > > against two KVM CentOS7 hosts in a cluster.
> > >
> > >
> > > Without the "Host HA" feature, I deployed few HA-enabled VMs on a 
> > > KVM
> > host2 and
> > > killed it (powered off). After few minutes of CloudStack 
> > > attempting to
> > find why
> > > the host (kvm agent) timed out, CloudStack kicked investigators, 
> > > that eventually led KVM fencers to work and VM HA job kicked to 
> > > start those
> > few VMs
> > > on host1 and the KVM host2 was put to "Down" state.
> > >
> > >
> > > - Rohit
> > >
> > > <https://cloudstack.apache.org>
> > >
> > >
> > >
> > > ________________________________
> > >
> > > rohit.yadav@shapeblue.com
> > > www.shapeblue.com
> > > 53 Chandos Place, Covent Garden, London  WC2N 4HSUK @shapeblue
> > >
> > >
> > >
> > > From: Rohit Yadav
> > > Sent: Wednesday, January 17, 2018 2:39:19 PM
> > > To: dev
> > > Subject: Re: HA issues
> > >
> > >
> > > Hi Lucian,
> > >
> > >
> > > The "Host HA" feature is entirely different from VM HA, however, 
> > > they
> > may work
> > > in tandem, so please stop using the terms interchangeably as it 
> > > may
> > cause the
> > > community to believe a regression has been caused.
> > >
> > >
> > > The "Host HA" feature currently ships with only "Host HA" provider 
> > > for
> > KVM that
> > > is strictly tied to out-of-band management (IPMI for fencing, i.e 
> > > power
> > off and
> > > recovery, i.e. reboot) and NFS (as primary storage). (We also have 
> > > a
> > provider
> > > for simulator, but that's for coverage/testing purposes).
> > >
> > >
> > > Therefore, "Host HA" for KVM (+nfs) currently works only when OOBM 
> > > is
> > enabled.
> > > The frameowkr allows interested parties may write their own HA
> providers
> > for a
> > > hypervisor that can use a different strategy/mechanism for
> > fencing/recovery of
> > > hosts (including write a non-IPMI based OOBM plugin) and host/disk
> > activity
> > > checker that is non-NFS based.
> > >
> > >
> > > The "Host HA" feature ships disabled by default and does not cause 
> > > any interference with VM HA. However, when enabled and configured
> correctly,
> > it is
> > > a known limitation that when it is unable to successfully perform
> > recovery or
> > > fencing tasks it may not trigger VM HA. We can discuss how to 
> > > handle
> > such cases
> > > (thoughts?). "Host HA" would try couple of times to recover and 
> > > failing
> > to do
> > > so, it would eventually trigger a host fencing task. If it's 
> > > unable to
> > fence a
> > > host, it will indefinitely attempt to fence the host (the host 
> > > state
> > will be
> > > stuck at fencing state in cloud.ha_config table for example) and 
> > > alerts
> > will be
> > > sent to admin who can do some manual intervention to handle such
> > situations (if
> > > you've email/smtp enabled, you should see alert emails).
> > >
> > >
> > > We can discuss how to improve and have a workaround for the case 
> > > you've
> > hit,
> > > thanks for sharing.
> > >
> > >
> > > - Rohit
> > >
> > > ________________________________
> > > From: Nux! <nu...@li.nux.ro>
> > > Sent: Tuesday, January 16, 2018 10:42:35 PM
> > > To: dev
> > > Subject: Re: HA issues
> > >
> > > Ok, reinstalled and re-tested.
> > >
> > > What I've learned:
> > >
> > > - HA only works now if OOB is configured, the old way HA no longer
> > applies -
> > > this can be good and bad, not everyone has IPMIs
> > >
> > > - HA only works if IPMI is reachable. I've pulled the cord on a HV 
> > > and
> > HA failed
> > > to do its thing, leaving me with a HV down along with all the VMs
> running
> > > there. That's bad.
> > > I've opened this ticket for it:
> > > https://issues.apache.org/jira/browse/CLOUDSTACK-10234
> > >
> > > Let me know if you need any extra info or stuff to test.
> > >
> > > Regards,
> > > Lucian
> > >
> > > --
> > > Sent from the Delta quadrant using Borg technology!
> > >
> > > Nux!
> > > www.nux.ro
> > >
> > > ----- Original Message -----
> > >> From: "Nux!" <nu...@li.nux.ro>
> > >> To: "dev" <de...@cloudstack.apache.org>
> > >> Sent: Tuesday, 16 January, 2018 11:35:58
> > >> Subject: Re: HA issues
> > >
> > >> I'll reinstall my setup and try again, just to be sure I'm 
> > >> working on
> a
> > clean
> > >> slate.
> > >>
> > >> --
> > >> Sent from the Delta quadrant using Borg technology!
> > >>
> > >> Nux!
> > >> www.nux.ro
> > >>
> > >> ----- Original Message -----
> > >>> From: "Rohit Yadav" <ro...@shapeblue.com>
> > >>> To: "dev" <de...@cloudstack.apache.org>
> > >>> Sent: Tuesday, 16 January, 2018 11:29:51
> > >>> Subject: Re: HA issues
> > >>
> > >>> Hi Lucian,
> > >>>
> > >>>
> > >>> If you're talking about the new HostHA feature (with 
> > >>> KVM+nfs+ipmi),
> > please refer
> > >>> to following docs:
> > >>>
> > >>>
> > http://docs.cloudstack.apache.org/projects/cloudstack-
> administration/en/latest/hosts.html#out-of-band-management
> > >>>
> > >>> https://cwiki.apache.org/confluence/display/CLOUDSTACK/Host+HA
> > >>>
> > >>>
> > >>> We'll need to you look at logs perhaps create a JIRA ticket with 
> > >>> the
> > logs and
> > >>> details? If you saw ipmi based reboot, then host-ha indeed tried 
> > >>> to
> > recover
> > >>> i.e. reboot the host, once hostha has done its work it would 
> > >>> schedule
> > HA for VM
> > >>> as soon as the recovery operation succeeds (we've simulator and 
> > >>> kvm
> > based
> > >>> marvin tests for such scenarios).
> > >>>
> > >>>
> > >>> Can you see it making attempt to schedule VM ha in logs, or any
> > failure?
> > >>>
> > >>>
> > >>> - Rohit
> > >>>
> > >>> <https://cloudstack.apache.org>
> > >>>
> > >>>
> > >>>
> > >>> ________________________________
> > >>> From: Nux! <nu...@li.nux.ro>
> > >>> Sent: Tuesday, January 16, 2018 12:47:56 AM
> > >>> To: dev
> > >>> Subject: [4.11] HA issues
> > >>>
> > >>> Hi,
> > >>>
> > >>> I see there's a new HA engine for KVM and IPMI support which is
> really
> > nice,
> > >>> however it seems hit and miss.
> > >>> I have created an instance with HA offering, kernel panicked one 
> > >>> of
> the
> > >>> hypervisors - after a while the server was rebooted via IPMI
> probably,
> > but the
> > >>> instance never moved to a running hypervisor and even after the 
> > >>> o
> > <https://maps.google.com/?q=to+a+running+hypervisor+and+
> even+after+the+o&entry=gmail&source=g>
> > riginal
> > >>> hypervisor came back it was still left in Stopped state.
> > >>> Is there any extra things I need to set up to have proper HA?
> > >>>
> > >>> Regards,
> > >>> Lucian
> > >>>
> > >>> --
> > >>> Sent from the Delta quadrant using Borg technology!
> > >>>
> > >>> Nux!
> > >>> www.nux.ro
> > >>>
> > >>> rohit.yadav@shapeblue.com
> > >>> www.shapeblue.com<http://www.shapeblue.com>
> > >>> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK
> > > > > @shapeblue
> >
>



-- 

Andrija Panić

Re: HA issues

Posted by Andrija Panic <an...@gmail.com>.
Hi Sean,

(we have 2 threads interleaving on the libvirt lockd..) - so, did you
manage to understand what can cause the Agent Disconnect in most cases, for
you specifically? Is there any software (CloudStack) root cause
(disregarding i.e. networking issues etc)

Just our examples, which you should probably not have:

We had CEPH cluster running (with ACS), and there any exception in librbd
would crash JVM and the agent, but this has been fixed mostly -
Now get i.e. agent disconnect when ACS try to delete volume on CEPH (and
for some reason not succeed withing 30 minutes, volume deletion fails) -
then libvirt get's completety stuck (virsh list even dont work)...so  agent
get's disconnect eventually.

It would be good to get rid of agent disconnections in general, obviously
:) so that is why I'm asking (you are on NFS, so would like to see your
experience here).

Thanks

On 16 February 2018 at 21:52, Sean Lair <sl...@ippathways.com> wrote:

> We were in the same situation as Nux.
>
> In our test environment we hit the issue with VMs not getting fenced and
> coming up on two hosts because of VM HA.   However, we updated some of the
> logic for VM HA and turned on libvirtd's locking mechanism.  Now we are
> working great w/o IPMI.  The locking stops the VMs from starting elsewhere,
> and everything recovers very nicely when the host starts responding again.
>
> We are on 4.9.3 and haven't started testing with 4.11 yet, but it may work
> along-side IPMI just fine - it would just have affect the fencing.
> However, we *currently* prefer how we are doing it now, because if the
> agent stops responding, but the host is still up, the VMs continue running
> and no actual downtime is incurred.  Even when VM HA attempts to power on
> the VMs on another host, it just fails the power-up and the VMs continue to
> run on the "agent disconnected" host. The host goes into alarm state and
> our NOC can look into what is wrong the agent on the host.  If IPMI was
> enabled, it sounds like it would power off the host (fence) and force
> downtime for us even if the VMs were actually running OK - and just the
> agent is unreachable.
>
> I plan on submitting our updates via a pull request at some point.  But I
> can also send the updated code to anyone that wants to do some testing
> before then.
>
> -----Original Message-----
> From: Marcus [mailto:shadowsor@gmail.com]
> Sent: Friday, February 16, 2018 11:27 AM
> To: dev@cloudstack.apache.org
> Subject: Re: HA issues
>
> From your other emails it sounds as though you do not have IPMI
> configured, nor host HA enabled, correct? In this case, the correct thing
> to do is nothing. If CloudStack cannot guarantee the VM state (as is the
> case with an unreachable hypervisor), it should do nothing, for fear of
> causing a split brain and corrupting the VM disk (VM running on two hosts).
>
> Clustering and fencing is a tricky proposition. When CloudStack (or any
> other cluster manager) is not configured to or cannot guarantee state then
> things will simply lock up, in this case your HA VM on your broken
> hypervisor will not run elsewhere. This has been the case for a long time
> with CloudStack, HA would only start a VM after the original hypervisor
> agent came back and reported no VM is running.
>
> The new feature, from what I gather, simply adds the possibility of
> CloudStack being able to reach out and shut down the hypervisor to
> guarantee state. At that point it can start the VM elsewhere. If something
> fails in that process (IPMI unreachable, for example, or bad credentials),
> you're still going to be stuck with a VM not coming back.
>
> It's the nature of the thing. I'd be wary of any HA solution that does not
> reach out and guarantee state via host or storage fencing before starting a
> VM elsewhere, as it will be making assumptions. Its entirely possible a VM
> might be unreachable or unable to access it storage for a short while, a
> new instance of the VM is started elsewhere, and the original VM comes back.
>
> On Wed, Jan 17, 2018 at 9:02 AM Nux! <nu...@li.nux.ro> wrote:
>
> > Hi Rohit,
> >
> > I've reinstalled and tested. Still no go with VM HA.
> >
> > What I did was to kernel panic that particular HV ("echo c >
> > /proc/sysrq-trigger" <- this is a proper way to simulate a crash).
> > What happened next is the HV got marked as "Alert", the VM on it was
> > all the time marked as "Running" and it was not migrated to another HV.
> > Once the panicked HV has booted back the VM reboots and becomes
> available.
> >
> > I'm running on CentOS 7 mgmt + HVs and NFS primary and secondary storage.
> > The VM has HA enabled service offering.
> > Host HA or OOBM configuration was not touched.
> >
> > Full log http://tmp.nux.ro/W3s-management-server.log
> >
> > --
> > Sent from the Delta quadrant using Borg technology!
> >
> > Nux!
> > www.nux.ro
> >
> > ----- Original Message -----
> > > From: "Rohit Yadav" <ro...@shapeblue.com>
> > > To: "dev" <de...@cloudstack.apache.org>
> > > Sent: Wednesday, 17 January, 2018 12:13:33
> > > Subject: Re: HA issues
> >
> > > I performed VM HA sanity checks and was not able to reproduce any
> > regression
> > > against two KVM CentOS7 hosts in a cluster.
> > >
> > >
> > > Without the "Host HA" feature, I deployed few HA-enabled VMs on a
> > > KVM
> > host2 and
> > > killed it (powered off). After few minutes of CloudStack attempting
> > > to
> > find why
> > > the host (kvm agent) timed out, CloudStack kicked investigators,
> > > that eventually led KVM fencers to work and VM HA job kicked to
> > > start those
> > few VMs
> > > on host1 and the KVM host2 was put to "Down" state.
> > >
> > >
> > > - Rohit
> > >
> > > <https://cloudstack.apache.org>
> > >
> > >
> > >
> > > ________________________________
> > >
> > > rohit.yadav@shapeblue.com
> > > www.shapeblue.com
> > > 53 Chandos Place, Covent Garden, London  WC2N 4HSUK @shapeblue
> > >
> > >
> > >
> > > From: Rohit Yadav
> > > Sent: Wednesday, January 17, 2018 2:39:19 PM
> > > To: dev
> > > Subject: Re: HA issues
> > >
> > >
> > > Hi Lucian,
> > >
> > >
> > > The "Host HA" feature is entirely different from VM HA, however, they
> > may work
> > > in tandem, so please stop using the terms interchangeably as it may
> > cause the
> > > community to believe a regression has been caused.
> > >
> > >
> > > The "Host HA" feature currently ships with only "Host HA" provider for
> > KVM that
> > > is strictly tied to out-of-band management (IPMI for fencing, i.e power
> > off and
> > > recovery, i.e. reboot) and NFS (as primary storage). (We also have a
> > provider
> > > for simulator, but that's for coverage/testing purposes).
> > >
> > >
> > > Therefore, "Host HA" for KVM (+nfs) currently works only when OOBM is
> > enabled.
> > > The frameowkr allows interested parties may write their own HA
> providers
> > for a
> > > hypervisor that can use a different strategy/mechanism for
> > fencing/recovery of
> > > hosts (including write a non-IPMI based OOBM plugin) and host/disk
> > activity
> > > checker that is non-NFS based.
> > >
> > >
> > > The "Host HA" feature ships disabled by default and does not cause any
> > > interference with VM HA. However, when enabled and configured
> correctly,
> > it is
> > > a known limitation that when it is unable to successfully perform
> > recovery or
> > > fencing tasks it may not trigger VM HA. We can discuss how to handle
> > such cases
> > > (thoughts?). "Host HA" would try couple of times to recover and failing
> > to do
> > > so, it would eventually trigger a host fencing task. If it's unable to
> > fence a
> > > host, it will indefinitely attempt to fence the host (the host state
> > will be
> > > stuck at fencing state in cloud.ha_config table for example) and alerts
> > will be
> > > sent to admin who can do some manual intervention to handle such
> > situations (if
> > > you've email/smtp enabled, you should see alert emails).
> > >
> > >
> > > We can discuss how to improve and have a workaround for the case you've
> > hit,
> > > thanks for sharing.
> > >
> > >
> > > - Rohit
> > >
> > > ________________________________
> > > From: Nux! <nu...@li.nux.ro>
> > > Sent: Tuesday, January 16, 2018 10:42:35 PM
> > > To: dev
> > > Subject: Re: HA issues
> > >
> > > Ok, reinstalled and re-tested.
> > >
> > > What I've learned:
> > >
> > > - HA only works now if OOB is configured, the old way HA no longer
> > applies -
> > > this can be good and bad, not everyone has IPMIs
> > >
> > > - HA only works if IPMI is reachable. I've pulled the cord on a HV and
> > HA failed
> > > to do its thing, leaving me with a HV down along with all the VMs
> running
> > > there. That's bad.
> > > I've opened this ticket for it:
> > > https://issues.apache.org/jira/browse/CLOUDSTACK-10234
> > >
> > > Let me know if you need any extra info or stuff to test.
> > >
> > > Regards,
> > > Lucian
> > >
> > > --
> > > Sent from the Delta quadrant using Borg technology!
> > >
> > > Nux!
> > > www.nux.ro
> > >
> > > ----- Original Message -----
> > >> From: "Nux!" <nu...@li.nux.ro>
> > >> To: "dev" <de...@cloudstack.apache.org>
> > >> Sent: Tuesday, 16 January, 2018 11:35:58
> > >> Subject: Re: HA issues
> > >
> > >> I'll reinstall my setup and try again, just to be sure I'm working on
> a
> > clean
> > >> slate.
> > >>
> > >> --
> > >> Sent from the Delta quadrant using Borg technology!
> > >>
> > >> Nux!
> > >> www.nux.ro
> > >>
> > >> ----- Original Message -----
> > >>> From: "Rohit Yadav" <ro...@shapeblue.com>
> > >>> To: "dev" <de...@cloudstack.apache.org>
> > >>> Sent: Tuesday, 16 January, 2018 11:29:51
> > >>> Subject: Re: HA issues
> > >>
> > >>> Hi Lucian,
> > >>>
> > >>>
> > >>> If you're talking about the new HostHA feature (with KVM+nfs+ipmi),
> > please refer
> > >>> to following docs:
> > >>>
> > >>>
> > http://docs.cloudstack.apache.org/projects/cloudstack-
> administration/en/latest/hosts.html#out-of-band-management
> > >>>
> > >>> https://cwiki.apache.org/confluence/display/CLOUDSTACK/Host+HA
> > >>>
> > >>>
> > >>> We'll need to you look at logs perhaps create a JIRA ticket with the
> > logs and
> > >>> details? If you saw ipmi based reboot, then host-ha indeed tried to
> > recover
> > >>> i.e. reboot the host, once hostha has done its work it would schedule
> > HA for VM
> > >>> as soon as the recovery operation succeeds (we've simulator and kvm
> > based
> > >>> marvin tests for such scenarios).
> > >>>
> > >>>
> > >>> Can you see it making attempt to schedule VM ha in logs, or any
> > failure?
> > >>>
> > >>>
> > >>> - Rohit
> > >>>
> > >>> <https://cloudstack.apache.org>
> > >>>
> > >>>
> > >>>
> > >>> ________________________________
> > >>> From: Nux! <nu...@li.nux.ro>
> > >>> Sent: Tuesday, January 16, 2018 12:47:56 AM
> > >>> To: dev
> > >>> Subject: [4.11] HA issues
> > >>>
> > >>> Hi,
> > >>>
> > >>> I see there's a new HA engine for KVM and IPMI support which is
> really
> > nice,
> > >>> however it seems hit and miss.
> > >>> I have created an instance with HA offering, kernel panicked one of
> the
> > >>> hypervisors - after a while the server was rebooted via IPMI
> probably,
> > but the
> > >>> instance never moved to a running hypervisor and even after the o
> > <https://maps.google.com/?q=to+a+running+hypervisor+and+
> even+after+the+o&entry=gmail&source=g>
> > riginal
> > >>> hypervisor came back it was still left in Stopped state.
> > >>> Is there any extra things I need to set up to have proper HA?
> > >>>
> > >>> Regards,
> > >>> Lucian
> > >>>
> > >>> --
> > >>> Sent from the Delta quadrant using Borg technology!
> > >>>
> > >>> Nux!
> > >>> www.nux.ro
> > >>>
> > >>> rohit.yadav@shapeblue.com
> > >>> www.shapeblue.com<http://www.shapeblue.com>
> > >>> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK
> > > > > @shapeblue
> >
>



-- 

Andrija Panić

RE: HA issues

Posted by Sean Lair <sl...@ippathways.com>.
We were in the same situation as Nux.

In our test environment we hit the issue with VMs not getting fenced and coming up on two hosts because of VM HA.   However, we updated some of the logic for VM HA and turned on libvirtd's locking mechanism.  Now we are working great w/o IPMI.  The locking stops the VMs from starting elsewhere, and everything recovers very nicely when the host starts responding again.  

We are on 4.9.3 and haven't started testing with 4.11 yet, but it may work along-side IPMI just fine - it would just have affect the fencing.  However, we *currently* prefer how we are doing it now, because if the agent stops responding, but the host is still up, the VMs continue running and no actual downtime is incurred.  Even when VM HA attempts to power on the VMs on another host, it just fails the power-up and the VMs continue to run on the "agent disconnected" host. The host goes into alarm state and our NOC can look into what is wrong the agent on the host.  If IPMI was enabled, it sounds like it would power off the host (fence) and force downtime for us even if the VMs were actually running OK - and just the agent is unreachable.

I plan on submitting our updates via a pull request at some point.  But I can also send the updated code to anyone that wants to do some testing before then.

-----Original Message-----
From: Marcus [mailto:shadowsor@gmail.com] 
Sent: Friday, February 16, 2018 11:27 AM
To: dev@cloudstack.apache.org
Subject: Re: HA issues

From your other emails it sounds as though you do not have IPMI configured, nor host HA enabled, correct? In this case, the correct thing to do is nothing. If CloudStack cannot guarantee the VM state (as is the case with an unreachable hypervisor), it should do nothing, for fear of causing a split brain and corrupting the VM disk (VM running on two hosts).

Clustering and fencing is a tricky proposition. When CloudStack (or any other cluster manager) is not configured to or cannot guarantee state then things will simply lock up, in this case your HA VM on your broken hypervisor will not run elsewhere. This has been the case for a long time with CloudStack, HA would only start a VM after the original hypervisor agent came back and reported no VM is running.

The new feature, from what I gather, simply adds the possibility of CloudStack being able to reach out and shut down the hypervisor to guarantee state. At that point it can start the VM elsewhere. If something fails in that process (IPMI unreachable, for example, or bad credentials), you're still going to be stuck with a VM not coming back.

It's the nature of the thing. I'd be wary of any HA solution that does not reach out and guarantee state via host or storage fencing before starting a VM elsewhere, as it will be making assumptions. Its entirely possible a VM might be unreachable or unable to access it storage for a short while, a new instance of the VM is started elsewhere, and the original VM comes back.

On Wed, Jan 17, 2018 at 9:02 AM Nux! <nu...@li.nux.ro> wrote:

> Hi Rohit,
>
> I've reinstalled and tested. Still no go with VM HA.
>
> What I did was to kernel panic that particular HV ("echo c > 
> /proc/sysrq-trigger" <- this is a proper way to simulate a crash).
> What happened next is the HV got marked as "Alert", the VM on it was 
> all the time marked as "Running" and it was not migrated to another HV.
> Once the panicked HV has booted back the VM reboots and becomes available.
>
> I'm running on CentOS 7 mgmt + HVs and NFS primary and secondary storage.
> The VM has HA enabled service offering.
> Host HA or OOBM configuration was not touched.
>
> Full log http://tmp.nux.ro/W3s-management-server.log
>
> --
> Sent from the Delta quadrant using Borg technology!
>
> Nux!
> www.nux.ro
>
> ----- Original Message -----
> > From: "Rohit Yadav" <ro...@shapeblue.com>
> > To: "dev" <de...@cloudstack.apache.org>
> > Sent: Wednesday, 17 January, 2018 12:13:33
> > Subject: Re: HA issues
>
> > I performed VM HA sanity checks and was not able to reproduce any
> regression
> > against two KVM CentOS7 hosts in a cluster.
> >
> >
> > Without the "Host HA" feature, I deployed few HA-enabled VMs on a 
> > KVM
> host2 and
> > killed it (powered off). After few minutes of CloudStack attempting 
> > to
> find why
> > the host (kvm agent) timed out, CloudStack kicked investigators, 
> > that eventually led KVM fencers to work and VM HA job kicked to 
> > start those
> few VMs
> > on host1 and the KVM host2 was put to "Down" state.
> >
> >
> > - Rohit
> >
> > <https://cloudstack.apache.org>
> >
> >
> >
> > ________________________________
> >
> > rohit.yadav@shapeblue.com
> > www.shapeblue.com
> > 53 Chandos Place, Covent Garden, London  WC2N 4HSUK @shapeblue
> >
> >
> >
> > From: Rohit Yadav
> > Sent: Wednesday, January 17, 2018 2:39:19 PM
> > To: dev
> > Subject: Re: HA issues
> >
> >
> > Hi Lucian,
> >
> >
> > The "Host HA" feature is entirely different from VM HA, however, they
> may work
> > in tandem, so please stop using the terms interchangeably as it may
> cause the
> > community to believe a regression has been caused.
> >
> >
> > The "Host HA" feature currently ships with only "Host HA" provider for
> KVM that
> > is strictly tied to out-of-band management (IPMI for fencing, i.e power
> off and
> > recovery, i.e. reboot) and NFS (as primary storage). (We also have a
> provider
> > for simulator, but that's for coverage/testing purposes).
> >
> >
> > Therefore, "Host HA" for KVM (+nfs) currently works only when OOBM is
> enabled.
> > The frameowkr allows interested parties may write their own HA providers
> for a
> > hypervisor that can use a different strategy/mechanism for
> fencing/recovery of
> > hosts (including write a non-IPMI based OOBM plugin) and host/disk
> activity
> > checker that is non-NFS based.
> >
> >
> > The "Host HA" feature ships disabled by default and does not cause any
> > interference with VM HA. However, when enabled and configured correctly,
> it is
> > a known limitation that when it is unable to successfully perform
> recovery or
> > fencing tasks it may not trigger VM HA. We can discuss how to handle
> such cases
> > (thoughts?). "Host HA" would try couple of times to recover and failing
> to do
> > so, it would eventually trigger a host fencing task. If it's unable to
> fence a
> > host, it will indefinitely attempt to fence the host (the host state
> will be
> > stuck at fencing state in cloud.ha_config table for example) and alerts
> will be
> > sent to admin who can do some manual intervention to handle such
> situations (if
> > you've email/smtp enabled, you should see alert emails).
> >
> >
> > We can discuss how to improve and have a workaround for the case you've
> hit,
> > thanks for sharing.
> >
> >
> > - Rohit
> >
> > ________________________________
> > From: Nux! <nu...@li.nux.ro>
> > Sent: Tuesday, January 16, 2018 10:42:35 PM
> > To: dev
> > Subject: Re: HA issues
> >
> > Ok, reinstalled and re-tested.
> >
> > What I've learned:
> >
> > - HA only works now if OOB is configured, the old way HA no longer
> applies -
> > this can be good and bad, not everyone has IPMIs
> >
> > - HA only works if IPMI is reachable. I've pulled the cord on a HV and
> HA failed
> > to do its thing, leaving me with a HV down along with all the VMs running
> > there. That's bad.
> > I've opened this ticket for it:
> > https://issues.apache.org/jira/browse/CLOUDSTACK-10234
> >
> > Let me know if you need any extra info or stuff to test.
> >
> > Regards,
> > Lucian
> >
> > --
> > Sent from the Delta quadrant using Borg technology!
> >
> > Nux!
> > www.nux.ro
> >
> > ----- Original Message -----
> >> From: "Nux!" <nu...@li.nux.ro>
> >> To: "dev" <de...@cloudstack.apache.org>
> >> Sent: Tuesday, 16 January, 2018 11:35:58
> >> Subject: Re: HA issues
> >
> >> I'll reinstall my setup and try again, just to be sure I'm working on a
> clean
> >> slate.
> >>
> >> --
> >> Sent from the Delta quadrant using Borg technology!
> >>
> >> Nux!
> >> www.nux.ro
> >>
> >> ----- Original Message -----
> >>> From: "Rohit Yadav" <ro...@shapeblue.com>
> >>> To: "dev" <de...@cloudstack.apache.org>
> >>> Sent: Tuesday, 16 January, 2018 11:29:51
> >>> Subject: Re: HA issues
> >>
> >>> Hi Lucian,
> >>>
> >>>
> >>> If you're talking about the new HostHA feature (with KVM+nfs+ipmi),
> please refer
> >>> to following docs:
> >>>
> >>>
> http://docs.cloudstack.apache.org/projects/cloudstack-administration/en/latest/hosts.html#out-of-band-management
> >>>
> >>> https://cwiki.apache.org/confluence/display/CLOUDSTACK/Host+HA
> >>>
> >>>
> >>> We'll need to you look at logs perhaps create a JIRA ticket with the
> logs and
> >>> details? If you saw ipmi based reboot, then host-ha indeed tried to
> recover
> >>> i.e. reboot the host, once hostha has done its work it would schedule
> HA for VM
> >>> as soon as the recovery operation succeeds (we've simulator and kvm
> based
> >>> marvin tests for such scenarios).
> >>>
> >>>
> >>> Can you see it making attempt to schedule VM ha in logs, or any
> failure?
> >>>
> >>>
> >>> - Rohit
> >>>
> >>> <https://cloudstack.apache.org>
> >>>
> >>>
> >>>
> >>> ________________________________
> >>> From: Nux! <nu...@li.nux.ro>
> >>> Sent: Tuesday, January 16, 2018 12:47:56 AM
> >>> To: dev
> >>> Subject: [4.11] HA issues
> >>>
> >>> Hi,
> >>>
> >>> I see there's a new HA engine for KVM and IPMI support which is really
> nice,
> >>> however it seems hit and miss.
> >>> I have created an instance with HA offering, kernel panicked one of the
> >>> hypervisors - after a while the server was rebooted via IPMI probably,
> but the
> >>> instance never moved to a running hypervisor and even after the o
> <https://maps.google.com/?q=to+a+running+hypervisor+and+even+after+the+o&entry=gmail&source=g>
> riginal
> >>> hypervisor came back it was still left in Stopped state.
> >>> Is there any extra things I need to set up to have proper HA?
> >>>
> >>> Regards,
> >>> Lucian
> >>>
> >>> --
> >>> Sent from the Delta quadrant using Borg technology!
> >>>
> >>> Nux!
> >>> www.nux.ro
> >>>
> >>> rohit.yadav@shapeblue.com
> >>> www.shapeblue.com<http://www.shapeblue.com>
> >>> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK
> > > > @shapeblue
>

Re: HA issues

Posted by Marcus <sh...@gmail.com>.
From your other emails it sounds as though you do not have IPMI configured,
nor host HA enabled, correct? In this case, the correct thing to do is
nothing. If CloudStack cannot guarantee the VM state (as is the case with
an unreachable hypervisor), it should do nothing, for fear of causing a
split brain and corrupting the VM disk (VM running on two hosts).

Clustering and fencing is a tricky proposition. When CloudStack (or any
other cluster manager) is not configured to or cannot guarantee state then
things will simply lock up, in this case your HA VM on your broken
hypervisor will not run elsewhere. This has been the case for a long time
with CloudStack, HA would only start a VM after the original hypervisor
agent came back and reported no VM is running.

The new feature, from what I gather, simply adds the possibility of
CloudStack being able to reach out and shut down the hypervisor to
guarantee state. At that point it can start the VM elsewhere. If something
fails in that process (IPMI unreachable, for example, or bad credentials),
you're still going to be stuck with a VM not coming back.

It's the nature of the thing. I'd be wary of any HA solution that does not
reach out and guarantee state via host or storage fencing before starting a
VM elsewhere, as it will be making assumptions. Its entirely possible a VM
might be unreachable or unable to access it storage for a short while, a
new instance of the VM is started elsewhere, and the original VM comes back.

On Wed, Jan 17, 2018 at 9:02 AM Nux! <nu...@li.nux.ro> wrote:

> Hi Rohit,
>
> I've reinstalled and tested. Still no go with VM HA.
>
> What I did was to kernel panic that particular HV ("echo c >
> /proc/sysrq-trigger" <- this is a proper way to simulate a crash).
> What happened next is the HV got marked as "Alert", the VM on it was all
> the time marked as "Running" and it was not migrated to another HV.
> Once the panicked HV has booted back the VM reboots and becomes available.
>
> I'm running on CentOS 7 mgmt + HVs and NFS primary and secondary storage.
> The VM has HA enabled service offering.
> Host HA or OOBM configuration was not touched.
>
> Full log http://tmp.nux.ro/W3s-management-server.log
>
> --
> Sent from the Delta quadrant using Borg technology!
>
> Nux!
> www.nux.ro
>
> ----- Original Message -----
> > From: "Rohit Yadav" <ro...@shapeblue.com>
> > To: "dev" <de...@cloudstack.apache.org>
> > Sent: Wednesday, 17 January, 2018 12:13:33
> > Subject: Re: HA issues
>
> > I performed VM HA sanity checks and was not able to reproduce any
> regression
> > against two KVM CentOS7 hosts in a cluster.
> >
> >
> > Without the "Host HA" feature, I deployed few HA-enabled VMs on a KVM
> host2 and
> > killed it (powered off). After few minutes of CloudStack attempting to
> find why
> > the host (kvm agent) timed out, CloudStack kicked investigators, that
> > eventually led KVM fencers to work and VM HA job kicked to start those
> few VMs
> > on host1 and the KVM host2 was put to "Down" state.
> >
> >
> > - Rohit
> >
> > <https://cloudstack.apache.org>
> >
> >
> >
> > ________________________________
> >
> > rohit.yadav@shapeblue.com
> > www.shapeblue.com
> > 53 Chandos Place, Covent Garden, London  WC2N 4HSUK
> > @shapeblue
> >
> >
> >
> > From: Rohit Yadav
> > Sent: Wednesday, January 17, 2018 2:39:19 PM
> > To: dev
> > Subject: Re: HA issues
> >
> >
> > Hi Lucian,
> >
> >
> > The "Host HA" feature is entirely different from VM HA, however, they
> may work
> > in tandem, so please stop using the terms interchangeably as it may
> cause the
> > community to believe a regression has been caused.
> >
> >
> > The "Host HA" feature currently ships with only "Host HA" provider for
> KVM that
> > is strictly tied to out-of-band management (IPMI for fencing, i.e power
> off and
> > recovery, i.e. reboot) and NFS (as primary storage). (We also have a
> provider
> > for simulator, but that's for coverage/testing purposes).
> >
> >
> > Therefore, "Host HA" for KVM (+nfs) currently works only when OOBM is
> enabled.
> > The frameowkr allows interested parties may write their own HA providers
> for a
> > hypervisor that can use a different strategy/mechanism for
> fencing/recovery of
> > hosts (including write a non-IPMI based OOBM plugin) and host/disk
> activity
> > checker that is non-NFS based.
> >
> >
> > The "Host HA" feature ships disabled by default and does not cause any
> > interference with VM HA. However, when enabled and configured correctly,
> it is
> > a known limitation that when it is unable to successfully perform
> recovery or
> > fencing tasks it may not trigger VM HA. We can discuss how to handle
> such cases
> > (thoughts?). "Host HA" would try couple of times to recover and failing
> to do
> > so, it would eventually trigger a host fencing task. If it's unable to
> fence a
> > host, it will indefinitely attempt to fence the host (the host state
> will be
> > stuck at fencing state in cloud.ha_config table for example) and alerts
> will be
> > sent to admin who can do some manual intervention to handle such
> situations (if
> > you've email/smtp enabled, you should see alert emails).
> >
> >
> > We can discuss how to improve and have a workaround for the case you've
> hit,
> > thanks for sharing.
> >
> >
> > - Rohit
> >
> > ________________________________
> > From: Nux! <nu...@li.nux.ro>
> > Sent: Tuesday, January 16, 2018 10:42:35 PM
> > To: dev
> > Subject: Re: HA issues
> >
> > Ok, reinstalled and re-tested.
> >
> > What I've learned:
> >
> > - HA only works now if OOB is configured, the old way HA no longer
> applies -
> > this can be good and bad, not everyone has IPMIs
> >
> > - HA only works if IPMI is reachable. I've pulled the cord on a HV and
> HA failed
> > to do its thing, leaving me with a HV down along with all the VMs running
> > there. That's bad.
> > I've opened this ticket for it:
> > https://issues.apache.org/jira/browse/CLOUDSTACK-10234
> >
> > Let me know if you need any extra info or stuff to test.
> >
> > Regards,
> > Lucian
> >
> > --
> > Sent from the Delta quadrant using Borg technology!
> >
> > Nux!
> > www.nux.ro
> >
> > ----- Original Message -----
> >> From: "Nux!" <nu...@li.nux.ro>
> >> To: "dev" <de...@cloudstack.apache.org>
> >> Sent: Tuesday, 16 January, 2018 11:35:58
> >> Subject: Re: HA issues
> >
> >> I'll reinstall my setup and try again, just to be sure I'm working on a
> clean
> >> slate.
> >>
> >> --
> >> Sent from the Delta quadrant using Borg technology!
> >>
> >> Nux!
> >> www.nux.ro
> >>
> >> ----- Original Message -----
> >>> From: "Rohit Yadav" <ro...@shapeblue.com>
> >>> To: "dev" <de...@cloudstack.apache.org>
> >>> Sent: Tuesday, 16 January, 2018 11:29:51
> >>> Subject: Re: HA issues
> >>
> >>> Hi Lucian,
> >>>
> >>>
> >>> If you're talking about the new HostHA feature (with KVM+nfs+ipmi),
> please refer
> >>> to following docs:
> >>>
> >>>
> http://docs.cloudstack.apache.org/projects/cloudstack-administration/en/latest/hosts.html#out-of-band-management
> >>>
> >>> https://cwiki.apache.org/confluence/display/CLOUDSTACK/Host+HA
> >>>
> >>>
> >>> We'll need to you look at logs perhaps create a JIRA ticket with the
> logs and
> >>> details? If you saw ipmi based reboot, then host-ha indeed tried to
> recover
> >>> i.e. reboot the host, once hostha has done its work it would schedule
> HA for VM
> >>> as soon as the recovery operation succeeds (we've simulator and kvm
> based
> >>> marvin tests for such scenarios).
> >>>
> >>>
> >>> Can you see it making attempt to schedule VM ha in logs, or any
> failure?
> >>>
> >>>
> >>> - Rohit
> >>>
> >>> <https://cloudstack.apache.org>
> >>>
> >>>
> >>>
> >>> ________________________________
> >>> From: Nux! <nu...@li.nux.ro>
> >>> Sent: Tuesday, January 16, 2018 12:47:56 AM
> >>> To: dev
> >>> Subject: [4.11] HA issues
> >>>
> >>> Hi,
> >>>
> >>> I see there's a new HA engine for KVM and IPMI support which is really
> nice,
> >>> however it seems hit and miss.
> >>> I have created an instance with HA offering, kernel panicked one of the
> >>> hypervisors - after a while the server was rebooted via IPMI probably,
> but the
> >>> instance never moved to a running hypervisor and even after the o
> <https://maps.google.com/?q=to+a+running+hypervisor+and+even+after+the+o&entry=gmail&source=g>
> riginal
> >>> hypervisor came back it was still left in Stopped state.
> >>> Is there any extra things I need to set up to have proper HA?
> >>>
> >>> Regards,
> >>> Lucian
> >>>
> >>> --
> >>> Sent from the Delta quadrant using Borg technology!
> >>>
> >>> Nux!
> >>> www.nux.ro
> >>>
> >>> rohit.yadav@shapeblue.com
> >>> www.shapeblue.com<http://www.shapeblue.com>
> >>> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK
> > > > @shapeblue
>

Re: HA issues

Posted by Nux! <nu...@li.nux.ro>.
Thanks Rohit, 

I'll do more tests, try to figure it out. This thing is happening to me consistently on this setup, I'll use another and basic networking, see if it yields different results.

--
Sent from the Delta quadrant using Borg technology!

Nux!
www.nux.ro

----- Original Message -----
> From: "Rohit Yadav" <ro...@shapeblue.com>
> To: "dev" <de...@cloudstack.apache.org>, "Daan Hoogland" <da...@shapeblue.com>, "Nicolas Vazquez"
> <Ni...@shapeblue.com>, "Boris Stoyanov" <bo...@shapeblue.com>
> Sent: Friday, 19 January, 2018 08:59:00
> Subject: Re: HA issues

> Hi Lucian,
> 
> 
> Thanks for sharing, I still could not reproduce the issue. In my case, the KVM
> host went to "Down" state and VMs were started on other hosts. Given this may
> not be a generally reproducible issue, it could be marked Critical but may be
> not a blocker?
> 
> 
> Please open/update JIRA ticket with the details. /cc @Daan
> Hoogland<ma...@shapeblue.com> @Nicolas
> Vazquez<ma...@shapeblue.com> @Boris
> Stoyanov<ma...@shapeblue.com> and others
> 
> 
> - Rohit
> 
> <https://cloudstack.apache.org>
> 
> 
> 
> ________________________________
> From: Nux! <nu...@li.nux.ro>
> Sent: Wednesday, January 17, 2018 10:32:00 PM
> To: dev
> Subject: Re: HA issues
> 
> Hi Rohit,
> 
> I've reinstalled and tested. Still no go with VM HA.
> 
> What I did was to kernel panic that particular HV ("echo c >
> /proc/sysrq-trigger" <- this is a proper way to simulate a crash).
> What happened next is the HV got marked as "Alert", the VM on it was all the
> time marked as "Running" and it was not migrated to another HV.
> Once the panicked HV has booted back the VM reboots and becomes available.
> 
> I'm running on CentOS 7 mgmt + HVs and NFS primary and secondary storage. The VM
> has HA enabled service offering.
> Host HA or OOBM configuration was not touched.
> 
> Full log http://tmp.nux.ro/W3s-management-server.log
> 
> --
> Sent from the Delta quadrant using Borg technology!
> 
> Nux!
> www.nux.ro
> 
> 
> rohit.yadav@shapeblue.com
> www.shapeblue.com
> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK
> @shapeblue
>  
> 
> 
> ----- Original Message -----
>> From: "Rohit Yadav" <ro...@shapeblue.com>
>> To: "dev" <de...@cloudstack.apache.org>
>> Sent: Wednesday, 17 January, 2018 12:13:33
>> Subject: Re: HA issues
> 
>> I performed VM HA sanity checks and was not able to reproduce any regression
>> against two KVM CentOS7 hosts in a cluster.
>>
>>
>> Without the "Host HA" feature, I deployed few HA-enabled VMs on a KVM host2 and
>> killed it (powered off). After few minutes of CloudStack attempting to find why
>> the host (kvm agent) timed out, CloudStack kicked investigators, that
>> eventually led KVM fencers to work and VM HA job kicked to start those few VMs
>> on host1 and the KVM host2 was put to "Down" state.
>>
>>
>> - Rohit
>>
>> <https://cloudstack.apache.org>
>>
>>
>>
>> ________________________________
>>
>> rohit.yadav@shapeblue.com
>> www.shapeblue.com<http://www.shapeblue.com>
>> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK
>> @shapeblue
>>
>>
>>
>> From: Rohit Yadav
>> Sent: Wednesday, January 17, 2018 2:39:19 PM
>> To: dev
>> Subject: Re: HA issues
>>
>>
>> Hi Lucian,
>>
>>
>> The "Host HA" feature is entirely different from VM HA, however, they may work
>> in tandem, so please stop using the terms interchangeably as it may cause the
>> community to believe a regression has been caused.
>>
>>
>> The "Host HA" feature currently ships with only "Host HA" provider for KVM that
>> is strictly tied to out-of-band management (IPMI for fencing, i.e power off and
>> recovery, i.e. reboot) and NFS (as primary storage). (We also have a provider
>> for simulator, but that's for coverage/testing purposes).
>>
>>
>> Therefore, "Host HA" for KVM (+nfs) currently works only when OOBM is enabled.
>> The frameowkr allows interested parties may write their own HA providers for a
>> hypervisor that can use a different strategy/mechanism for fencing/recovery of
>> hosts (including write a non-IPMI based OOBM plugin) and host/disk activity
>> checker that is non-NFS based.
>>
>>
>> The "Host HA" feature ships disabled by default and does not cause any
>> interference with VM HA. However, when enabled and configured correctly, it is
>> a known limitation that when it is unable to successfully perform recovery or
>> fencing tasks it may not trigger VM HA. We can discuss how to handle such cases
>> (thoughts?). "Host HA" would try couple of times to recover and failing to do
>> so, it would eventually trigger a host fencing task. If it's unable to fence a
>> host, it will indefinitely attempt to fence the host (the host state will be
>> stuck at fencing state in cloud.ha_config table for example) and alerts will be
>> sent to admin who can do some manual intervention to handle such situations (if
>> you've email/smtp enabled, you should see alert emails).
>>
>>
>> We can discuss how to improve and have a workaround for the case you've hit,
>> thanks for sharing.
>>
>>
>> - Rohit
>>
>> ________________________________
>> From: Nux! <nu...@li.nux.ro>
>> Sent: Tuesday, January 16, 2018 10:42:35 PM
>> To: dev
>> Subject: Re: HA issues
>>
>> Ok, reinstalled and re-tested.
>>
>> What I've learned:
>>
>> - HA only works now if OOB is configured, the old way HA no longer applies -
>> this can be good and bad, not everyone has IPMIs
>>
>> - HA only works if IPMI is reachable. I've pulled the cord on a HV and HA failed
>> to do its thing, leaving me with a HV down along with all the VMs running
>> there. That's bad.
>> I've opened this ticket for it:
>> https://issues.apache.org/jira/browse/CLOUDSTACK-10234
>>
>> Let me know if you need any extra info or stuff to test.
>>
>> Regards,
>> Lucian
>>
>> --
>> Sent from the Delta quadrant using Borg technology!
>>
>> Nux!
>> www.nux.ro
>>
>> ----- Original Message -----
>>> From: "Nux!" <nu...@li.nux.ro>
>>> To: "dev" <de...@cloudstack.apache.org>
>>> Sent: Tuesday, 16 January, 2018 11:35:58
>>> Subject: Re: HA issues
>>
>>> I'll reinstall my setup and try again, just to be sure I'm working on a clean
>>> slate.
>>>
>>> --
>>> Sent from the Delta quadrant using Borg technology!
>>>
>>> Nux!
>>> www.nux.ro
>>>
>>> ----- Original Message -----
>>>> From: "Rohit Yadav" <ro...@shapeblue.com>
>>>> To: "dev" <de...@cloudstack.apache.org>
>>>> Sent: Tuesday, 16 January, 2018 11:29:51
>>>> Subject: Re: HA issues
>>>
>>>> Hi Lucian,
>>>>
>>>>
>>>> If you're talking about the new HostHA feature (with KVM+nfs+ipmi), please refer
>>>> to following docs:
>>>>
>>>> http://docs.cloudstack.apache.org/projects/cloudstack-administration/en/latest/hosts.html#out-of-band-management
>>>>
>>>> https://cwiki.apache.org/confluence/display/CLOUDSTACK/Host+HA
>>>>
>>>>
>>>> We'll need to you look at logs perhaps create a JIRA ticket with the logs and
>>>> details? If you saw ipmi based reboot, then host-ha indeed tried to recover
>>>> i.e. reboot the host, once hostha has done its work it would schedule HA for VM
>>>> as soon as the recovery operation succeeds (we've simulator and kvm based
>>>> marvin tests for such scenarios).
>>>>
>>>>
>>>> Can you see it making attempt to schedule VM ha in logs, or any failure?
>>>>
>>>>
>>>> - Rohit
>>>>
>>>> <https://cloudstack.apache.org>
>>>>
>>>>
>>>>
>>>> ________________________________
>>>> From: Nux! <nu...@li.nux.ro>
>>>> Sent: Tuesday, January 16, 2018 12:47:56 AM
>>>> To: dev
>>>> Subject: [4.11] HA issues
>>>>
>>>> Hi,
>>>>
>>>> I see there's a new HA engine for KVM and IPMI support which is really nice,
>>>> however it seems hit and miss.
>>>> I have created an instance with HA offering, kernel panicked one of the
>>>> hypervisors - after a while the server was rebooted via IPMI probably, but the
>>>> instance never moved to a running hypervisor and even after the original
>>>> hypervisor came back it was still left in Stopped state.
>>>> Is there any extra things I need to set up to have proper HA?
>>>>
>>>> Regards,
>>>> Lucian
>>>>
>>>> --
>>>> Sent from the Delta quadrant using Borg technology!
>>>>
>>>> Nux!
>>>> www.nux.ro
>>>>
>>>> rohit.yadav@shapeblue.com
>>>> www.shapeblue.com<http://www.shapeblue.com>
>>>> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK
> > > > @shapeblue

Re: HA issues

Posted by Rohit Yadav <ro...@shapeblue.com>.
Hi Lucian,


Thanks for sharing, I still could not reproduce the issue. In my case, the KVM host went to "Down" state and VMs were started on other hosts. Given this may not be a generally reproducible issue, it could be marked Critical but may be not a blocker?


Please open/update JIRA ticket with the details. /cc @Daan Hoogland<ma...@shapeblue.com> @Nicolas Vazquez<ma...@shapeblue.com> @Boris Stoyanov<ma...@shapeblue.com> and others


- Rohit

<https://cloudstack.apache.org>



________________________________
From: Nux! <nu...@li.nux.ro>
Sent: Wednesday, January 17, 2018 10:32:00 PM
To: dev
Subject: Re: HA issues

Hi Rohit,

I've reinstalled and tested. Still no go with VM HA.

What I did was to kernel panic that particular HV ("echo c > /proc/sysrq-trigger" <- this is a proper way to simulate a crash).
What happened next is the HV got marked as "Alert", the VM on it was all the time marked as "Running" and it was not migrated to another HV.
Once the panicked HV has booted back the VM reboots and becomes available.

I'm running on CentOS 7 mgmt + HVs and NFS primary and secondary storage. The VM has HA enabled service offering.
Host HA or OOBM configuration was not touched.

Full log http://tmp.nux.ro/W3s-management-server.log

--
Sent from the Delta quadrant using Borg technology!

Nux!
www.nux.ro


rohit.yadav@shapeblue.com 
www.shapeblue.com
53 Chandos Place, Covent Garden, London  WC2N 4HSUK
@shapeblue
  
 

----- Original Message -----
> From: "Rohit Yadav" <ro...@shapeblue.com>
> To: "dev" <de...@cloudstack.apache.org>
> Sent: Wednesday, 17 January, 2018 12:13:33
> Subject: Re: HA issues

> I performed VM HA sanity checks and was not able to reproduce any regression
> against two KVM CentOS7 hosts in a cluster.
>
>
> Without the "Host HA" feature, I deployed few HA-enabled VMs on a KVM host2 and
> killed it (powered off). After few minutes of CloudStack attempting to find why
> the host (kvm agent) timed out, CloudStack kicked investigators, that
> eventually led KVM fencers to work and VM HA job kicked to start those few VMs
> on host1 and the KVM host2 was put to "Down" state.
>
>
> - Rohit
>
> <https://cloudstack.apache.org>
>
>
>
> ________________________________
>
> rohit.yadav@shapeblue.com
> www.shapeblue.com<http://www.shapeblue.com>
> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK
> @shapeblue
>
>
>
> From: Rohit Yadav
> Sent: Wednesday, January 17, 2018 2:39:19 PM
> To: dev
> Subject: Re: HA issues
>
>
> Hi Lucian,
>
>
> The "Host HA" feature is entirely different from VM HA, however, they may work
> in tandem, so please stop using the terms interchangeably as it may cause the
> community to believe a regression has been caused.
>
>
> The "Host HA" feature currently ships with only "Host HA" provider for KVM that
> is strictly tied to out-of-band management (IPMI for fencing, i.e power off and
> recovery, i.e. reboot) and NFS (as primary storage). (We also have a provider
> for simulator, but that's for coverage/testing purposes).
>
>
> Therefore, "Host HA" for KVM (+nfs) currently works only when OOBM is enabled.
> The frameowkr allows interested parties may write their own HA providers for a
> hypervisor that can use a different strategy/mechanism for fencing/recovery of
> hosts (including write a non-IPMI based OOBM plugin) and host/disk activity
> checker that is non-NFS based.
>
>
> The "Host HA" feature ships disabled by default and does not cause any
> interference with VM HA. However, when enabled and configured correctly, it is
> a known limitation that when it is unable to successfully perform recovery or
> fencing tasks it may not trigger VM HA. We can discuss how to handle such cases
> (thoughts?). "Host HA" would try couple of times to recover and failing to do
> so, it would eventually trigger a host fencing task. If it's unable to fence a
> host, it will indefinitely attempt to fence the host (the host state will be
> stuck at fencing state in cloud.ha_config table for example) and alerts will be
> sent to admin who can do some manual intervention to handle such situations (if
> you've email/smtp enabled, you should see alert emails).
>
>
> We can discuss how to improve and have a workaround for the case you've hit,
> thanks for sharing.
>
>
> - Rohit
>
> ________________________________
> From: Nux! <nu...@li.nux.ro>
> Sent: Tuesday, January 16, 2018 10:42:35 PM
> To: dev
> Subject: Re: HA issues
>
> Ok, reinstalled and re-tested.
>
> What I've learned:
>
> - HA only works now if OOB is configured, the old way HA no longer applies -
> this can be good and bad, not everyone has IPMIs
>
> - HA only works if IPMI is reachable. I've pulled the cord on a HV and HA failed
> to do its thing, leaving me with a HV down along with all the VMs running
> there. That's bad.
> I've opened this ticket for it:
> https://issues.apache.org/jira/browse/CLOUDSTACK-10234
>
> Let me know if you need any extra info or stuff to test.
>
> Regards,
> Lucian
>
> --
> Sent from the Delta quadrant using Borg technology!
>
> Nux!
> www.nux.ro
>
> ----- Original Message -----
>> From: "Nux!" <nu...@li.nux.ro>
>> To: "dev" <de...@cloudstack.apache.org>
>> Sent: Tuesday, 16 January, 2018 11:35:58
>> Subject: Re: HA issues
>
>> I'll reinstall my setup and try again, just to be sure I'm working on a clean
>> slate.
>>
>> --
>> Sent from the Delta quadrant using Borg technology!
>>
>> Nux!
>> www.nux.ro
>>
>> ----- Original Message -----
>>> From: "Rohit Yadav" <ro...@shapeblue.com>
>>> To: "dev" <de...@cloudstack.apache.org>
>>> Sent: Tuesday, 16 January, 2018 11:29:51
>>> Subject: Re: HA issues
>>
>>> Hi Lucian,
>>>
>>>
>>> If you're talking about the new HostHA feature (with KVM+nfs+ipmi), please refer
>>> to following docs:
>>>
>>> http://docs.cloudstack.apache.org/projects/cloudstack-administration/en/latest/hosts.html#out-of-band-management
>>>
>>> https://cwiki.apache.org/confluence/display/CLOUDSTACK/Host+HA
>>>
>>>
>>> We'll need to you look at logs perhaps create a JIRA ticket with the logs and
>>> details? If you saw ipmi based reboot, then host-ha indeed tried to recover
>>> i.e. reboot the host, once hostha has done its work it would schedule HA for VM
>>> as soon as the recovery operation succeeds (we've simulator and kvm based
>>> marvin tests for such scenarios).
>>>
>>>
>>> Can you see it making attempt to schedule VM ha in logs, or any failure?
>>>
>>>
>>> - Rohit
>>>
>>> <https://cloudstack.apache.org>
>>>
>>>
>>>
>>> ________________________________
>>> From: Nux! <nu...@li.nux.ro>
>>> Sent: Tuesday, January 16, 2018 12:47:56 AM
>>> To: dev
>>> Subject: [4.11] HA issues
>>>
>>> Hi,
>>>
>>> I see there's a new HA engine for KVM and IPMI support which is really nice,
>>> however it seems hit and miss.
>>> I have created an instance with HA offering, kernel panicked one of the
>>> hypervisors - after a while the server was rebooted via IPMI probably, but the
>>> instance never moved to a running hypervisor and even after the original
>>> hypervisor came back it was still left in Stopped state.
>>> Is there any extra things I need to set up to have proper HA?
>>>
>>> Regards,
>>> Lucian
>>>
>>> --
>>> Sent from the Delta quadrant using Borg technology!
>>>
>>> Nux!
>>> www.nux.ro
>>>
>>> rohit.yadav@shapeblue.com
>>> www.shapeblue.com<http://www.shapeblue.com>
>>> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK
> > > @shapeblue

Re: HA issues

Posted by Nux! <nu...@li.nux.ro>.
Hi Rohit,

I've reinstalled and tested. Still no go with VM HA.

What I did was to kernel panic that particular HV ("echo c > /proc/sysrq-trigger" <- this is a proper way to simulate a crash).
What happened next is the HV got marked as "Alert", the VM on it was all the time marked as "Running" and it was not migrated to another HV.
Once the panicked HV has booted back the VM reboots and becomes available.

I'm running on CentOS 7 mgmt + HVs and NFS primary and secondary storage. The VM has HA enabled service offering.
Host HA or OOBM configuration was not touched.

Full log http://tmp.nux.ro/W3s-management-server.log

--
Sent from the Delta quadrant using Borg technology!

Nux!
www.nux.ro

----- Original Message -----
> From: "Rohit Yadav" <ro...@shapeblue.com>
> To: "dev" <de...@cloudstack.apache.org>
> Sent: Wednesday, 17 January, 2018 12:13:33
> Subject: Re: HA issues

> I performed VM HA sanity checks and was not able to reproduce any regression
> against two KVM CentOS7 hosts in a cluster.
> 
> 
> Without the "Host HA" feature, I deployed few HA-enabled VMs on a KVM host2 and
> killed it (powered off). After few minutes of CloudStack attempting to find why
> the host (kvm agent) timed out, CloudStack kicked investigators, that
> eventually led KVM fencers to work and VM HA job kicked to start those few VMs
> on host1 and the KVM host2 was put to "Down" state.
> 
> 
> - Rohit
> 
> <https://cloudstack.apache.org>
> 
> 
> 
> ________________________________
> 
> rohit.yadav@shapeblue.com
> www.shapeblue.com
> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK
> @shapeblue
>  
> 
> 
> From: Rohit Yadav
> Sent: Wednesday, January 17, 2018 2:39:19 PM
> To: dev
> Subject: Re: HA issues
> 
> 
> Hi Lucian,
> 
> 
> The "Host HA" feature is entirely different from VM HA, however, they may work
> in tandem, so please stop using the terms interchangeably as it may cause the
> community to believe a regression has been caused.
> 
> 
> The "Host HA" feature currently ships with only "Host HA" provider for KVM that
> is strictly tied to out-of-band management (IPMI for fencing, i.e power off and
> recovery, i.e. reboot) and NFS (as primary storage). (We also have a provider
> for simulator, but that's for coverage/testing purposes).
> 
> 
> Therefore, "Host HA" for KVM (+nfs) currently works only when OOBM is enabled.
> The frameowkr allows interested parties may write their own HA providers for a
> hypervisor that can use a different strategy/mechanism for fencing/recovery of
> hosts (including write a non-IPMI based OOBM plugin) and host/disk activity
> checker that is non-NFS based.
> 
> 
> The "Host HA" feature ships disabled by default and does not cause any
> interference with VM HA. However, when enabled and configured correctly, it is
> a known limitation that when it is unable to successfully perform recovery or
> fencing tasks it may not trigger VM HA. We can discuss how to handle such cases
> (thoughts?). "Host HA" would try couple of times to recover and failing to do
> so, it would eventually trigger a host fencing task. If it's unable to fence a
> host, it will indefinitely attempt to fence the host (the host state will be
> stuck at fencing state in cloud.ha_config table for example) and alerts will be
> sent to admin who can do some manual intervention to handle such situations (if
> you've email/smtp enabled, you should see alert emails).
> 
> 
> We can discuss how to improve and have a workaround for the case you've hit,
> thanks for sharing.
> 
> 
> - Rohit
> 
> ________________________________
> From: Nux! <nu...@li.nux.ro>
> Sent: Tuesday, January 16, 2018 10:42:35 PM
> To: dev
> Subject: Re: HA issues
> 
> Ok, reinstalled and re-tested.
> 
> What I've learned:
> 
> - HA only works now if OOB is configured, the old way HA no longer applies -
> this can be good and bad, not everyone has IPMIs
> 
> - HA only works if IPMI is reachable. I've pulled the cord on a HV and HA failed
> to do its thing, leaving me with a HV down along with all the VMs running
> there. That's bad.
> I've opened this ticket for it:
> https://issues.apache.org/jira/browse/CLOUDSTACK-10234
> 
> Let me know if you need any extra info or stuff to test.
> 
> Regards,
> Lucian
> 
> --
> Sent from the Delta quadrant using Borg technology!
> 
> Nux!
> www.nux.ro
> 
> ----- Original Message -----
>> From: "Nux!" <nu...@li.nux.ro>
>> To: "dev" <de...@cloudstack.apache.org>
>> Sent: Tuesday, 16 January, 2018 11:35:58
>> Subject: Re: HA issues
> 
>> I'll reinstall my setup and try again, just to be sure I'm working on a clean
>> slate.
>>
>> --
>> Sent from the Delta quadrant using Borg technology!
>>
>> Nux!
>> www.nux.ro
>>
>> ----- Original Message -----
>>> From: "Rohit Yadav" <ro...@shapeblue.com>
>>> To: "dev" <de...@cloudstack.apache.org>
>>> Sent: Tuesday, 16 January, 2018 11:29:51
>>> Subject: Re: HA issues
>>
>>> Hi Lucian,
>>>
>>>
>>> If you're talking about the new HostHA feature (with KVM+nfs+ipmi), please refer
>>> to following docs:
>>>
>>> http://docs.cloudstack.apache.org/projects/cloudstack-administration/en/latest/hosts.html#out-of-band-management
>>>
>>> https://cwiki.apache.org/confluence/display/CLOUDSTACK/Host+HA
>>>
>>>
>>> We'll need to you look at logs perhaps create a JIRA ticket with the logs and
>>> details? If you saw ipmi based reboot, then host-ha indeed tried to recover
>>> i.e. reboot the host, once hostha has done its work it would schedule HA for VM
>>> as soon as the recovery operation succeeds (we've simulator and kvm based
>>> marvin tests for such scenarios).
>>>
>>>
>>> Can you see it making attempt to schedule VM ha in logs, or any failure?
>>>
>>>
>>> - Rohit
>>>
>>> <https://cloudstack.apache.org>
>>>
>>>
>>>
>>> ________________________________
>>> From: Nux! <nu...@li.nux.ro>
>>> Sent: Tuesday, January 16, 2018 12:47:56 AM
>>> To: dev
>>> Subject: [4.11] HA issues
>>>
>>> Hi,
>>>
>>> I see there's a new HA engine for KVM and IPMI support which is really nice,
>>> however it seems hit and miss.
>>> I have created an instance with HA offering, kernel panicked one of the
>>> hypervisors - after a while the server was rebooted via IPMI probably, but the
>>> instance never moved to a running hypervisor and even after the original
>>> hypervisor came back it was still left in Stopped state.
>>> Is there any extra things I need to set up to have proper HA?
>>>
>>> Regards,
>>> Lucian
>>>
>>> --
>>> Sent from the Delta quadrant using Borg technology!
>>>
>>> Nux!
>>> www.nux.ro
>>>
>>> rohit.yadav@shapeblue.com
>>> www.shapeblue.com<http://www.shapeblue.com>
>>> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK
> > > @shapeblue

Re: HA issues

Posted by Rohit Yadav <ro...@shapeblue.com>.
I performed VM HA sanity checks and was not able to reproduce any regression against two KVM CentOS7 hosts in a cluster.


Without the "Host HA" feature, I deployed few HA-enabled VMs on a KVM host2 and killed it (powered off). After few minutes of CloudStack attempting to find why the host (kvm agent) timed out, CloudStack kicked investigators, that eventually led KVM fencers to work and VM HA job kicked to start those few VMs on host1 and the KVM host2 was put to "Down" state.


- Rohit

<https://cloudstack.apache.org>



________________________________

rohit.yadav@shapeblue.com 
www.shapeblue.com
53 Chandos Place, Covent Garden, London  WC2N 4HSUK
@shapeblue
  
 

From: Rohit Yadav
Sent: Wednesday, January 17, 2018 2:39:19 PM
To: dev
Subject: Re: HA issues


Hi Lucian,


The "Host HA" feature is entirely different from VM HA, however, they may work in tandem, so please stop using the terms interchangeably as it may cause the community to believe a regression has been caused.


The "Host HA" feature currently ships with only "Host HA" provider for KVM that is strictly tied to out-of-band management (IPMI for fencing, i.e power off and recovery, i.e. reboot) and NFS (as primary storage). (We also have a provider for simulator, but that's for coverage/testing purposes).


Therefore, "Host HA" for KVM (+nfs) currently works only when OOBM is enabled. The frameowkr allows interested parties may write their own HA providers for a hypervisor that can use a different strategy/mechanism for fencing/recovery of hosts (including write a non-IPMI based OOBM plugin) and host/disk activity checker that is non-NFS based.


The "Host HA" feature ships disabled by default and does not cause any interference with VM HA. However, when enabled and configured correctly, it is a known limitation that when it is unable to successfully perform recovery or fencing tasks it may not trigger VM HA. We can discuss how to handle such cases (thoughts?). "Host HA" would try couple of times to recover and failing to do so, it would eventually trigger a host fencing task. If it's unable to fence a host, it will indefinitely attempt to fence the host (the host state will be stuck at fencing state in cloud.ha_config table for example) and alerts will be sent to admin who can do some manual intervention to handle such situations (if you've email/smtp enabled, you should see alert emails).


We can discuss how to improve and have a workaround for the case you've hit, thanks for sharing.


- Rohit

________________________________
From: Nux! <nu...@li.nux.ro>
Sent: Tuesday, January 16, 2018 10:42:35 PM
To: dev
Subject: Re: HA issues

Ok, reinstalled and re-tested.

What I've learned:

- HA only works now if OOB is configured, the old way HA no longer applies - this can be good and bad, not everyone has IPMIs

- HA only works if IPMI is reachable. I've pulled the cord on a HV and HA failed to do its thing, leaving me with a HV down along with all the VMs running there. That's bad.
I've opened this ticket for it:
https://issues.apache.org/jira/browse/CLOUDSTACK-10234

Let me know if you need any extra info or stuff to test.

Regards,
Lucian

--
Sent from the Delta quadrant using Borg technology!

Nux!
www.nux.ro

----- Original Message -----
> From: "Nux!" <nu...@li.nux.ro>
> To: "dev" <de...@cloudstack.apache.org>
> Sent: Tuesday, 16 January, 2018 11:35:58
> Subject: Re: HA issues

> I'll reinstall my setup and try again, just to be sure I'm working on a clean
> slate.
>
> --
> Sent from the Delta quadrant using Borg technology!
>
> Nux!
> www.nux.ro
>
> ----- Original Message -----
>> From: "Rohit Yadav" <ro...@shapeblue.com>
>> To: "dev" <de...@cloudstack.apache.org>
>> Sent: Tuesday, 16 January, 2018 11:29:51
>> Subject: Re: HA issues
>
>> Hi Lucian,
>>
>>
>> If you're talking about the new HostHA feature (with KVM+nfs+ipmi), please refer
>> to following docs:
>>
>> http://docs.cloudstack.apache.org/projects/cloudstack-administration/en/latest/hosts.html#out-of-band-management
>>
>> https://cwiki.apache.org/confluence/display/CLOUDSTACK/Host+HA
>>
>>
>> We'll need to you look at logs perhaps create a JIRA ticket with the logs and
>> details? If you saw ipmi based reboot, then host-ha indeed tried to recover
>> i.e. reboot the host, once hostha has done its work it would schedule HA for VM
>> as soon as the recovery operation succeeds (we've simulator and kvm based
>> marvin tests for such scenarios).
>>
>>
>> Can you see it making attempt to schedule VM ha in logs, or any failure?
>>
>>
>> - Rohit
>>
>> <https://cloudstack.apache.org>
>>
>>
>>
>> ________________________________
>> From: Nux! <nu...@li.nux.ro>
>> Sent: Tuesday, January 16, 2018 12:47:56 AM
>> To: dev
>> Subject: [4.11] HA issues
>>
>> Hi,
>>
>> I see there's a new HA engine for KVM and IPMI support which is really nice,
>> however it seems hit and miss.
>> I have created an instance with HA offering, kernel panicked one of the
>> hypervisors - after a while the server was rebooted via IPMI probably, but the
>> instance never moved to a running hypervisor and even after the original
>> hypervisor came back it was still left in Stopped state.
>> Is there any extra things I need to set up to have proper HA?
>>
>> Regards,
>> Lucian
>>
>> --
>> Sent from the Delta quadrant using Borg technology!
>>
>> Nux!
>> www.nux.ro
>>
>> rohit.yadav@shapeblue.com
>> www.shapeblue.com<http://www.shapeblue.com>
>> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK
> > @shapeblue

Re: HA issues

Posted by Rohit Yadav <ro...@shapeblue.com>.
Hi Lucian,


The "Host HA" feature is entirely different from VM HA, however, they may work in tandem, so please stop using the terms interchangeably as it may cause the community to believe a regression has been caused.


The "Host HA" feature currently ships with only "Host HA" provider for KVM that is strictly tied to out-of-band management (IPMI for fencing, i.e power off and recovery, i.e. reboot) and NFS (as primary storage). (We also have a provider for simulator, but that's for coverage/testing purposes).


Therefore, "Host HA" for KVM (+nfs) currently works only when OOBM is enabled. The frameowkr allows interested parties may write their own HA providers for a hypervisor that can use a different strategy/mechanism for fencing/recovery of hosts (including write a non-IPMI based OOBM plugin) and host/disk activity checker that is non-NFS based.


The "Host HA" feature ships disabled by default and does not cause any interference with VM HA. However, when enabled and configured correctly, it is a known limitation that when it is unable to successfully perform recovery or fencing tasks it may not trigger VM HA. We can discuss how to handle such cases (thoughts?). "Host HA" would try couple of times to recover and failing to do so, it would eventually trigger a host fencing task. If it's unable to fence a host, it will indefinitely attempt to fence the host (the host state will be stuck at fencing state in cloud.ha_config table for example) and alerts will be sent to admin who can do some manual intervention to handle such situations (if you've email/smtp enabled, you should see alert emails).


We can discuss how to improve and have a workaround for the case you've hit, thanks for sharing.


- Rohit

________________________________
From: Nux! <nu...@li.nux.ro>
Sent: Tuesday, January 16, 2018 10:42:35 PM
To: dev
Subject: Re: HA issues

Ok, reinstalled and re-tested.

What I've learned:

- HA only works now if OOB is configured, the old way HA no longer applies - this can be good and bad, not everyone has IPMIs

- HA only works if IPMI is reachable. I've pulled the cord on a HV and HA failed to do its thing, leaving me with a HV down along with all the VMs running there. That's bad.
I've opened this ticket for it:
https://issues.apache.org/jira/browse/CLOUDSTACK-10234

Let me know if you need any extra info or stuff to test.

Regards,
Lucian

--
Sent from the Delta quadrant using Borg technology!

Nux!
www.nux.ro


rohit.yadav@shapeblue.com 
www.shapeblue.com
53 Chandos Place, Covent Garden, London  WC2N 4HSUK
@shapeblue
  
 

----- Original Message -----
> From: "Nux!" <nu...@li.nux.ro>
> To: "dev" <de...@cloudstack.apache.org>
> Sent: Tuesday, 16 January, 2018 11:35:58
> Subject: Re: HA issues

> I'll reinstall my setup and try again, just to be sure I'm working on a clean
> slate.
>
> --
> Sent from the Delta quadrant using Borg technology!
>
> Nux!
> www.nux.ro
>
> ----- Original Message -----
>> From: "Rohit Yadav" <ro...@shapeblue.com>
>> To: "dev" <de...@cloudstack.apache.org>
>> Sent: Tuesday, 16 January, 2018 11:29:51
>> Subject: Re: HA issues
>
>> Hi Lucian,
>>
>>
>> If you're talking about the new HostHA feature (with KVM+nfs+ipmi), please refer
>> to following docs:
>>
>> http://docs.cloudstack.apache.org/projects/cloudstack-administration/en/latest/hosts.html#out-of-band-management
>>
>> https://cwiki.apache.org/confluence/display/CLOUDSTACK/Host+HA
>>
>>
>> We'll need to you look at logs perhaps create a JIRA ticket with the logs and
>> details? If you saw ipmi based reboot, then host-ha indeed tried to recover
>> i.e. reboot the host, once hostha has done its work it would schedule HA for VM
>> as soon as the recovery operation succeeds (we've simulator and kvm based
>> marvin tests for such scenarios).
>>
>>
>> Can you see it making attempt to schedule VM ha in logs, or any failure?
>>
>>
>> - Rohit
>>
>> <https://cloudstack.apache.org>
>>
>>
>>
>> ________________________________
>> From: Nux! <nu...@li.nux.ro>
>> Sent: Tuesday, January 16, 2018 12:47:56 AM
>> To: dev
>> Subject: [4.11] HA issues
>>
>> Hi,
>>
>> I see there's a new HA engine for KVM and IPMI support which is really nice,
>> however it seems hit and miss.
>> I have created an instance with HA offering, kernel panicked one of the
>> hypervisors - after a while the server was rebooted via IPMI probably, but the
>> instance never moved to a running hypervisor and even after the original
>> hypervisor came back it was still left in Stopped state.
>> Is there any extra things I need to set up to have proper HA?
>>
>> Regards,
>> Lucian
>>
>> --
>> Sent from the Delta quadrant using Borg technology!
>>
>> Nux!
>> www.nux.ro
>>
>> rohit.yadav@shapeblue.com
>> www.shapeblue.com<http://www.shapeblue.com>
>> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK
> > @shapeblue

Re: HA issues

Posted by Nux! <nu...@li.nux.ro>.
Ok, reinstalled and re-tested.

What I've learned:

- HA only works now if OOB is configured, the old way HA no longer applies - this can be good and bad, not everyone has IPMIs

- HA only works if IPMI is reachable. I've pulled the cord on a HV and HA failed to do its thing, leaving me with a HV down along with all the VMs running there. That's bad.
I've opened this ticket for it:
https://issues.apache.org/jira/browse/CLOUDSTACK-10234

Let me know if you need any extra info or stuff to test.

Regards,
Lucian

--
Sent from the Delta quadrant using Borg technology!

Nux!
www.nux.ro

----- Original Message -----
> From: "Nux!" <nu...@li.nux.ro>
> To: "dev" <de...@cloudstack.apache.org>
> Sent: Tuesday, 16 January, 2018 11:35:58
> Subject: Re: HA issues

> I'll reinstall my setup and try again, just to be sure I'm working on a clean
> slate.
> 
> --
> Sent from the Delta quadrant using Borg technology!
> 
> Nux!
> www.nux.ro
> 
> ----- Original Message -----
>> From: "Rohit Yadav" <ro...@shapeblue.com>
>> To: "dev" <de...@cloudstack.apache.org>
>> Sent: Tuesday, 16 January, 2018 11:29:51
>> Subject: Re: HA issues
> 
>> Hi Lucian,
>> 
>> 
>> If you're talking about the new HostHA feature (with KVM+nfs+ipmi), please refer
>> to following docs:
>> 
>> http://docs.cloudstack.apache.org/projects/cloudstack-administration/en/latest/hosts.html#out-of-band-management
>> 
>> https://cwiki.apache.org/confluence/display/CLOUDSTACK/Host+HA
>> 
>> 
>> We'll need to you look at logs perhaps create a JIRA ticket with the logs and
>> details? If you saw ipmi based reboot, then host-ha indeed tried to recover
>> i.e. reboot the host, once hostha has done its work it would schedule HA for VM
>> as soon as the recovery operation succeeds (we've simulator and kvm based
>> marvin tests for such scenarios).
>> 
>> 
>> Can you see it making attempt to schedule VM ha in logs, or any failure?
>> 
>> 
>> - Rohit
>> 
>> <https://cloudstack.apache.org>
>> 
>> 
>> 
>> ________________________________
>> From: Nux! <nu...@li.nux.ro>
>> Sent: Tuesday, January 16, 2018 12:47:56 AM
>> To: dev
>> Subject: [4.11] HA issues
>> 
>> Hi,
>> 
>> I see there's a new HA engine for KVM and IPMI support which is really nice,
>> however it seems hit and miss.
>> I have created an instance with HA offering, kernel panicked one of the
>> hypervisors - after a while the server was rebooted via IPMI probably, but the
>> instance never moved to a running hypervisor and even after the original
>> hypervisor came back it was still left in Stopped state.
>> Is there any extra things I need to set up to have proper HA?
>> 
>> Regards,
>> Lucian
>> 
>> --
>> Sent from the Delta quadrant using Borg technology!
>> 
>> Nux!
>> www.nux.ro
>> 
>> rohit.yadav@shapeblue.com
>> www.shapeblue.com
>> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK
> > @shapeblue

Re: HA issues

Posted by Nux! <nu...@li.nux.ro>.
I'll reinstall my setup and try again, just to be sure I'm working on a clean slate.

--
Sent from the Delta quadrant using Borg technology!

Nux!
www.nux.ro

----- Original Message -----
> From: "Rohit Yadav" <ro...@shapeblue.com>
> To: "dev" <de...@cloudstack.apache.org>
> Sent: Tuesday, 16 January, 2018 11:29:51
> Subject: Re: HA issues

> Hi Lucian,
> 
> 
> If you're talking about the new HostHA feature (with KVM+nfs+ipmi), please refer
> to following docs:
> 
> http://docs.cloudstack.apache.org/projects/cloudstack-administration/en/latest/hosts.html#out-of-band-management
> 
> https://cwiki.apache.org/confluence/display/CLOUDSTACK/Host+HA
> 
> 
> We'll need to you look at logs perhaps create a JIRA ticket with the logs and
> details? If you saw ipmi based reboot, then host-ha indeed tried to recover
> i.e. reboot the host, once hostha has done its work it would schedule HA for VM
> as soon as the recovery operation succeeds (we've simulator and kvm based
> marvin tests for such scenarios).
> 
> 
> Can you see it making attempt to schedule VM ha in logs, or any failure?
> 
> 
> - Rohit
> 
> <https://cloudstack.apache.org>
> 
> 
> 
> ________________________________
> From: Nux! <nu...@li.nux.ro>
> Sent: Tuesday, January 16, 2018 12:47:56 AM
> To: dev
> Subject: [4.11] HA issues
> 
> Hi,
> 
> I see there's a new HA engine for KVM and IPMI support which is really nice,
> however it seems hit and miss.
> I have created an instance with HA offering, kernel panicked one of the
> hypervisors - after a while the server was rebooted via IPMI probably, but the
> instance never moved to a running hypervisor and even after the original
> hypervisor came back it was still left in Stopped state.
> Is there any extra things I need to set up to have proper HA?
> 
> Regards,
> Lucian
> 
> --
> Sent from the Delta quadrant using Borg technology!
> 
> Nux!
> www.nux.ro
> 
> rohit.yadav@shapeblue.com
> www.shapeblue.com
> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK
> @shapeblue

Re: HA issues

Posted by Rohit Yadav <ro...@shapeblue.com>.
Hi Lucian,


If you're talking about the new HostHA feature (with KVM+nfs+ipmi), please refer to following docs:

http://docs.cloudstack.apache.org/projects/cloudstack-administration/en/latest/hosts.html#out-of-band-management

https://cwiki.apache.org/confluence/display/CLOUDSTACK/Host+HA


We'll need to you look at logs perhaps create a JIRA ticket with the logs and details? If you saw ipmi based reboot, then host-ha indeed tried to recover i.e. reboot the host, once hostha has done its work it would schedule HA for VM as soon as the recovery operation succeeds (we've simulator and kvm based marvin tests for such scenarios).


Can you see it making attempt to schedule VM ha in logs, or any failure?


- Rohit

<https://cloudstack.apache.org>



________________________________
From: Nux! <nu...@li.nux.ro>
Sent: Tuesday, January 16, 2018 12:47:56 AM
To: dev
Subject: [4.11] HA issues

Hi,

I see there's a new HA engine for KVM and IPMI support which is really nice, however it seems hit and miss.
I have created an instance with HA offering, kernel panicked one of the hypervisors - after a while the server was rebooted via IPMI probably, but the instance never moved to a running hypervisor and even after the original hypervisor came back it was still left in Stopped state.
Is there any extra things I need to set up to have proper HA?

Regards,
Lucian

--
Sent from the Delta quadrant using Borg technology!

Nux!
www.nux.ro

rohit.yadav@shapeblue.com 
www.shapeblue.com
53 Chandos Place, Covent Garden, London  WC2N 4HSUK
@shapeblue