Posted to dev@cloudstack.apache.org by Sean Lair <sl...@ippathways.com> on 2018/03/01 21:07:47 UTC

RE: HA issues

Based on your note we made the following change:

https://github.com/apache/cloudstack/pull/2472

It adds a sleep between retries and then stops the cloudstack-agent if it still can't write the heartbeat file after the retries...  At least this way an alert is raised instead of a hard reboot.  Also, it allows HA to kick in and handle the failure correctly.
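For anyone curious, the shape of that logic is roughly the following (a simplified sketch of the idea, not the actual patch - the heartbeat path, retry count and sleep interval here are illustrative):

    #!/bin/bash
    # Sketch: retry the heartbeat write with a sleep between attempts; on
    # final failure, stop the agent instead of hard-rebooting the host.
    HB_FILE="/mnt/primary/KVMHA/hb-$(hostname)"   # illustrative path
    MAX_RETRIES=5
    SLEEP_SECS=10

    for i in $(seq 1 "$MAX_RETRIES"); do
        if date +%s > "$HB_FILE" 2>/dev/null; then
            exit 0                    # heartbeat written, nothing to do
        fi
        sleep "$SLEEP_SECS"           # give the storage a chance to recover
    done

    # Still failing: stop the agent so the host goes to Alert/Disconnected,
    # an alert is raised and HA can take over, instead of a hard reboot.
    logger -t heartbeat "heartbeat write failed after $MAX_RETRIES attempts, stopping cloudstack-agent"
    systemctl stop cloudstack-agent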


-----Original Message-----
From: Andrija Panic [mailto:andrija.panic@gmail.com] 
Sent: Tuesday, February 20, 2018 5:16 PM
To: dev <de...@cloudstack.apache.org>
Subject: Re: HA issues

That is good to hear ( no NFS issues causing Agent Disconnect).

I assume you are using a "normal" NFS solution with proper HA and no ZFS (kernel panics etc.), but either way, be aware of this one:

https://github.com/apache/cloudstack/blob/e532b574ddb186a117da638fb6059356fe7c266c/scripts/vm/hypervisor/kvm/kvmheartbeat.sh#L161



We used to comment out this line, because we did have some issues with the communication link, and having it commented out saved our a$$ a few times :)
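From memory, the tail of that script looks something like this (paraphrased - check the linked commit for the exact code); the last line is the one we kept commented out:

    # kvmheartbeat.sh (paraphrased): reached when the heartbeat cannot be
    # written and the script decides the host has to go down
    /usr/bin/logger -t heartbeat "kvmheartbeat.sh rebooted system because it was unable to write the heartbeat to the storage."
    sync &
    sleep 5
    # echo b > /proc/sysrq-trigger   # immediate reboot - the line in question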

Cheers

On 20 February 2018 at 20:50, Sean Lair <sl...@ippathways.com> wrote:

> Hi Andrija
>
> We are currently running XenServer in production.  We are working on 
> moving to KVM and have it deployed in a development environment.
>
> The team is putting CloudStack + KVM through its paces and that is 
> when it was discovered how broken VM HA is in 4.9.3.  Initially our 
> patches fixed VM HA, but they caused VMs to get started on two hosts 
> during failure testing.  The libvirt lockd has solved that issue thus far.
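> In case it helps anyone, enabling lockd boils down to roughly the
> following on each KVM host (a sketch - the lockspace directory is our
> choice and must live on storage shared by all hosts, e.g. NFS primary):
>
>     # enable libvirt's lockd lock manager
>     echo 'lock_manager = "lockd"' >> /etc/libvirt/qemu.conf
>
>     # use indirect leases in a directory every host can see
>     echo 'file_lockspace_dir = "/var/lib/libvirt/lockd/files"' >> /etc/libvirt/qemu-lockd.conf
>
>     systemctl enable --now virtlockd
>     systemctl restart libvirtd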
>
> The short answer to your question :-) is that we were not having problems 
> with Agent Disconnects in a production environment.  It was our testing/QA 
> that revealed the issues.  Our NFS has been stable so far, with no agent 
> crashes/stops that weren't initiated by the team's testing.
>
> Thanks
> Sean
>
>
> -----Original Message-----
> From: Andrija Panic [mailto:andrija.panic@gmail.com]
> Sent: Saturday, February 17, 2018 1:49 PM
> To: dev <de...@cloudstack.apache.org>
> Subject: Re: HA issues
>
> Hi Sean,
>
> (we have 2 threads interleaving on the libvirt lockd...) - so, did you 
> manage to figure out what causes the Agent Disconnects in most cases, 
> for you specifically? Is there any software (CloudStack) root cause 
> (disregarding e.g. networking issues etc.)?
>
> Just our examples, which you hopefully don't have:
>
> We had a CEPH cluster running (with ACS), and there any exception in 
> librbd would crash the JVM and the agent, but this has mostly been fixed. 
> Now we get e.g. an agent disconnect when ACS tries to delete a volume on 
> CEPH and for some reason doesn't succeed within 30 minutes (the volume 
> deletion fails) - then libvirt gets completely stuck (even "virsh list" 
> doesn't work)... so the agent gets disconnected eventually.
>
> It would be good to get rid of agent disconnections in general, obviously 
> :) so that is why I'm asking (you are on NFS, so I would like to hear 
> about your experience here).
>
> Thanks
>
> On 16 February 2018 at 21:52, Sean Lair <sl...@ippathways.com> wrote:
>
> > We were in the same situation as Nux.
> >
> > In our test environment we hit the issue with VMs not getting fenced 
> > and coming up on two hosts because of VM HA.  However, we updated some 
> > of the logic for VM HA and turned on libvirtd's locking mechanism.  
> > Now we are working great w/o IPMI.  The locking stops the VMs from 
> > starting elsewhere, and everything recovers very nicely when the host 
> > starts responding again.
> >
> > We are on 4.9.3 and haven't started testing with 4.11 yet, but it 
> > may work alongside IPMI just fine - it would just affect the fencing.
> > However, we *currently* prefer how we are doing it now, because if 
> > the agent stops responding, but the host is still up, the VMs 
> > continue running and no actual downtime is incurred.  Even when VM 
> > HA attempts to power on the VMs on another host, it just fails the 
> > power-up and the VMs continue to run on the "agent disconnected" 
> > host. The host goes into alarm state and our NOC can look into what 
> > is wrong with the agent on the host. If IPMI were enabled, it sounds like 
> > it would power off the host (fence) and force downtime for us even 
> > if the VMs were actually running OK - and just the agent is unreachable.
> >
> > I plan on submitting our updates via a pull request at some point.
> > But I can also send the updated code to anyone that wants to do some 
> > testing before then.
> >
> > -----Original Message-----
> > From: Marcus [mailto:shadowsor@gmail.com]
> > Sent: Friday, February 16, 2018 11:27 AM
> > To: dev@cloudstack.apache.org
> > Subject: Re: HA issues
> >
> > From your other emails it sounds as though you do not have IPMI 
> > configured, nor host HA enabled, correct? In this case, the correct 
> > thing to do is nothing. If CloudStack cannot guarantee the VM state 
> > (as is the case with an unreachable hypervisor), it should do 
> > nothing, for fear of causing a split brain and corrupting the VM 
> > disk (VM running on two hosts).
> >
> > Clustering and fencing is a tricky proposition. When CloudStack (or 
> > any other cluster manager) is not configured to or cannot guarantee 
> > state then things will simply lock up, in this case your HA VM on 
> > your broken hypervisor will not run elsewhere. This has been the 
> > case for a long time with CloudStack, HA would only start a VM after 
> > the original hypervisor agent came back and reported no VM is running.
> >
> > The new feature, from what I gather, simply adds the possibility of 
> > CloudStack being able to reach out and shut down the hypervisor to 
> > guarantee state. At that point it can start the VM elsewhere. If 
> > something fails in that process (IPMI unreachable, for example, or 
> > bad credentials), you're still going to be stuck with a VM not coming back.
> >
> > It's the nature of the thing. I'd be wary of any HA solution that 
> > does not reach out and guarantee state via host or storage fencing 
> > before starting a VM elsewhere, as it will be making assumptions. 
> > It's entirely possible a VM might be unreachable or unable to access 
> > its storage for a short while, a new instance of the VM is started 
> > elsewhere, and then the original VM comes back.
> >
> > On Wed, Jan 17, 2018 at 9:02 AM Nux! <nu...@li.nux.ro> wrote:
> >
> > > Hi Rohit,
> > >
> > > I've reinstalled and tested. Still no go with VM HA.
> > >
> > > What I did was to kernel panic that particular HV ("echo c > 
> > > /proc/sysrq-trigger" <- this is a proper way to simulate a crash).
> > > What happened next is the HV got marked as "Alert", the VM on it 
> > > was marked as "Running" the whole time, and it was not migrated to another HV.
> > > Once the panicked HV booted back up, the VM rebooted and became available.
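> > > (If anyone wants to reproduce this: after triggering the panic you
> > > can watch what CloudStack thinks is happening from cloudmonkey - a
> > > sketch, adjust the filters to your setup:)
> > >
> > >     list hosts filter=name,state,resourcestate
> > >     list virtualmachines filter=name,state,hostname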
> > >
> > > I'm running CentOS 7 mgmt + HVs, with NFS primary and secondary storage.
> > > The VM has an HA-enabled service offering.
> > > Host HA and OOBM configuration were not touched.
> > >
> > > Full log http://tmp.nux.ro/W3s-management-server.log
> > >
> > > --
> > > Sent from the Delta quadrant using Borg technology!
> > >
> > > Nux!
> > > www.nux.ro
> > >
> > > ----- Original Message -----
> > > > From: "Rohit Yadav" <ro...@shapeblue.com>
> > > > To: "dev" <de...@cloudstack.apache.org>
> > > > Sent: Wednesday, 17 January, 2018 12:13:33
> > > > Subject: Re: HA issues
> > >
> > > > I performed VM HA sanity checks and was not able to reproduce any 
> > > > regression against two KVM CentOS7 hosts in a cluster.
> > > >
> > > >
> > > > Without the "Host HA" feature, I deployed a few HA-enabled VMs on 
> > > > KVM host2 and killed it (powered off). After a few minutes of 
> > > > CloudStack attempting to find out why the host (KVM agent) timed 
> > > > out, CloudStack kicked off investigators, which eventually led the 
> > > > KVM fencers to work, and the VM HA job kicked in to start those few 
> > > > VMs on host1, while KVM host2 was put into the "Down" state.
> > > >
> > > >
> > > > - Rohit
> > > >
> > > >
> > > >
> > > >
> > > > ________________________________
> > > >
> > > >
> > > > From: Rohit Yadav
> > > > Sent: Wednesday, January 17, 2018 2:39:19 PM
> > > > To: dev
> > > > Subject: Re: HA issues
> > > >
> > > >
> > > > Hi Lucian,
> > > >
> > > >
> > > > The "Host HA" feature is entirely different from VM HA; however, 
> > > > they may work in tandem, so please stop using the terms 
> > > > interchangeably, as it may cause the community to believe a 
> > > > regression has been caused.
> > > >
> > > >
> > > > The "Host HA" feature currently ships with only a "Host HA" 
> > > > provider for KVM that is strictly tied to out-of-band management 
> > > > (IPMI for fencing, i.e. power off, and recovery, i.e. reboot) and 
> > > > NFS (as primary storage). (We also have a provider for the 
> > > > simulator, but that's for coverage/testing purposes.)
> > > >
> > > >
> > > > Therefore, "Host HA" for KVM (+NFS) currently works only when 
> > > > OOBM is enabled.
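> > > > For reference, wiring that up per host looks roughly like this from
> > > > cloudmonkey (a sketch - the host id, IPMI address and credentials
> > > > are placeholders):
> > > >
> > > >     # configure + enable out-of-band management (IPMI) for the host
> > > >     configure outofbandmanagement hostid=<host-id> driver=ipmitool address=<bmc-ip> port=623 username=<user> password=<pass>
> > > >     enable outofbandmanagementforhost hostid=<host-id>
> > > >
> > > >     # then configure + enable Host HA with the KVM provider
> > > >     configure haforhost hostid=<host-id> provider=kvmhaprovider
> > > >     enable haforhost hostid=<host-id>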
> > > > The framework allows interested parties to write their own HA 
> > > > providers for a hypervisor that can use a different 
> > > > strategy/mechanism for fencing/recovery of hosts (including writing 
> > > > a non-IPMI based OOBM plugin) and a host/disk activity checker that 
> > > > is non-NFS based.
> > > >
> > > >
> > > > The "Host HA" feature ships disabled by default and does not 
> > > > cause any interference with VM HA. However, when enabled and 
> > > > configured correctly, it is a known limitation that when it is 
> > > > unable to successfully perform recovery or fencing tasks, it may 
> > > > not trigger VM HA. We can discuss how to handle such cases 
> > > > (thoughts?). "Host HA" will try a couple of times to recover and, 
> > > > failing to do so, it will eventually trigger a host fencing task. 
> > > > If it's unable to fence a host, it will indefinitely attempt to 
> > > > fence it (the host will be stuck in the Fencing state in the 
> > > > cloud.ha_config table, for example) and alerts will be sent to the 
> > > > admin, who can do some manual intervention to handle such 
> > > > situations (if you've got email/SMTP enabled, you should see alert 
> > > > emails).
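> > > > (To spot a host stuck like that, something along these lines
> > > > against the management server DB shows the HA state - column names
> > > > from memory, treat it as a sketch:)
> > > >
> > > >     mysql -u cloud -p cloud -e "SELECT resource_id, provider, ha_state FROM ha_config;"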
> > > >
> > > >
> > > > We can discuss how to improve this and find a workaround for the 
> > > > case you've hit, thanks for sharing.
> > > >
> > > >
> > > > - Rohit
> > > >
> > > > ________________________________
> > > > From: Nux! <nu...@li.nux.ro>
> > > > Sent: Tuesday, January 16, 2018 10:42:35 PM
> > > > To: dev
> > > > Subject: Re: HA issues
> > > >
> > > > Ok, reinstalled and re-tested.
> > > >
> > > > What I've learned:
> > > >
> > > > - HA only works now if OOB is configured; the old way HA no 
> > > > longer applies - this can be good and bad, not everyone has IPMIs
> > > >
> > > > - HA only works if IPMI is reachable. I've pulled the cord on a HV 
> > > > and HA failed to do its thing, leaving me with a HV down along with 
> > > > all the VMs running there. That's bad.
> > > > I've opened this ticket for it:
> > > > https://issues.apache.org/jira/browse/CLOUDSTACK-10234
> > > >
> > > > Let me know if you need any extra info or stuff to test.
> > > >
> > > > Regards,
> > > > Lucian
> > > >
> > > > --
> > > > Sent from the Delta quadrant using Borg technology!
> > > >
> > > > Nux!
> > > > www.nux.ro
> > > >
> > > > ----- Original Message -----
> > > >> From: "Nux!" <nu...@li.nux.ro>
> > > >> To: "dev" <de...@cloudstack.apache.org>
> > > >> Sent: Tuesday, 16 January, 2018 11:35:58
> > > >> Subject: Re: HA issues
> > > >
> > > >> I'll reinstall my setup and try again, just to be sure I'm 
> > > >> working on a clean slate.
> > > >>
> > > >> --
> > > >> Sent from the Delta quadrant using Borg technology!
> > > >>
> > > >> Nux!
> > > >> www.nux.ro
> > > >>
> > > >> ----- Original Message -----
> > > >>> From: "Rohit Yadav" <ro...@shapeblue.com>
> > > >>> To: "dev" <de...@cloudstack.apache.org>
> > > >>> Sent: Tuesday, 16 January, 2018 11:29:51
> > > >>> Subject: Re: HA issues
> > > >>
> > > >>> Hi Lucian,
> > > >>>
> > > >>>
> > > >>> If you're talking about the new HostHA feature (with 
> > > >>> KVM+nfs+ipmi), please refer to the following docs:
> > > >>>
> > > >>> http://docs.cloudstack.apache.org/projects/cloudstack-administration/en/latest/hosts.html#out-of-band-management
> > > >>>
> > > >>> https://cwiki.apache.org/confluence/display/CLOUDSTACK/Host+HA
> > > >>>
> > > >>>
> > > >>> We'll need you to look at the logs and perhaps create a JIRA 
> > > >>> ticket with the logs and details? If you saw an IPMI based 
> > > >>> reboot, then host-ha indeed tried to recover, i.e. reboot the 
> > > >>> host; once hostha has done its work it will schedule HA for the 
> > > >>> VMs as soon as the recovery operation succeeds (we have simulator 
> > > >>> and KVM based marvin tests for such scenarios).
> > > >>>
> > > >>>
> > > >>> Can you see it making an attempt to schedule VM HA in the logs, 
> > > >>> or any failure?
> > > >>>
> > > >>>
> > > >>> - Rohit
> > > >>>
> > > >>>
> > > >>>
> > > >>>
> > > >>> ________________________________
> > > >>> From: Nux! <nu...@li.nux.ro>
> > > >>> Sent: Tuesday, January 16, 2018 12:47:56 AM
> > > >>> To: dev
> > > >>> Subject: [4.11] HA issues
> > > >>>
> > > >>> Hi,
> > > >>>
> > > >>> I see there's a new HA engine for KVM and IPMI support, which 
> > > >>> is really nice; however, it seems hit and miss.
> > > >>> I have created an instance with an HA offering and kernel 
> > > >>> panicked one of the hypervisors - after a while the server was 
> > > >>> rebooted, probably via IPMI, but the instance never moved to a 
> > > >>> running hypervisor, and even after the original hypervisor came 
> > > >>> back it was still left in Stopped state.
> > > >>> Are there any extra things I need to set up to get proper HA?
> > > >>>
> > > >>> Regards,
> > > >>> Lucian
> > > >>>
> > > >>> --
> > > >>> Sent from the Delta quadrant using Borg technology!
> > > >>>
> > > >>> Nux!
> > > >>> www.nux.ro
> > > >>>
> > >
> >
>
>
>
> --
>
> Andrija Panić
>



-- 

Andrija Panić