Posted to users@cloudstack.apache.org by Jon Marshall <jm...@hotmail.co.uk> on 2018/03/27 08:18:54 UTC

Failover for VMs

After 3 weeks of trying multiple different setups I still have not managed to get a VM to fail over between compute nodes, and I am running out of ideas.


I have 3 compute nodes, each with 3 NICs (management, VM traffic, storage), one management node with a single NIC in the management network, and a separate NFS server.


I have tried with and without the new Host HA KVM feature in CS v4.11 because, from what I have read, even without Host HA KVM enabled your VMs should still migrate when you power off or reboot a compute node.


I have tried powering off a compute node, pulling the power lead, and removing the management and NFS network cables, and the management server just carries on as if nothing has happened.


Could someone explain exactly how HA is meant to work so I can look at where it is going wrong?

Re: Failover for VMs

Posted by Jon Marshall <jm...@hotmail.co.uk>.

Paul


I did some more testing today and am not sure what some of the states mean.


The first test was the easiest, i.e. "echo c > /proc/sysrq-trigger", which crashes the server. In my setup the VMs on the crashed node never migrate because the server reboots and comes back up before CS tries to migrate any VMs. It takes approximately 4 minutes for the server to recover.


For the next tests I did a hard reset on the server and then modified the timers -


I did 4 tests, and the quickest I got the VMs to fail over was approximately five and a half minutes (see below for test details).
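
Timings like these can be taken by hand, but a small poller makes repeat runs easier to compare. The following is only a sketch of one way to do it, not something used in the tests in this thread: `measure_outage` and `ping_probe` are hypothetical helper names, the address in the comment is a placeholder, and the `ping -c 1 -W 1` flags assume the Linux iputils version of ping.

```python
import subprocess
import time


def ping_probe(address):
    """Return True if a single ping to address succeeds (assumes Linux iputils ping)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "1", address],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def measure_outage(probe, interval=1.0, clock=time.monotonic, sleep=time.sleep):
    """Poll probe() until it has failed and then succeeded again.

    Returns the seconds between the first failed probe and the next
    successful one. clock and sleep are injectable so the loop can be
    exercised without waiting in real time.
    """
    down_since = None
    while True:
        alive = probe()
        now = clock()
        if not alive and down_since is None:
            down_since = now          # first missed probe: outage starts
        elif alive and down_since is not None:
            return now - down_since   # VM answers again: outage over
        sleep(interval)


# Hypothetical usage against a guest VM (placeholder address):
#   downtime = measure_outage(lambda: ping_probe("192.0.2.10"))
```

The result is only as precise as the polling interval, and it measures when the guest answers pings again, which may be slightly later than the moment CS reports the VM as migrated.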


So I have two questions really from all this -


1) Why does it go from Suspect to Degraded and back to Suspect once I start changing timers? According to the docs, Degraded means a successful activity check, but the server was down so it cannot have passed. Noticeably, without modifying any timers it never goes to Degraded at all.


2) What, in your experience, is a sensible failover time?


Thanks for any help you can give.


Tests -


1)  default timers -

0:00 Suspect
9:00 recovery/Fenced
10:15 VM migrated

2)  kvm.ha.activity.check.max.attempts = 3 (default = 10)

0:00 Suspect
2:00 Degraded
7:00 Suspect
9:00 Recovery/Fenced
10:20 VM migrated

3)  kvm.ha.activity.check.max.attempts = 3 (default = 10)
    kvm.ha.degraded.max.period = 120 seconds (default = 300)

0:00 Suspect
2:00 Degraded
4:00 Suspect
6:00 Checking/Fenced
7:21 VM migrated

4)  kvm.ha.activity.check.max.attempts = 3 (default = 10)
    kvm.ha.degraded.max.period = 120 seconds (default = 300)
    kvm.ha.activity.check.interval = 30 seconds (default = 60)

0:00 Suspect
1:10 Degraded
3:10 Suspect
4:20 Recovering/Fenced
5:30 VM migrated
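
To make the four timelines easier to compare, the elapsed times can be turned into seconds spent in each state. This is a minimal sketch that only re-does the arithmetic on the figures listed above; the helper names are made up for illustration and nothing here queries CloudStack.

```python
# Timelines copied from the four tests above: (elapsed m:ss, state reported).
TESTS = {
    1: [("0:00", "Suspect"), ("9:00", "recovery/Fenced"), ("10:15", "VM migrated")],
    2: [("0:00", "Suspect"), ("2:00", "Degraded"), ("7:00", "Suspect"),
        ("9:00", "Recovery/Fenced"), ("10:20", "VM migrated")],
    3: [("0:00", "Suspect"), ("2:00", "Degraded"), ("4:00", "Suspect"),
        ("6:00", "Checking/Fenced"), ("7:21", "VM migrated")],
    4: [("0:00", "Suspect"), ("1:10", "Degraded"), ("3:10", "Suspect"),
        ("4:20", "Recovering/Fenced"), ("5:30", "VM migrated")],
}


def to_seconds(stamp):
    """Convert an 'm:ss' elapsed-time stamp to seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + int(seconds)


def state_durations(timeline):
    """Return (state, seconds spent in it) for every entry except the last."""
    return [
        (state, to_seconds(next_stamp) - to_seconds(stamp))
        for (stamp, state), (next_stamp, _) in zip(timeline, timeline[1:])
    ]


def total_failover_seconds(timeline):
    """Seconds from the first recorded state to the final entry."""
    return to_seconds(timeline[-1][0]) - to_seconds(timeline[0][0])


if __name__ == "__main__":
    for number, timeline in TESTS.items():
        print(f"Test {number}: total {total_failover_seconds(timeline)} s")
        for state, seconds in state_durations(timeline):
            print(f"    {state:<18} {seconds:>4} s")
```

On these figures the host sat in Degraded for 300 seconds in test 2 and for 120 seconds in tests 3 and 4, which matches the kvm.ha.degraded.max.period value in force each time. The sketch says nothing about why the host entered Degraded in the first place.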


________________________________
From: Jon Marshall <jm...@hotmail.co.uk>
Sent: 29 March 2018 09:40
To: users@cloudstack.apache.org
Subject: Re: Failover for VMs

Hi Paul


I did make some progress with this and seem to remember that after it said Recovered it then went back to Suspect and finally Fenced.


I am going to rerun a lot of the tests after changing some of the kvm.ha.* timers to try to speed things up a bit.


Will update here after I have run tests to check if that is what I should be seeing.


Many thanks


Jon


________________________________
From: Paul Angus <pa...@shapeblue.com>
Sent: 28 March 2018 20:01
To: users@cloudstack.apache.org
Subject: Re: Failover for VMs

Ah.

Did you wait after the node said recovered?

That message is spurious; I've seen it too. It should say Recovering at that time.

________________________________
From: Jon Marshall <jm...@hotmail.co.uk>
Sent: Tuesday, 27 March 2018 10:42 am
To: users@cloudstack.apache.org
Subject: Re: Failover for VMs

Just as an update to this before I forget what I did :) -


I used "echo c > /proc/sysrq-trigger" on one of the compute nodes and there was no VM failover. Instead HA reported Suspect and then IPMI rebooted the machine; it came back online and the VM started responding to pings again. IPMI is out of band so that seems to be reasonable behaviour, but it is no use for testing HA.


Next I pulled all 3 NIC cables from the same compute node and again HA reported Suspect. Again IPMI rebooted the node, but then the HA state changed to "Recovered", which I don't understand: the NIC cables were still disconnected, so the VM was not reachable and there was no failover.


I don't understand how it can think the node is recovered when, apart from the IPMI out-of-band connection, there are no network connections to this server.


Finally I pulled the power lead and this time HA went from Suspect to Fencing and then stayed that way. This makes sense, as no power means IPMI cannot reboot the server, so I assume it never moves to Fenced. Again no VM failover.


I am wondering if it is to do with the out-of-band IPMI or the way I have the NICs set up. The management node only has one NIC in the management network but I assume this is okay.


I may try reloading with CS v4.9 and just try failover without the new HA KVM to see if I see anything different.



Jon


________________________________
From: Jon Marshall <jm...@hotmail.co.uk>
Sent: 27 March 2018 10:10
To: users@cloudstack.apache.org
Subject: Re: Failover for VMs


Thanks Paul, will pick up after Easter break.

Doing some more testing with HA KVM at the moment, so I will update this thread with any progress.


________________________________
From: Paul Angus <paul.angus@shapeblue.com>
Sent: 27 March 2018 10:07
To: users@cloudstack.apache.org
Subject: RE: Failover for VMs

Jon,

I've been updating the Ansible to move our physical hosts from CentOS 6 to CentOS 7. Now that's done, I'll run through an HA setup and post answers (probably after the Easter break).

paul.angus@shapeblue.com
www.shapeblue.com
53 Chandos Place, Covent Garden, London WC2N 4HS UK
@shapeblue





Re: Failover for VMs

Posted by Jon Marshall <jm...@hotmail.co.uk>.

Hi Paul


I did make some progress with this and seem to remember that after it said Recovered it then went back to Suspect and finally Fenced.


I am going to rerun a lot of the tests after changing some of the kvm.ha.* timers to try to speed things up a bit.


Will update here after I have run tests to check if that is what I should be seeing.


Many thanks


Jon


Re: Failover for VMs

Posted by Paul Angus <pa...@shapeblue.com>.

Ah.

Did you wait after the node said recovered?

That message is spurious; I've seen it too. It should say Recovering at that time.

Re: Failover for VMs

Posted by Jon Marshall <jm...@hotmail.co.uk>.

OK, I have made significant progress with this and have got Host HA KVM failover working for a number of different scenarios.

Will update this thread with tests run etc. and pick up after Easter as suggested by Paul.

________________________________
From: Jon Marshall <jm...@hotmail.co.uk>
Sent: 27 March 2018 11:24
To: users@cloudstack.apache.org
Subject: Re: Failover for VMs

I am just updating as I continue testing -

When I pulled the power lead as discussed below, the host went from Suspect to Fencing but never got to Fenced. But when I put the power lead back in to the server, CS almost immediately put that server into maintenance mode and then did migrate the VM.

Not sure of the logic but at least I got to see a VM failover :)

Re: Failover for VMs

Posted by Jon Marshall <jm...@hotmail.co.uk>.

I am just updating as I continue testing -


When I pulled the power lead as discussed below, the host went from Suspect to Fencing but never got to Fenced. But when I put the power lead back in to the server, CS almost immediately put that server into maintenance mode and then did migrate the VM.


Not sure of the logic but at least I got to see a VM failover :)


Re: Failover for VMs

Posted by Jon Marshall <jm...@hotmail.co.uk>.

Just as an update to this before I forget what I did :) -


I used "echo c > /proc/sysrq-trigger" on one of the compute nodes and there was no VM failover. Instead HA reported Suspect and then IPMI rebooted the machine; it came back online and the VM started responding to pings again. IPMI is out of band so that seems to be reasonable behaviour, but it is no use for testing HA.


Next I pulled all 3 NIC cables from the same compute node and again HA reported Suspect. Again IPMI rebooted the node, but then the HA state changed to "Recovered", which I don't understand: the NIC cables were still disconnected, so the VM was not reachable and there was no failover.


I don't understand how it can think the node is recovered when, apart from the IPMI out-of-band connection, there are no network connections to this server.


Finally I pulled the power lead and this time HA went from Suspect to Fencing and then stayed that way. This makes sense, as no power means IPMI cannot reboot the server, so I assume it never moves to Fenced. Again no VM failover.


I am wondering if it is to do with the out-of-band IPMI or the way I have the NICs set up. The management node only has one NIC in the management network but I assume this is okay.


I may try reloading with CS v4.9 and just try failover without the new HA KVM to see if I see anything different.



Jon


Re: Failover for VMs

Posted by Jon Marshall <jm...@hotmail.co.uk>.

Thanks Paul, will pick up after Easter break.

Doing some more testing with HA KVM at the moment, so I will update this thread with any progress.

RE: Failover for VMs

Posted by Paul Angus <pa...@shapeblue.com>.

Jon,

I've been updating the Ansible to move our physical hosts from CentOS 6 to CentOS 7. Now that's done, I'll run through an HA setup and post answers (probably after the Easter break).

paul.angus@shapeblue.com 
www.shapeblue.com
53 Chandos Place, Covent Garden, London WC2N 4HS UK
@shapeblue
  
 

