Posted to users@cloudstack.apache.org by Dean Kamali <de...@gmail.com> on 2013/07/03 20:13:35 UTC

Primary storage failure

Hello everyone

I'm testing failure scenarios, and I have noticed that as soon as the
primary storage goes offline, the CloudStack management server seems to
think that the hypervisor is not responding and reboots the node. If you
have a number of nodes, it will eventually reboot all of them (losing
everything... fun!).

What if I have more than one primary storage and one of them fails? Will it
reboot all of my hypervisors? That doesn't seem right to me.

Is there a way to control this behavior?

It seems that the CloudStack management server needs to be a little smarter.

Re: Primary storage failure

Posted by Dean Kamali <de...@gmail.com>.
Thanks for your help :)



Re: Primary storage failure

Posted by Geoff Higginbottom <ge...@shapeblue.com>.
Hi Dean,

This change will not affect HA failover in the event of a Host failure.

Regards

Geoff Higginbottom
CTO / Cloud Architect


D: +44(0)20 3603 0542 | S: +44(0)20 3603 0540 | M: +44(0)7968161581

geoff.higginbottom@shapeblue.com | www.shapeblue.com

ShapeBlue Ltd, 53 Chandos Place, Covent Garden, London, WC2N 4HS




This email and any attachments to it may be confidential and are intended solely for the use of the individual to whom it is addressed. Any views or opinions expressed are solely those of the author and do not necessarily represent those of Shape Blue Ltd or related companies. If you are not the intended recipient of this email, you must neither take any action based upon its contents, nor copy or show it to anyone. Please contact the sender if you believe you have received this email in error. Shape Blue Ltd is a company incorporated in England & Wales. ShapeBlue Services India LLP is operated under license from Shape Blue Ltd. ShapeBlue is a registered trademark.

Re: Primary storage failure

Posted by Dean Kamali <de...@gmail.com>.
Geoff, thanks for your help. I'm just wondering whether this change will
have any impact on the HA operations that CloudStack offers for HA instances
(if one of the nodes dies, the VM restarts on another node).

Thanks again for your help



Re: Primary storage failure

Posted by France <ma...@isg.si>.
I've submitted a bug:
https://issues.apache.org/jira/browse/CLOUDSTACK-3367

On 3/7/13 8:39 PM, David Nalley wrote:
> This warrants a bug IMO.


Re: Primary storage failure

Posted by Geoff Higginbottom <ge...@shapeblue.com>.
Hi David,

I recall reading somewhere a while back that they were considering including this as an option via Global Settings, where you could turn it on or off, but I'm not sure if it ever got anywhere.

Regards

Geoff Higginbottom
CTO / Cloud Architect


D: +44(0)20 3603 0542 | S: +44(0)20 3603 0540 | M: +44(0)7968161581

geoff.higginbottom@shapeblue.com | www.shapeblue.com

ShapeBlue Ltd, 53 Chandos Place, Covent Garden, London, WC2N 4HS




Re: Primary storage failure

Posted by David Nalley <da...@gnsa.us>.
This warrants a bug IMO.

--David


RE: Primary storage failure

Posted by Geoff Higginbottom <ge...@shapeblue.com>.
Dean,

I am guessing you are using NFS for your Primary Storage.

This is actually 'by design'.  The logic is that if the storage goes offline, then all VMs must have also failed, and a 'forced' reboot of the Host 'might' automatically fix things.

This is great if you only have one Primary Storage, but typically you have more than one, so whilst the reboot might fix the failed storage, it will also kill off all the perfectly good VMs which were still happily running.

The fix for XenServer Hosts is to:

1. Modify /opt/xensource/bin/xenheartbeat.sh on all your Hosts, commenting out the two entries which have "reboot -f"

2. Identify the PID of the running script: pidof -x xenheartbeat.sh

3. Kill the running script so the edited version is picked up: kill <pid>

4. Force reconnect the Host from the UI; the script will then re-launch on reconnect

If you are running KVM, I'm guessing there is a similar script, but I have not tried this yet for anything other than XenServer (it does not apply to ESXi).
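
Steps 1 to 3 above could be scripted roughly as follows. This is only a sketch, not a tested procedure: the script path and the "reboot -f" pattern are taken from the steps above, while the function names disable_forced_reboot and stop_heartbeat are made up for this example. It assumes GNU sed and bash on the Host, keeps a backup copy, and runs a syntax check because commenting out a line can leave an empty if/then body behind.

```shell
#!/bin/bash
# Sketch only. Function names are hypothetical; the "reboot -f" pattern and the
# script name come from the steps in the message above.

# Step 1: comment out every line that contains "reboot -f" and is not
# already a comment. Takes the path of the heartbeat script as its argument.
disable_forced_reboot() {
    local script="$1"
    cp -p "$script" "$script.bak" || return 1   # keep a copy for rollback
    sed -i '/^[[:space:]]*#/! s/^.*reboot -f.*$/#&/' "$script"
    bash -n "$script"                           # non-zero if the edit broke the syntax
}

# Steps 2 and 3: find the running copy of the script and kill it.
# It is expected to be relaunched when the Host reconnects (step 4).
stop_heartbeat() {
    local pid
    for pid in $(pidof -x xenheartbeat.sh); do
        kill "$pid"
    done
}
```

On a Host this might be used as "disable_forced_reboot /opt/xensource/bin/xenheartbeat.sh && stop_heartbeat", followed by the force reconnect from the UI (step 4). If the syntax check fails, the original is still available as xenheartbeat.sh.bak.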

Regards

Geoff Higginbottom

D: +44 20 3603 0542 | S: +44 20 3603 0540 | M: +447968161581

geoff.higginbottom@shapeblue.com

