Posted to users@cloudstack.apache.org by Andrija Panic <an...@gmail.com> on 2014/11/14 17:07:28 UTC

Automatic KVM host reboot on Primary Storage failure

Hi guys,

I'm wondering why there is a check
inside /usr/share/cloudstack-common/scripts/vm/hypervisor/kvm/kvmheartbeat.sh
?

I understand that the KVM host checks availability of Primary Storage, and
reboots itself if it can't write to storage.
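
From reading it, I understand the idea is roughly like the sketch below (my
own simplified illustration, not the actual contents of kvmheartbeat.sh - the
mount point, file name and timeout are made up):

    #!/bin/bash
    # Purely illustrative: stamp a heartbeat file on the primary storage
    # mount, and reboot the host if the write does not succeed in time.
    MOUNTPOINT=/mnt/primary                   # hypothetical NFS mount point
    HB_FILE=$MOUNTPOINT/KVMHA/hb-$(hostname)

    if ! timeout 60 bash -c "date +%s > $HB_FILE"; then
        logger "heartbeat write to $HB_FILE failed, fencing host by reboot"
        sync
        reboot -f
    fi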

But if we have, say, 3 NFS primary storages in a cluster and a lot of KVM
hosts, then 1 primary storage going down (server crashing or whatever) will
probably bring 99% of the KVM hosts down for a reboot?
So instead of losing uptime for 1/3 of my VMs (1 storage out of 3), I
lose uptime for 99%-100% of my VMs?

I manually edited this script to disable reboots - but why is it there in
the first place?
It doesn't make sense to me - unless I'm missing a point (probably)...

Thanks,
-- 

Andrija Panić

Re: Automatic KVM host reboot on Primary Storage failure

Posted by Andrija Panic <an...@gmail.com>.
Hi Marcus, thanks for explaining.

Maybe a side question: "like storage/host tags to guarantee each host only
uses one NFS" - what do you mean by this? That is, how would you implement
this? I know of tags, but I only know how to make sure certain Compute/Disk
offerings use certain hosts / storage pools.

Not sure how to make specific hosts use specific NFS storages...?
Thanks anyway,
Andrija

On 14 November 2014 18:18, Marcus <sh...@gmail.com> wrote:

> It is there (I believe) because CloudStack is acting as a cluster manager
> for KVM. It is using NFS to determine if the host is 'alive' on the network,
> and if it is not, the host reboots itself to avoid a split-brain scenario
> where VMs start coming up on other hosts while they are still running on
> this host. It generally works if the problem is the host, but as you
> point out, there's a situation where the problem can be the NFS server.
> This is fairly rare for enterprise NFS with high availability, but there are
> a fair number of people who have NFS on servers that are relatively low
> availability (non-clustered, or get overloaded and unresponsive).
>
> There's plenty of room for improvement in that script; I agree the original
> implementation seems fairly rudimentary, but we have to be careful in
> thinking about all scenarios and make sure there's no chance of split
> brain. In the meantime, one could also partition the resources so that
> you have more clusters and only one primary storage per cluster (or
> something else, like storage/host tags to guarantee each host only uses one
> NFS).
>
> On Fri, Nov 14, 2014 at 8:07 AM, Andrija Panic <an...@gmail.com>
> wrote:
>
> > Hi guys,
> >
> > I'm wondering why there is a check
> > inside
> > /usr/share/cloudstack-common/scripts/vm/hypervisor/kvm/kvmheartbeat.sh
> > ?
> >
> > I understand that the KVM host checks availability of Primary Storage,
> > and reboots itself if it can't write to storage.
> >
> > But if we have, say, 3 NFS primary storages in a cluster and a lot of
> > KVM hosts, then 1 primary storage going down (server crashing or
> > whatever) will probably bring 99% of the KVM hosts down for a reboot?
> > So instead of losing uptime for 1/3 of my VMs (1 storage out of 3), I
> > lose uptime for 99%-100% of my VMs?
> >
> > I manually edited this script to disable reboots - but why is it there
> > in the first place?
> > It doesn't make sense to me - unless I'm missing a point (probably)...
> >
> > Thanks,
> > --
> >
> > Andrija Panić
> >
>



--


Re: Automatic KVM host reboot on Primary Storage failure

Posted by Marcus <sh...@gmail.com>.
It is there (I believe) because CloudStack is acting as a cluster manager
for KVM. It is using NFS to determine if the host is 'alive' on the network,
and if it is not, the host reboots itself to avoid a split-brain scenario
where VMs start coming up on other hosts while they are still running on
this host. It generally works if the problem is the host, but as you
point out, there's a situation where the problem can be the NFS server.
This is fairly rare for enterprise NFS with high availability, but there are
a fair number of people who have NFS on servers that are relatively low
availability (non-clustered, or get overloaded and unresponsive).
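
Conceptually, the heartbeat file on shared storage gives you both halves of
the fencing decision - something like the sketch below (just an illustration
of the idea, not the actual CloudStack code; paths and thresholds are
invented):

    # On the host itself: keep stamping a heartbeat file on the shared
    # primary storage, and self-fence (reboot) if the stamp cannot be written.
    date +%s > /mnt/primary/KVMHA/hb-host1 || reboot -f

    # On the deciding side: before restarting host1's VMs elsewhere, check
    # whether host1's heartbeat stamp on the shared storage has gone stale.
    now=$(date +%s)
    last=$(cat /mnt/primary/KVMHA/hb-host1 2>/dev/null || echo 0)
    if [ $((now - last)) -gt 180 ]; then
        echo "host1 heartbeat stale - safe to start its VMs on another host"
    else
        echo "host1 still heartbeating - starting its VMs elsewhere risks split brain"
    fi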

There's plenty of room for improvement in that script; I agree the original
implementation seems fairly rudimentary, but we have to be careful in
thinking about all scenarios and make sure there's no chance of split
brain. In the meantime, one could also partition the resources so that
you have more clusters and only one primary storage per cluster (or
something else, like storage/host tags to guarantee each host only uses one
NFS).
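
For the tag variant, the rough recipe would be: give each NFS primary storage
a storage tag, give the hosts you want paired with it a matching host tag, and
create compute/disk offerings that carry both tags, so VMs deployed with those
offerings land only on the tagged hosts and their volumes only on the tagged
NFS. A cloudmonkey sketch (the 'groupA' names and UUIDs are placeholders, and
the exact command syntax may differ by version):

    # Tag NFS "A" and the hosts that should use it (UUIDs are placeholders).
    cloudmonkey update storagepool id=<nfs-A-pool-uuid> tags=groupA
    cloudmonkey update host id=<host1-uuid> hosttags=groupA
    cloudmonkey update host id=<host2-uuid> hosttags=groupA

    # Offerings carrying the matching pair of tags; VMs and volumes created
    # with these should only be placed on "groupA" hosts and storage.
    cloudmonkey create serviceoffering name=groupA-compute \
        displaytext=groupA-compute cpunumber=2 cpuspeed=2000 memory=4096 \
        hosttags=groupA tags=groupA
    cloudmonkey create diskoffering name=groupA-disk \
        displaytext=groupA-disk disksize=20 tags=groupA

That only constrains placement at deploy time, though; as far as I know,
hosts still mount every primary storage in their cluster regardless of tags,
so the one-primary-storage-per-cluster partitioning is the cleaner guarantee.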

On Fri, Nov 14, 2014 at 8:07 AM, Andrija Panic <an...@gmail.com>
wrote:

> Hi guys,
>
> I'm wondering why there is a check
> inside
> /usr/share/cloudstack-common/scripts/vm/hypervisor/kvm/kvmheartbeat.sh
> ?
>
> I understand that the KVM host checks availability of Primary Storage, and
> reboots itself if it can't write to storage.
>
> But if we have, say, 3 NFS primary storages in a cluster and a lot of KVM
> hosts, then 1 primary storage going down (server crashing or whatever)
> will probably bring 99% of the KVM hosts down for a reboot?
> So instead of losing uptime for 1/3 of my VMs (1 storage out of 3), I
> lose uptime for 99%-100% of my VMs?
>
> I manually edited this script to disable reboots - but why is it there in
> the first place?
> It doesn't make sense to me - unless I'm missing a point (probably)...
>
> Thanks,
> --
>
> Andrija Panić
>
