Posted to users@cloudstack.apache.org by J A Y A N T H <ja...@gmail.com> on 2022/01/01 12:28:51 UTC

VM High Availability

Hello Users,

I plan to create the exact setup described in
https://github.community/t/apache-cloudstack-high-availability-issues/12778
and would like to know the answer to the VM HA question raised there.

Any help or suggestions are appreciated.

Thanks,
Jayanth

Re: VM High Availability

Posted by Jayanth Reddy <ja...@gmail.com>.

Hello Gabriel,

Thank you for the response. It looks like I'll have to wait until your PR is
merged before going with this architecture. I'll use an NFS primary storage
pool for now.

Regards,
Jayanth


On Mon, Jan 3, 2022 at 9:24 PM Gabriel Bräscher <ga...@gmail.com>
wrote:

> Hello Jayanth,
>
> From what I see in the link, the issue is that you don't have an NFS primary
> storage pool.
> Unfortunately, for now, CloudStack supports KVM HA only when there is an
> NFS primary storage pool.
> In your case, you see the host switching between the “suspect” and
> “degraded” states.
> What happens there is that, without an NFS primary pool, CloudStack cannot
> validate that the NFS pool is not mounted, and therefore cannot ensure that
> there is no activity on the primary storage pool.
> Without that validation, it does not fence the host, in order to avoid data
> corruption.
>
> For example, if the host loses its connection but the VMs on it are still
> running fine, you could trigger data corruption.
> Or, if CloudStack wrongly assumes that the host is down and starts the HA VM
> on another host, it can end up with two running VMs pointing at the same
> root volume disk.
>
> Your question is valid, and the confusion is partly our fault.
> The documentation [1] does not make this clear, so I understand the
> question.
> The following quote from the documentation [1] describes the activity check
> process, which relies on the NFS "heartbeat" script.
>
> > *Activity check operation fails on the resource: Provide a semantic in the
> > activity check protocol to express that an error while performing the
> > activity check and a reason for the failure (e.g. unable to access the NFS
> > mount).*
> >
> > *If the maximum number of activity check attempts has not been exceeded,
> > the activity check will be retried.*
>
> [1]
> https://docs.cloudstack.apache.org/en/latest/adminguide/reliability.html#ha-enabled-hosts
>
> I have a PR open: https://github.com/apache/cloudstack/pull/4978
> It proposes the creation of an agent that checks the KVM node health via
> Libvirt, thus adding a new set of validations that allow KVM HA regardless
> of which storage pool you have running.
> I am going to make a few adjustments to it soon and, hopefully, we can get
> it into the next major release.
>
> Regards,
> Gabriel.

Re: VM High Availability

Posted by Gabriel Bräscher <ga...@gmail.com>.

Hello Jayanth,

From what I see in the link, the issue is that you don't have an NFS primary
storage pool.
Unfortunately, for now, CloudStack supports KVM HA only when there is an
NFS primary storage pool.
In your case, you see the host switching between the “suspect” and
“degraded” states.
What happens there is that, without an NFS primary pool, CloudStack cannot
validate that the NFS pool is not mounted, and therefore cannot ensure that
there is no activity on the primary storage pool.
Without that validation, it does not fence the host, in order to avoid data
corruption.

For example, if the host loses its connection but the VMs on it are still
running fine, you could trigger data corruption.
Or, if CloudStack wrongly assumes that the host is down and starts the HA VM
on another host, it can end up with two running VMs pointing at the same
root volume disk.
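
To make the reasoning above concrete, here is a minimal Python sketch of the
decision. It is only an illustration, not CloudStack's actual code: the
Activity values, the ha_action function, and the action names are all
hypothetical. The point it shows is that a host is fenced only when the
activity check positively confirms that nothing is writing to primary storage.

```python
from enum import Enum


class Activity(Enum):
    """Hypothetical outcomes of a storage activity check."""
    ACTIVE = "active"      # VMs are still writing to primary storage
    INACTIVE = "inactive"  # check confirmed there is no disk activity
    UNKNOWN = "unknown"    # check could not run (e.g. no NFS pool to inspect)


def ha_action(host_reachable: bool, activity: Activity) -> str:
    """Return a hypothetical HA action for a host.

    Fencing (and restarting the VMs elsewhere) is only safe when the
    activity check confirms that the host is no longer using the disks.
    """
    if host_reachable:
        return "none"
    if activity is Activity.INACTIVE:
        return "fence-and-restart"
    if activity is Activity.ACTIVE:
        # The VMs are alive; starting copies elsewhere would put two VMs
        # on the same root volume.
        return "wait"
    # Activity is unknown, so nothing can be validated: do not fence.
    return "hold"
```

With no NFS primary pool, the check in this sketch would always land in the
UNKNOWN branch, which matches the host never being fenced.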

Your question is valid, and the confusion is partly our fault.
The documentation [1] does not make this clear, so I understand the
question.
The following quote from the documentation [1] describes the activity check
process, which relies on the NFS "heartbeat" script.

> *Activity check operation fails on the resource: Provide a semantic in the
> activity check protocol to express that an error while performing the
> activity check and a reason for the failure (e.g. unable to access the NFS
> mount).*
>
> *If the maximum number of activity check attempts has not been exceeded,
> the activity check will be retried.*

[1]
https://docs.cloudstack.apache.org/en/latest/adminguide/reliability.html#ha-enabled-hosts
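
The retry rule in the quoted documentation can be sketched roughly as follows.
This is an assumption-laden illustration in Python, not the CloudStack
implementation: run_activity_checks, its parameters, and the use of OSError
to represent "unable to access the NFS mount" are all hypothetical.

```python
from typing import Callable, List, Optional, Tuple


def run_activity_checks(
    check: Callable[[], bool], max_attempts: int
) -> Tuple[Optional[bool], List[str]]:
    """Run a hypothetical activity check, retrying when the check itself fails.

    `check` returns True if disk activity was seen and False if not; it
    raises OSError when the check cannot be performed at all.
    Returns (result, reasons), where result is None if every attempt failed
    and reasons records why each failed attempt could not complete.
    """
    reasons: List[str] = []
    for attempt in range(1, max_attempts + 1):
        try:
            return check(), reasons
        except OSError as exc:
            # Keep the failure reason, then retry while attempts remain.
            reasons.append(f"attempt {attempt}: {exc}")
    return None, reasons
```

A result of None here means the check never completed, which is different from
a completed check that saw no activity.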

I have a PR open: https://github.com/apache/cloudstack/pull/4978
It proposes the creation of an agent that checks the KVM node health via
Libvirt, thus adding a new set of validations that allow KVM HA regardless
of which storage pool you have running.
I am going to make a few adjustments to it soon and, hopefully, we can get
it into the next major release.
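
To give an idea of what a libvirt-based check can look like, here is a small
Python sketch using the libvirt-python bindings. It is not the design in the
PR: node_health and probe_local_node are hypothetical names, and the
"qemu:///system" URI assumes a local libvirtd on the KVM host. It only asks
libvirt whether the connection is alive and which VMs are active, with no
dependency on the storage pool type.

```python
def node_health(conn):
    """Summarize KVM node health from a libvirt connection object.

    `conn` only needs isAlive() and listAllDomains(), as provided by
    libvirt.virConnect, so it can be replaced by a stub when testing.
    """
    if not conn.isAlive():
        return {"libvirt_alive": False, "running_vms": []}
    running = sorted(d.name() for d in conn.listAllDomains() if d.isActive())
    return {"libvirt_alive": True, "running_vms": running}


def probe_local_node(uri="qemu:///system"):
    """Query the local hypervisor; requires libvirt-python and libvirtd."""
    import libvirt  # imported lazily so node_health works without it

    conn = libvirt.openReadOnly(uri)
    try:
        return node_health(conn)
    finally:
        conn.close()
```

Calling probe_local_node() on a KVM host would return a dictionary with the
connection status and the names of the active VMs.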

Regards,
Gabriel.

On Mon, Jan 3, 2022 at 9:21 AM J A Y A N T H <ja...@gmail.com>
wrote:

> Hello Users,
>
> I plan to create the exact setup described in
> https://github.community/t/apache-cloudstack-high-availability-issues/12778
> and would like to know the answer to the VM HA question raised there.
>
> Any help or suggestions are appreciated.
>
> Thanks,
> Jayanth
>