
Posted to dev@cloudstack.apache.org by Andrei Mikhailovsky <an...@arhont.com> on 2015/10/09 11:21:32 UTC

slow nfs = reboot all hosts (((

Hello

My issue is that whenever my NFS server becomes slow to respond, ACS just bloody reboots ALL host servers, not just the ones running VMs with volumes attached to the slow NFS server. Recently, I decided to remove some old snapshots to free up disk space. I deleted about a dozen snapshots and monitored the NFS server for progress. At no point did the NFS server lose connectivity; it just became a bit slow under load. By slow I mean I was still able to list files on the NFS mount point and the ssh session was still working okay. It was just taking a few more seconds to respond to NFS file listings, creation, deletion, etc. However, the ACS agent rebooted every single host server, killing all running guests and system VMs. In my case, only two guests have volumes on the NFS server. The rest of the VMs run off RBD storage. Yet all host servers were rebooted, even those which were not running guests with NFS volumes.

Ever since I started using ACS, it has been pretty dumb at correctly determining whether the NFS storage is still alive. I would say it has done the maniac reboot-everything type of behaviour at least 5 times in the past 3 years. So, in previous versions of ACS I just modified kvmheartbeat.sh and hashed out the line with "reboot", as these reboots were just pissing everyone off.

After upgrading to ACS 4.5.x, that script has no reboot command, and I was wondering whether it is still possible to instruct the kvmheartbeat script not to reboot the host servers?

Thanks for your advice.

Andrei

Re: slow nfs = reboot all hosts (((

Posted by Andrija Panic <an...@gmail.com>.

Ah sorry you already use this approach...

Re: slow nfs = reboot all hosts (((

Posted by Nux! <nu...@li.nux.ro>.

Ok, I'm gonna make a bit of noise about this. Hope you guys will chip in so we can make some progress re HA in future versions.

--
Sent from the Delta quadrant using Borg technology!

Nux!
www.nux.ro


Re: slow nfs = reboot all hosts (((

Posted by Simon Weller <sw...@ena.com>.

Andrei,

In a failure scenario you want to get rid of that problematic server as quickly as possible. Effectively, this action is fencing the host in question.

Nux brought up a good point earlier in this thread: ultimately we need to figure out a much better way of handling KVM failure conditions. The current 'wait until it comes back up' is very much a flawed approach, and something we've been thinking about internally a lot lately.

In your case, it sounds like you might need to separate your NFS storage for primary and secondary to avoid saturating the primary storage and causing a case where the agent believes that the primary NFS is unresponsive.

We've certainly run into situations previously where the I/O wait state was too high on some iSCSI-connected hosts and we saw nodes being shot due to access times. Our approach to fixing that was to reduce the number of VMs being run on those hosts and move to higher-speed connectivity between the hosts and our storage (i.e. FC, 10Gb Ethernet).

- Si


Re: slow nfs = reboot all hosts (((

Posted by Andrei Mikhailovsky <an...@arhont.com>.

I think there should be as much REISUB as possible when trying to reboot a broken server. Doing only the last B bit is a bit dangerous imho.
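
To make the REISUB idea concrete, here is a minimal shell sketch, not the actual CloudStack script. It assumes that only the S (sync), U (remount read-only) and B (reboot) steps are usable from a script: the E and I steps signal every process except init, which would include the script itself. The function name sync_then_reboot, the 2 second pause and the dry-run default are all made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: walk the S-U-B part of the SysRq sequence.
# By default it only prints what it would do; pass "run" as the first
# argument to really write the keys to the trigger file (root only).
sync_then_reboot() {
    mode="${1:-dry}"
    trigger="${2:-/proc/sysrq-trigger}"
    pause="${3:-2}"   # seconds between keys; an arbitrary illustrative value
    for key in s u b; do
        if [ "$mode" = "run" ]; then
            echo "$key" > "$trigger"
            sleep "$pause"
        else
            echo "would write '$key' to $trigger"
        fi
    done
}

# Safe demonstration: nothing is written anywhere.
sync_then_reboot dry
```

Whether the extra seconds spent on sync and remount are acceptable when a host is being fenced is a separate question that this sketch does not answer.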

Andrei 

Re: slow nfs = reboot all hosts (((

Posted by Nux! <nu...@li.nux.ro>.

Andrei,

Yes, that command will just reboot without flushing anything to disk, like cutting power.
It is done this way because many servers are slow to respond to a normal reboot command under load, if they respond at all; this could lead to corrupted data and so on.
The sysrq switch is a much better choice from this pov.

We really need to look at a proper way of doing HA with KVM.

--
Sent from the Delta quadrant using Borg technology!

Nux!
www.nux.ro


Re: slow nfs = reboot all hosts (((

Posted by Andrei Mikhailovsky <an...@arhont.com>.

Thanks guys, I am not sure how I missed that. Probably the coffee hadn't kicked in yet )))

Anyway, am I right in saying that the host server reboot is now forced, without stopping services or unmounting filesystems that may hold open and unsynced data?

Isn't this rather bad and dangerous to perform simply because one of possibly many NFS servers is slow or unresponsive? Not only that, the heartbeat also reboots servers that are not running VMs with NFS volumes. In my case it just rebooted every single host server.

Very worrying indeed. 

Andrei 



Re: slow nfs = reboot all hosts (((

Posted by Nux! <nu...@li.nux.ro>.

Hello,

Instead of commenting out 'echo b > /proc/sysrq-trigger' and also disabling your HA at the same time, perhaps there's a way to tweak the timeouts to be more generous with lazy NFS servers.

Can you go through the logs and see what is happening before the reboot? I am not sure exactly which timeout the script cares about; it is worth investigating.
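
As a starting point for that investigation, the sketch below checks whether a mount point answers a metadata lookup within a given number of seconds. It is not what kvmheartbeat.sh does; probe_mount is a hypothetical helper, the 10 second default is arbitrary, and /tmp stands in for a real NFS mount point. It relies only on the coreutils timeout and stat commands.

```shell
#!/bin/sh
# Hypothetical helper: return 0 if a stat on the mount point completes
# within the limit, non-zero if it fails or takes too long.
probe_mount() {
    mountpoint="$1"
    limit="${2:-10}"   # seconds; an arbitrary illustrative default
    timeout "$limit" stat "$mountpoint" >/dev/null 2>&1
}

# Demonstration against /tmp; substitute the real NFS mount point.
if probe_mount /tmp 5; then
    echo "/tmp responded within 5s"
else
    echo "/tmp did not respond within 5s"
fi
```

One caveat: on a hard-mounted NFS share a stuck stat may sit in uninterruptible sleep, in which case timeout may not be able to end it promptly, so treat the result as a rough indicator only.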

Lucian

--
Sent from the Delta quadrant using Borg technology!

Nux!
www.nux.ro


Re: slow nfs = reboot all hosts (((

Posted by Andrija Panic <an...@gmail.com>.

I managed this problem the following way:
http://admintweets.com/cloudstack-disable-agent-rebooting-kvm-host/

Cheers

Re: slow nfs = reboot all hosts (((

Posted by Wei ZHOU <us...@gmail.com>.

In 4.5, it has been changed from 'reboot' to 'echo b > /proc/sysrq-trigger'.
You can hash out that line and test it.
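
One way to hash out that line is sketched below. This is only an illustration: disable_sysrq_reboot is a hypothetical helper, the location of kvmheartbeat.sh is left as an argument because it varies by install, and the demonstration runs against a throwaway file rather than a real host script. As noted elsewhere in this thread, commenting the line out also disables the fencing that HA relies on.

```shell
#!/bin/sh
# Hypothetical helper: comment out every active line that writes to
# /proc/sysrq-trigger in the given script, keeping a .bak copy.
disable_sysrq_reboot() {
    script="$1"
    [ -f "$script" ] || { echo "no such file: $script" >&2; return 1; }
    # Skip lines that are already commented out.
    sed -i.bak '/^[[:space:]]*#/!s|^\(.*sysrq-trigger.*\)$|#\1|' "$script"
}

# Demonstration on a temporary file standing in for kvmheartbeat.sh.
demo=$(mktemp)
printf '%s\n' 'echo "fencing"' 'echo b > /proc/sysrq-trigger' > "$demo"
disable_sysrq_reboot "$demo"
cat "$demo"
```

The .bak copy makes it easy to restore the original behaviour after testing.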
