You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@cloudstack.apache.org by Andrei Mikhailovsky <an...@arhont.com> on 2014/03/02 22:17:48 UTC

ALARM - ACS reboots host servers!!!

Hello guys, 


I've recently came across the bug CLOUDSTACK-5429 which has rebooted all of my host servers without properly shutting down the guest vms. I've simply upgraded and rebooted one of the nfs primary storage servers and a few minutes later, to my horror, i've found out that all of my host servers have been rebooted. Is it just me thinking so, or is this bug should be fixed ASAP and should be a blocker for any new ACS release. I mean not only does it cause downtime, but also possible data loss and server corruption. 


Not sure if one can wait for another year or so until 4.4 is out which may or may not have the fix for this serious issue. From the bug report it seems that this problem is assigned to Edison Su. 


Edison, could you please get in touch to discuss a possible temporary fix to this problem as this is causing a number of issues. 


Does anyone know if you can block the reboot request initiated by the cloudstack agent on the OS level? 


Many thanks for any help 


Andrei

Re: ALARM - ACS reboots host servers!!!

Posted by France <ma...@isg.si>.

On Mar 4, 2014, at 3:38 PM, Marcus wrote:

> On Tue, Mar 4, 2014 at 3:34 AM, France <ma...@isg.si> wrote:
>> Hi Marcus and others.
>> 
>> There is no need to kill of the entire hypervisor, if one of the primary
>> storages fail.
>> You just need to kill the VMs and probably disable SR on XenServer, because
>> all other SRs and VMs have no problems.
>> if you kill those, then you can safely start them elsewhere. On XenServer
>> 6.2 you call destroy the VMs which lost access to NFS without any problems.
> 
> That's a great idea, but as already mentioned, it doesn't work in
> practice. You can't kill a VM that is hanging in D state, waiting on
> storage. I also mentioned that it causes problems for libvirt and much
> of the other system not using the storage.

You can on XS 6.2 as tried in in real life and reported by others as well.

> 
>> 
>> If you really want to still kill the entire host and it's VMs in one go, I
>> would suggest live migrating the VMs which have had not lost their storage
>> off first, and then kill those VMs on a stale NFS by doing hard reboot.
>> Additional time, while migrating working VMs, would even give some grace
>> time for NFS to maybe recover. :-)
> 
> You won't be able to live migrate a VM that is stuck in D state, or
> use libvirt to do so if one of its storage pools is unresponsive,
> anyway.
> 

I dont want to live migrate VMs in D state, just the working VMs. Those stuck can die with hypervisor reboot.


>> 
>> Hard reboot to recover from D state of NFS client can also be avoided by
>> using soft mount options.
> 
> As mentioned, soft and intr very rarely actually work, in my
> experience. I wish they did as I truly have come to loathe NFS for it.
> 
>> 
>> I run a bunch of Pacemaker/Corosync/Cman/Heartbeat/etc clusters and we don't
>> just kill whole nodes but fence services from specific nodes. STONITH is
>> implemented only when the node looses the quorum.
> 
> Sure, but how do you fence a KVM host from an NFS server? I don't
> think we've written a firewall plugin that works to fence hosts from
> any NFS server. Regardless, what CloudStack does is more of a poor
> man's clustering, the mgmt server is the locking in the sense that it
> is managing what's going on, but it's not a real clustering service.
> Heck, it doesn't even STONITH, it tries to clean shutdown, which fails
> as well due to hanging NFS (per the mentioned bug, to fix it they'll
> need IPMI fencing or something like that).

In my case as well as in the case of OP, the hypervisor got rebooted successfully.

> 
> I didn't write the code, I'm just saying that I can completely
> understand why it kills nodes when it deems that their storage has
> gone belly-up. It's dangerous to leave that D state VM hanging around,
> and it will until the NFS storage comes back. In a perfect world you'd
> just stop the VMs that were having the issue, or if there were no VMs
> you'd just de-register the storage from libvirt, I agree.

As previously stated on XS 6.2 you can "destroy" VMs with unaccessible NFS storage. I do not remember if processes were in the D state or whatever, cause i used the GUI, if i remember correctly. I am sure, you can test it yourself too.


> 
>> 
>> Regards,
>> F.
>> 
>> 
>> On 3/3/14 5:35 PM, Marcus wrote:
>>> 
>>> It's the standard clustering problem. Any software that does any sort
>>> of avtive clustering is going to fence nodes that have problems, or
>>> should if it cares about your data. If the risk of losing a host due
>>> to a storage pool outage is too great, you could perhaps look at
>>> rearranging your pool-to-host correlations (certain hosts run vms from
>>> certain pools) via clusters. Note that if you register a storage pool
>>> with a cluster, it will register the pool with libvirt when the pool
>>> is not in maintenance, which, when the storage pool goes down will
>>> cause problems for the host even if no VMs from that storage are
>>> running (fetching storage stats for example will cause agent threads
>>> to hang if its NFS), so you'd need to put ceph in its own cluster and
>>> NFS in its own cluster.
>>> 
>>> It's far more dangerous to leave a host in an unknown/bad state. If a
>>> host loses contact with one of your storage nodes, with HA, cloudstack
>>> will want to start the affected VMs elsewhere. If it does so, and your
>>> original host wakes up from it's NFS hang, you suddenly have a VM
>>> running in two locations, corruption ensues. You might think we could
>>> just stop the affected VMs, but NFS tends to make things that touch it
>>> go into D state, even with 'intr' and other parameters, which affects
>>> libvirt and the agent.
>>> 
>>> We could perhaps open a feature request to disable all HA and just
>>> leave things as-is, disallowing operations when there are outages. If
>>> that sounds useful you can create the feature request on
>>> https://issues.apache.org/jira.
>>> 
>>> 
>>> On Mon, Mar 3, 2014 at 5:37 AM, Andrei Mikhailovsky <an...@arhont.com>
>>> wrote:
>>>> 
>>>> Koushik, I understand that and I will put the storage into the
>>>> maintenance mode next time. However, things happen and servers crash from
>>>> time to time, which is not the reason to reboot all host servers, even those
>>>> which do not have any running vms with volumes on the nfs storage. The
>>>> bloody agent just rebooted every single host server regardless if they were
>>>> running vms with volumes on the rebooted nfs server. 95% of my vms are
>>>> running from ceph and those should have never been effected in the first
>>>> place.
>>>> ----- Original Message -----
>>>> 
>>>> From: "Koushik Das" <ko...@citrix.com>
>>>> To: "<us...@cloudstack.apache.org>" <us...@cloudstack.apache.org>
>>>> Cc: dev@cloudstack.apache.org
>>>> Sent: Monday, 3 March, 2014 5:55:34 AM
>>>> Subject: Re: ALARM - ACS reboots host servers!!!
>>>> 
>>>> The primary storage needs to be put in maintenance before doing any
>>>> upgrade/reboot as mentioned in the previous mails.
>>>> 
>>>> -Koushik
>>>> 
>>>> On 03-Mar-2014, at 6:07 AM, Marcus <sh...@gmail.com> wrote:
>>>> 
>>>>> Also, please note that in the bug you referenced it doesn't have a
>>>>> problem with the reboot being triggered, but with the fact that reboot
>>>>> never completes due to hanging NFS mount (which is why the reboot
>>>>> occurs, inaccessible primary storage).
>>>>> 
>>>>> On Sun, Mar 2, 2014 at 5:26 PM, Marcus <sh...@gmail.com> wrote:
>>>>>> 
>>>>>> Or do you mean you have multiple primary storages and this one was not
>>>>>> in use and put into maintenance?
>>>>>> 
>>>>>> On Sun, Mar 2, 2014 at 5:25 PM, Marcus <sh...@gmail.com> wrote:
>>>>>>> 
>>>>>>> I'm not sure I understand. How do you expect to reboot your primary
>>>>>>> storage while vms are running? It sounds like the host is being
>>>>>>> fenced since it cannot contact the resources it depends on.
>>>>>>> 
>>>>>>> On Sun, Mar 2, 2014 at 3:24 PM, Nux! <nu...@li.nux.ro> wrote:
>>>>>>>> 
>>>>>>>> On 02.03.2014 21:17, Andrei Mikhailovsky wrote:
>>>>>>>>> 
>>>>>>>>> Hello guys,
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> I've recently came across the bug CLOUDSTACK-5429 which has rebooted
>>>>>>>>> all of my host servers without properly shutting down the guest vms.
>>>>>>>>> I've simply upgraded and rebooted one of the nfs primary storage
>>>>>>>>> servers and a few minutes later, to my horror, i've found out that
>>>>>>>>> all
>>>>>>>>> of my host servers have been rebooted. Is it just me thinking so, or
>>>>>>>>> is this bug should be fixed ASAP and should be a blocker for any new
>>>>>>>>> ACS release. I mean not only does it cause downtime, but also
>>>>>>>>> possible
>>>>>>>>> data loss and server corruption.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Hi Andrei,
>>>>>>>> 
>>>>>>>> Do you have HA enabled and did you put that primary storage in
>>>>>>>> maintenance
>>>>>>>> mode before rebooting it?
>>>>>>>> It's my understanding that ACS relies on the shared storage to
>>>>>>>> perform HA so
>>>>>>>> if the storage goes it's expected to go berserk. I've noticed similar
>>>>>>>> behaviour in Xenserver pools without ACS.
>>>>>>>> I'd imagine a "cure" for this would be to use network distributed
>>>>>>>> "filesystems" like GlusterFS or CEPH.
>>>>>>>> 
>>>>>>>> Lucian
>>>>>>>> 
>>>>>>>> --
>>>>>>>> Sent from the Delta quadrant using Borg technology!
>>>>>>>> 
>>>>>>>> Nux!
>>>>>>>> www.nux.ro
>>>> 
>>>> 
>>

Re: ALARM - ACS reboots host servers!!!

Posted by France <ma...@isg.si>.

On Mar 4, 2014, at 3:38 PM, Marcus wrote:

> On Tue, Mar 4, 2014 at 3:34 AM, France <ma...@isg.si> wrote:
>> Hi Marcus and others.
>> 
>> There is no need to kill of the entire hypervisor, if one of the primary
>> storages fail.
>> You just need to kill the VMs and probably disable SR on XenServer, because
>> all other SRs and VMs have no problems.
>> if you kill those, then you can safely start them elsewhere. On XenServer
>> 6.2 you call destroy the VMs which lost access to NFS without any problems.
> 
> That's a great idea, but as already mentioned, it doesn't work in
> practice. You can't kill a VM that is hanging in D state, waiting on
> storage. I also mentioned that it causes problems for libvirt and much
> of the other system not using the storage.

You can on XS 6.2 as tried in in real life and reported by others as well.

> 
>> 
>> If you really want to still kill the entire host and it's VMs in one go, I
>> would suggest live migrating the VMs which have had not lost their storage
>> off first, and then kill those VMs on a stale NFS by doing hard reboot.
>> Additional time, while migrating working VMs, would even give some grace
>> time for NFS to maybe recover. :-)
> 
> You won't be able to live migrate a VM that is stuck in D state, or
> use libvirt to do so if one of its storage pools is unresponsive,
> anyway.
> 

I dont want to live migrate VMs in D state, just the working VMs. Those stuck can die with hypervisor reboot.


>> 
>> Hard reboot to recover from D state of NFS client can also be avoided by
>> using soft mount options.
> 
> As mentioned, soft and intr very rarely actually work, in my
> experience. I wish they did as I truly have come to loathe NFS for it.
> 
>> 
>> I run a bunch of Pacemaker/Corosync/Cman/Heartbeat/etc clusters and we don't
>> just kill whole nodes but fence services from specific nodes. STONITH is
>> implemented only when the node looses the quorum.
> 
> Sure, but how do you fence a KVM host from an NFS server? I don't
> think we've written a firewall plugin that works to fence hosts from
> any NFS server. Regardless, what CloudStack does is more of a poor
> man's clustering, the mgmt server is the locking in the sense that it
> is managing what's going on, but it's not a real clustering service.
> Heck, it doesn't even STONITH, it tries to clean shutdown, which fails
> as well due to hanging NFS (per the mentioned bug, to fix it they'll
> need IPMI fencing or something like that).

In my case as well as in the case of OP, the hypervisor got rebooted successfully.

> 
> I didn't write the code, I'm just saying that I can completely
> understand why it kills nodes when it deems that their storage has
> gone belly-up. It's dangerous to leave that D state VM hanging around,
> and it will until the NFS storage comes back. In a perfect world you'd
> just stop the VMs that were having the issue, or if there were no VMs
> you'd just de-register the storage from libvirt, I agree.

As previously stated on XS 6.2 you can "destroy" VMs with unaccessible NFS storage. I do not remember if processes were in the D state or whatever, cause i used the GUI, if i remember correctly. I am sure, you can test it yourself too.


> 
>> 
>> Regards,
>> F.
>> 
>> 
>> On 3/3/14 5:35 PM, Marcus wrote:
>>> 
>>> It's the standard clustering problem. Any software that does any sort
>>> of avtive clustering is going to fence nodes that have problems, or
>>> should if it cares about your data. If the risk of losing a host due
>>> to a storage pool outage is too great, you could perhaps look at
>>> rearranging your pool-to-host correlations (certain hosts run vms from
>>> certain pools) via clusters. Note that if you register a storage pool
>>> with a cluster, it will register the pool with libvirt when the pool
>>> is not in maintenance, which, when the storage pool goes down will
>>> cause problems for the host even if no VMs from that storage are
>>> running (fetching storage stats for example will cause agent threads
>>> to hang if its NFS), so you'd need to put ceph in its own cluster and
>>> NFS in its own cluster.
>>> 
>>> It's far more dangerous to leave a host in an unknown/bad state. If a
>>> host loses contact with one of your storage nodes, with HA, cloudstack
>>> will want to start the affected VMs elsewhere. If it does so, and your
>>> original host wakes up from it's NFS hang, you suddenly have a VM
>>> running in two locations, corruption ensues. You might think we could
>>> just stop the affected VMs, but NFS tends to make things that touch it
>>> go into D state, even with 'intr' and other parameters, which affects
>>> libvirt and the agent.
>>> 
>>> We could perhaps open a feature request to disable all HA and just
>>> leave things as-is, disallowing operations when there are outages. If
>>> that sounds useful you can create the feature request on
>>> https://issues.apache.org/jira.
>>> 
>>> 
>>> On Mon, Mar 3, 2014 at 5:37 AM, Andrei Mikhailovsky <an...@arhont.com>
>>> wrote:
>>>> 
>>>> Koushik, I understand that and I will put the storage into the
>>>> maintenance mode next time. However, things happen and servers crash from
>>>> time to time, which is not the reason to reboot all host servers, even those
>>>> which do not have any running vms with volumes on the nfs storage. The
>>>> bloody agent just rebooted every single host server regardless if they were
>>>> running vms with volumes on the rebooted nfs server. 95% of my vms are
>>>> running from ceph and those should have never been effected in the first
>>>> place.
>>>> ----- Original Message -----
>>>> 
>>>> From: "Koushik Das" <ko...@citrix.com>
>>>> To: "<us...@cloudstack.apache.org>" <us...@cloudstack.apache.org>
>>>> Cc: dev@cloudstack.apache.org
>>>> Sent: Monday, 3 March, 2014 5:55:34 AM
>>>> Subject: Re: ALARM - ACS reboots host servers!!!
>>>> 
>>>> The primary storage needs to be put in maintenance before doing any
>>>> upgrade/reboot as mentioned in the previous mails.
>>>> 
>>>> -Koushik
>>>> 
>>>> On 03-Mar-2014, at 6:07 AM, Marcus <sh...@gmail.com> wrote:
>>>> 
>>>>> Also, please note that in the bug you referenced it doesn't have a
>>>>> problem with the reboot being triggered, but with the fact that reboot
>>>>> never completes due to hanging NFS mount (which is why the reboot
>>>>> occurs, inaccessible primary storage).
>>>>> 
>>>>> On Sun, Mar 2, 2014 at 5:26 PM, Marcus <sh...@gmail.com> wrote:
>>>>>> 
>>>>>> Or do you mean you have multiple primary storages and this one was not
>>>>>> in use and put into maintenance?
>>>>>> 
>>>>>> On Sun, Mar 2, 2014 at 5:25 PM, Marcus <sh...@gmail.com> wrote:
>>>>>>> 
>>>>>>> I'm not sure I understand. How do you expect to reboot your primary
>>>>>>> storage while vms are running? It sounds like the host is being
>>>>>>> fenced since it cannot contact the resources it depends on.
>>>>>>> 
>>>>>>> On Sun, Mar 2, 2014 at 3:24 PM, Nux! <nu...@li.nux.ro> wrote:
>>>>>>>> 
>>>>>>>> On 02.03.2014 21:17, Andrei Mikhailovsky wrote:
>>>>>>>>> 
>>>>>>>>> Hello guys,
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> I've recently came across the bug CLOUDSTACK-5429 which has rebooted
>>>>>>>>> all of my host servers without properly shutting down the guest vms.
>>>>>>>>> I've simply upgraded and rebooted one of the nfs primary storage
>>>>>>>>> servers and a few minutes later, to my horror, i've found out that
>>>>>>>>> all
>>>>>>>>> of my host servers have been rebooted. Is it just me thinking so, or
>>>>>>>>> is this bug should be fixed ASAP and should be a blocker for any new
>>>>>>>>> ACS release. I mean not only does it cause downtime, but also
>>>>>>>>> possible
>>>>>>>>> data loss and server corruption.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Hi Andrei,
>>>>>>>> 
>>>>>>>> Do you have HA enabled and did you put that primary storage in
>>>>>>>> maintenance
>>>>>>>> mode before rebooting it?
>>>>>>>> It's my understanding that ACS relies on the shared storage to
>>>>>>>> perform HA so
>>>>>>>> if the storage goes it's expected to go berserk. I've noticed similar
>>>>>>>> behaviour in Xenserver pools without ACS.
>>>>>>>> I'd imagine a "cure" for this would be to use network distributed
>>>>>>>> "filesystems" like GlusterFS or CEPH.
>>>>>>>> 
>>>>>>>> Lucian
>>>>>>>> 
>>>>>>>> --
>>>>>>>> Sent from the Delta quadrant using Borg technology!
>>>>>>>> 
>>>>>>>> Nux!
>>>>>>>> www.nux.ro
>>>> 
>>>> 
>>

Re: ALARM - ACS reboots host servers!!!

Posted by Marcus <sh...@gmail.com>.

Of course, in the event that the NFS *server* is truly out, users may
prefer that no action is taken, everything stays as-is so that when it
comes back things can continue running. The main issue I can think of
with that is that NFS clients may think things were flushed to disk on
the NFS server that were not (if the NFS server hard crashed) if the
NFS server is running async. Then the VM processes will see corruption
and likely go into read-only and need reboot, anyway. The upside is
that unaffected storage might still run vms, and libvirt/host might
(??) recover when the NFS server comes back. There's also not
currently a way (that I'm aware of) for the agent/host side to ask the
mgmt server questions like 'is the server down, or is it just me?',
but some discovery method like that would be required, I think, since
the agent is what drives the HA right now (which also isn't great).

I just thought of one minor improvement to the situation that might
fix CLOUDSTACK-5429. They could potentially use the sysrq triggers to
force reboot.


On Tue, Mar 4, 2014 at 7:45 AM, Wido den Hollander <wi...@widodh.nl> wrote:
> On 03/04/2014 03:38 PM, Marcus wrote:
>>
>> On Tue, Mar 4, 2014 at 3:34 AM, France <ma...@isg.si> wrote:
>>>
>>> Hi Marcus and others.
>>>
>>> There is no need to kill of the entire hypervisor, if one of the primary
>>> storages fail.
>>> You just need to kill the VMs and probably disable SR on XenServer,
>>> because
>>> all other SRs and VMs have no problems.
>>> if you kill those, then you can safely start them elsewhere. On XenServer
>>> 6.2 you call destroy the VMs which lost access to NFS without any
>>> problems.
>>
>>
>> That's a great idea, but as already mentioned, it doesn't work in
>> practice. You can't kill a VM that is hanging in D state, waiting on
>> storage. I also mentioned that it causes problems for libvirt and much
>> of the other system not using the storage.
>>
>
> Just tuning in here and Marcus is right. If NFS is hanging the processes go
> into status D, both Qemu/KVM and libvirt.
>
> The only remedy at that point to Fence of the host is a reboot, you can't do
> anything with the processes which are blocking.
>
> When you run stuff which only lives in userspace like Ceph with librbd it's
> a different story, but with NFS you are stuck.
>
>
>>>
>>> If you really want to still kill the entire host and it's VMs in one go,
>>> I
>>> would suggest live migrating the VMs which have had not lost their
>>> storage
>>> off first, and then kill those VMs on a stale NFS by doing hard reboot.
>>> Additional time, while migrating working VMs, would even give some grace
>>> time for NFS to maybe recover. :-)
>>
>>
>> You won't be able to live migrate a VM that is stuck in D state, or
>> use libvirt to do so if one of its storage pools is unresponsive,
>> anyway.
>>
>
> Indeed, same issue again. Libivrt COMPLETELY blocks, not just one storage
> pool.
>
>
>>>
>>> Hard reboot to recover from D state of NFS client can also be avoided by
>>> using soft mount options.
>>
>>
>> As mentioned, soft and intr very rarely actually work, in my
>> experience. I wish they did as I truly have come to loathe NFS for it.
>>
>
> Indeed, they almost never work. I've been working with NFS for over 10 years
> now and those damn options have NEVER worked properly.
>
> That's just the downside of having stuff go through kernel space.
>
>
>>>
>>> I run a bunch of Pacemaker/Corosync/Cman/Heartbeat/etc clusters and we
>>> don't
>>> just kill whole nodes but fence services from specific nodes. STONITH is
>>> implemented only when the node looses the quorum.
>>
>>
>> Sure, but how do you fence a KVM host from an NFS server? I don't
>> think we've written a firewall plugin that works to fence hosts from
>> any NFS server. Regardless, what CloudStack does is more of a poor
>> man's clustering, the mgmt server is the locking in the sense that it
>> is managing what's going on, but it's not a real clustering service.
>> Heck, it doesn't even STONITH, it tries to clean shutdown, which fails
>> as well due to hanging NFS (per the mentioned bug, to fix it they'll
>> need IPMI fencing or something like that).
>>
>
> IPMI fencing is something I've been thinking about as well. Would be a great
> benefit for the HA in CloudStack.
>
>
>> I didn't write the code, I'm just saying that I can completely
>> understand why it kills nodes when it deems that their storage has
>> gone belly-up. It's dangerous to leave that D state VM hanging around,
>> and it will until the NFS storage comes back. In a perfect world you'd
>> just stop the VMs that were having the issue, or if there were no VMs
>> you'd just de-register the storage from libvirt, I agree.
>>
>
> de-register won't work either... Libvirt tries a umount which will block as
> well.
>
> Wido
>
>
>>>
>>> Regards,
>>> F.
>>>
>>>
>>> On 3/3/14 5:35 PM, Marcus wrote:
>>>>
>>>>
>>>> It's the standard clustering problem. Any software that does any sort
>>>> of avtive clustering is going to fence nodes that have problems, or
>>>> should if it cares about your data. If the risk of losing a host due
>>>> to a storage pool outage is too great, you could perhaps look at
>>>> rearranging your pool-to-host correlations (certain hosts run vms from
>>>> certain pools) via clusters. Note that if you register a storage pool
>>>> with a cluster, it will register the pool with libvirt when the pool
>>>> is not in maintenance, which, when the storage pool goes down will
>>>> cause problems for the host even if no VMs from that storage are
>>>> running (fetching storage stats for example will cause agent threads
>>>> to hang if its NFS), so you'd need to put ceph in its own cluster and
>>>> NFS in its own cluster.
>>>>
>>>> It's far more dangerous to leave a host in an unknown/bad state. If a
>>>> host loses contact with one of your storage nodes, with HA, cloudstack
>>>> will want to start the affected VMs elsewhere. If it does so, and your
>>>> original host wakes up from it's NFS hang, you suddenly have a VM
>>>> running in two locations, corruption ensues. You might think we could
>>>> just stop the affected VMs, but NFS tends to make things that touch it
>>>> go into D state, even with 'intr' and other parameters, which affects
>>>> libvirt and the agent.
>>>>
>>>> We could perhaps open a feature request to disable all HA and just
>>>> leave things as-is, disallowing operations when there are outages. If
>>>> that sounds useful you can create the feature request on
>>>> https://issues.apache.org/jira.
>>>>
>>>>
>>>> On Mon, Mar 3, 2014 at 5:37 AM, Andrei Mikhailovsky <an...@arhont.com>
>>>> wrote:
>>>>>
>>>>>
>>>>> Koushik, I understand that and I will put the storage into the
>>>>> maintenance mode next time. However, things happen and servers crash
>>>>> from
>>>>> time to time, which is not the reason to reboot all host servers, even
>>>>> those
>>>>> which do not have any running vms with volumes on the nfs storage. The
>>>>> bloody agent just rebooted every single host server regardless if they
>>>>> were
>>>>> running vms with volumes on the rebooted nfs server. 95% of my vms are
>>>>> running from ceph and those should have never been effected in the
>>>>> first
>>>>> place.
>>>>> ----- Original Message -----
>>>>>
>>>>> From: "Koushik Das" <ko...@citrix.com>
>>>>> To: "<us...@cloudstack.apache.org>" <us...@cloudstack.apache.org>
>>>>> Cc: dev@cloudstack.apache.org
>>>>> Sent: Monday, 3 March, 2014 5:55:34 AM
>>>>> Subject: Re: ALARM - ACS reboots host servers!!!
>>>>>
>>>>> The primary storage needs to be put in maintenance before doing any
>>>>> upgrade/reboot as mentioned in the previous mails.
>>>>>
>>>>> -Koushik
>>>>>
>>>>> On 03-Mar-2014, at 6:07 AM, Marcus <sh...@gmail.com> wrote:
>>>>>
>>>>>> Also, please note that in the bug you referenced it doesn't have a
>>>>>> problem with the reboot being triggered, but with the fact that reboot
>>>>>> never completes due to hanging NFS mount (which is why the reboot
>>>>>> occurs, inaccessible primary storage).
>>>>>>
>>>>>> On Sun, Mar 2, 2014 at 5:26 PM, Marcus <sh...@gmail.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>> Or do you mean you have multiple primary storages and this one was
>>>>>>> not
>>>>>>> in use and put into maintenance?
>>>>>>>
>>>>>>> On Sun, Mar 2, 2014 at 5:25 PM, Marcus <sh...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> I'm not sure I understand. How do you expect to reboot your primary
>>>>>>>> storage while vms are running? It sounds like the host is being
>>>>>>>> fenced since it cannot contact the resources it depends on.
>>>>>>>>
>>>>>>>> On Sun, Mar 2, 2014 at 3:24 PM, Nux! <nu...@li.nux.ro> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 02.03.2014 21:17, Andrei Mikhailovsky wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Hello guys,
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I've recently came across the bug CLOUDSTACK-5429 which has
>>>>>>>>>> rebooted
>>>>>>>>>> all of my host servers without properly shutting down the guest
>>>>>>>>>> vms.
>>>>>>>>>> I've simply upgraded and rebooted one of the nfs primary storage
>>>>>>>>>> servers and a few minutes later, to my horror, i've found out that
>>>>>>>>>> all
>>>>>>>>>> of my host servers have been rebooted. Is it just me thinking so,
>>>>>>>>>> or
>>>>>>>>>> is this bug should be fixed ASAP and should be a blocker for any
>>>>>>>>>> new
>>>>>>>>>> ACS release. I mean not only does it cause downtime, but also
>>>>>>>>>> possible
>>>>>>>>>> data loss and server corruption.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hi Andrei,
>>>>>>>>>
>>>>>>>>> Do you have HA enabled and did you put that primary storage in
>>>>>>>>> maintenance
>>>>>>>>> mode before rebooting it?
>>>>>>>>> It's my understanding that ACS relies on the shared storage to
>>>>>>>>> perform HA so
>>>>>>>>> if the storage goes it's expected to go berserk. I've noticed
>>>>>>>>> similar
>>>>>>>>> behaviour in Xenserver pools without ACS.
>>>>>>>>> I'd imagine a "cure" for this would be to use network distributed
>>>>>>>>> "filesystems" like GlusterFS or CEPH.
>>>>>>>>>
>>>>>>>>> Lucian
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Sent from the Delta quadrant using Borg technology!
>>>>>>>>>
>>>>>>>>> Nux!
>>>>>>>>> www.nux.ro
>>>>>
>>>>>
>>>>>
>>>
>

Re: ALARM - ACS reboots host servers!!!

Posted by Wido den Hollander <wi...@widodh.nl>.

On 03/04/2014 03:38 PM, Marcus wrote:
> On Tue, Mar 4, 2014 at 3:34 AM, France <ma...@isg.si> wrote:
>> Hi Marcus and others.
>>
>> There is no need to kill of the entire hypervisor, if one of the primary
>> storages fail.
>> You just need to kill the VMs and probably disable SR on XenServer, because
>> all other SRs and VMs have no problems.
>> if you kill those, then you can safely start them elsewhere. On XenServer
>> 6.2 you call destroy the VMs which lost access to NFS without any problems.
>
> That's a great idea, but as already mentioned, it doesn't work in
> practice. You can't kill a VM that is hanging in D state, waiting on
> storage. I also mentioned that it causes problems for libvirt and much
> of the other system not using the storage.
>

Just tuning in here and Marcus is right. If NFS is hanging the processes 
go into status D, both Qemu/KVM and libvirt.

The only remedy at that point to Fence of the host is a reboot, you 
can't do anything with the processes which are blocking.

When you run stuff which only lives in userspace like Ceph with librbd 
it's a different story, but with NFS you are stuck.

>>
>> If you really want to still kill the entire host and it's VMs in one go, I
>> would suggest live migrating the VMs which have had not lost their storage
>> off first, and then kill those VMs on a stale NFS by doing hard reboot.
>> Additional time, while migrating working VMs, would even give some grace
>> time for NFS to maybe recover. :-)
>
> You won't be able to live migrate a VM that is stuck in D state, or
> use libvirt to do so if one of its storage pools is unresponsive,
> anyway.
>

Indeed, same issue again. Libivrt COMPLETELY blocks, not just one 
storage pool.

>>
>> Hard reboot to recover from D state of NFS client can also be avoided by
>> using soft mount options.
>
> As mentioned, soft and intr very rarely actually work, in my
> experience. I wish they did as I truly have come to loathe NFS for it.
>

Indeed, they almost never work. I've been working with NFS for over 10 
years now and those damn options have NEVER worked properly.

That's just the downside of having stuff go through kernel space.

>>
>> I run a bunch of Pacemaker/Corosync/Cman/Heartbeat/etc clusters and we don't
>> just kill whole nodes but fence services from specific nodes. STONITH is
>> implemented only when the node looses the quorum.
>
> Sure, but how do you fence a KVM host from an NFS server? I don't
> think we've written a firewall plugin that works to fence hosts from
> any NFS server. Regardless, what CloudStack does is more of a poor
> man's clustering, the mgmt server is the locking in the sense that it
> is managing what's going on, but it's not a real clustering service.
> Heck, it doesn't even STONITH, it tries to clean shutdown, which fails
> as well due to hanging NFS (per the mentioned bug, to fix it they'll
> need IPMI fencing or something like that).
>

IPMI fencing is something I've been thinking about as well. Would be a 
great benefit for the HA in CloudStack.

> I didn't write the code, I'm just saying that I can completely
> understand why it kills nodes when it deems that their storage has
> gone belly-up. It's dangerous to leave that D state VM hanging around,
> and it will until the NFS storage comes back. In a perfect world you'd
> just stop the VMs that were having the issue, or if there were no VMs
> you'd just de-register the storage from libvirt, I agree.
>

de-register won't work either... Libvirt tries a umount which will block 
as well.

Wido

>>
>> Regards,
>> F.
>>
>>
>> On 3/3/14 5:35 PM, Marcus wrote:
>>>
>>> It's the standard clustering problem. Any software that does any sort
>>> of avtive clustering is going to fence nodes that have problems, or
>>> should if it cares about your data. If the risk of losing a host due
>>> to a storage pool outage is too great, you could perhaps look at
>>> rearranging your pool-to-host correlations (certain hosts run vms from
>>> certain pools) via clusters. Note that if you register a storage pool
>>> with a cluster, it will register the pool with libvirt when the pool
>>> is not in maintenance, which, when the storage pool goes down will
>>> cause problems for the host even if no VMs from that storage are
>>> running (fetching storage stats for example will cause agent threads
>>> to hang if its NFS), so you'd need to put ceph in its own cluster and
>>> NFS in its own cluster.
>>>
>>> It's far more dangerous to leave a host in an unknown/bad state. If a
>>> host loses contact with one of your storage nodes, with HA, cloudstack
>>> will want to start the affected VMs elsewhere. If it does so, and your
>>> original host wakes up from it's NFS hang, you suddenly have a VM
>>> running in two locations, corruption ensues. You might think we could
>>> just stop the affected VMs, but NFS tends to make things that touch it
>>> go into D state, even with 'intr' and other parameters, which affects
>>> libvirt and the agent.
>>>
>>> We could perhaps open a feature request to disable all HA and just
>>> leave things as-is, disallowing operations when there are outages. If
>>> that sounds useful you can create the feature request on
>>> https://issues.apache.org/jira.
>>>
>>>
>>> On Mon, Mar 3, 2014 at 5:37 AM, Andrei Mikhailovsky <an...@arhont.com>
>>> wrote:
>>>>
>>>> Koushik, I understand that and I will put the storage into the
>>>> maintenance mode next time. However, things happen and servers crash from
>>>> time to time, which is not the reason to reboot all host servers, even those
>>>> which do not have any running vms with volumes on the nfs storage. The
>>>> bloody agent just rebooted every single host server regardless if they were
>>>> running vms with volumes on the rebooted nfs server. 95% of my vms are
>>>> running from ceph and those should have never been effected in the first
>>>> place.
>>>> ----- Original Message -----
>>>>
>>>> From: "Koushik Das" <ko...@citrix.com>
>>>> To: "<us...@cloudstack.apache.org>" <us...@cloudstack.apache.org>
>>>> Cc: dev@cloudstack.apache.org
>>>> Sent: Monday, 3 March, 2014 5:55:34 AM
>>>> Subject: Re: ALARM - ACS reboots host servers!!!
>>>>
>>>> The primary storage needs to be put in maintenance before doing any
>>>> upgrade/reboot as mentioned in the previous mails.
>>>>
>>>> -Koushik
>>>>
>>>> On 03-Mar-2014, at 6:07 AM, Marcus <sh...@gmail.com> wrote:
>>>>
>>>>> Also, please note that in the bug you referenced it doesn't have a
>>>>> problem with the reboot being triggered, but with the fact that reboot
>>>>> never completes due to hanging NFS mount (which is why the reboot
>>>>> occurs, inaccessible primary storage).
>>>>>
>>>>> On Sun, Mar 2, 2014 at 5:26 PM, Marcus <sh...@gmail.com> wrote:
>>>>>>
>>>>>> Or do you mean you have multiple primary storages and this one was not
>>>>>> in use and put into maintenance?
>>>>>>
>>>>>> On Sun, Mar 2, 2014 at 5:25 PM, Marcus <sh...@gmail.com> wrote:
>>>>>>>
>>>>>>> I'm not sure I understand. How do you expect to reboot your primary
>>>>>>> storage while vms are running? It sounds like the host is being
>>>>>>> fenced since it cannot contact the resources it depends on.
>>>>>>>
>>>>>>> On Sun, Mar 2, 2014 at 3:24 PM, Nux! <nu...@li.nux.ro> wrote:
>>>>>>>>
>>>>>>>> On 02.03.2014 21:17, Andrei Mikhailovsky wrote:
>>>>>>>>>
>>>>>>>>> Hello guys,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I've recently came across the bug CLOUDSTACK-5429 which has rebooted
>>>>>>>>> all of my host servers without properly shutting down the guest vms.
>>>>>>>>> I've simply upgraded and rebooted one of the nfs primary storage
>>>>>>>>> servers and a few minutes later, to my horror, i've found out that
>>>>>>>>> all
>>>>>>>>> of my host servers have been rebooted. Is it just me thinking so, or
>>>>>>>>> is this bug should be fixed ASAP and should be a blocker for any new
>>>>>>>>> ACS release. I mean not only does it cause downtime, but also
>>>>>>>>> possible
>>>>>>>>> data loss and server corruption.
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi Andrei,
>>>>>>>>
>>>>>>>> Do you have HA enabled and did you put that primary storage in
>>>>>>>> maintenance
>>>>>>>> mode before rebooting it?
>>>>>>>> It's my understanding that ACS relies on the shared storage to
>>>>>>>> perform HA so
>>>>>>>> if the storage goes it's expected to go berserk. I've noticed similar
>>>>>>>> behaviour in Xenserver pools without ACS.
>>>>>>>> I'd imagine a "cure" for this would be to use network distributed
>>>>>>>> "filesystems" like GlusterFS or CEPH.
>>>>>>>>
>>>>>>>> Lucian
>>>>>>>>
>>>>>>>> --
>>>>>>>> Sent from the Delta quadrant using Borg technology!
>>>>>>>>
>>>>>>>> Nux!
>>>>>>>> www.nux.ro
>>>>
>>>>
>>

Re: ALARM - ACS reboots host servers!!!

Posted by Marcus <sh...@gmail.com>.

On Tue, Mar 4, 2014 at 3:34 AM, France <ma...@isg.si> wrote:
> Hi Marcus and others.
>
> There is no need to kill of the entire hypervisor, if one of the primary
> storages fail.
> You just need to kill the VMs and probably disable SR on XenServer, because
> all other SRs and VMs have no problems.
> if you kill those, then you can safely start them elsewhere. On XenServer
> 6.2 you call destroy the VMs which lost access to NFS without any problems.

That's a great idea, but as already mentioned, it doesn't work in
practice. You can't kill a VM that is hanging in D state, waiting on
storage. I also mentioned that it causes problems for libvirt and much
of the other system not using the storage.

>
> If you really want to still kill the entire host and it's VMs in one go, I
> would suggest live migrating the VMs which have had not lost their storage
> off first, and then kill those VMs on a stale NFS by doing hard reboot.
> Additional time, while migrating working VMs, would even give some grace
> time for NFS to maybe recover. :-)

You won't be able to live migrate a VM that is stuck in D state, or
use libvirt to do so if one of its storage pools is unresponsive,
anyway.

>
> Hard reboot to recover from D state of NFS client can also be avoided by
> using soft mount options.

As mentioned, soft and intr very rarely actually work, in my
experience. I wish they did as I truly have come to loathe NFS for it.

>
> I run a bunch of Pacemaker/Corosync/Cman/Heartbeat/etc clusters and we don't
> just kill whole nodes but fence services from specific nodes. STONITH is
> implemented only when the node looses the quorum.

Sure, but how do you fence a KVM host from an NFS server? I don't
think we've written a firewall plugin that works to fence hosts from
any NFS server. Regardless, what CloudStack does is more of a poor
man's clustering, the mgmt server is the locking in the sense that it
is managing what's going on, but it's not a real clustering service.
Heck, it doesn't even STONITH, it tries to clean shutdown, which fails
as well due to hanging NFS (per the mentioned bug, to fix it they'll
need IPMI fencing or something like that).

I didn't write the code, I'm just saying that I can completely
understand why it kills nodes when it deems that their storage has
gone belly-up. It's dangerous to leave that D state VM hanging around,
and it will until the NFS storage comes back. In a perfect world you'd
just stop the VMs that were having the issue, or if there were no VMs
you'd just de-register the storage from libvirt, I agree.

>
> Regards,
> F.
>
>
> On 3/3/14 5:35 PM, Marcus wrote:
>>
>> It's the standard clustering problem. Any software that does any sort
>> of avtive clustering is going to fence nodes that have problems, or
>> should if it cares about your data. If the risk of losing a host due
>> to a storage pool outage is too great, you could perhaps look at
>> rearranging your pool-to-host correlations (certain hosts run vms from
>> certain pools) via clusters. Note that if you register a storage pool
>> with a cluster, it will register the pool with libvirt when the pool
>> is not in maintenance, which, when the storage pool goes down will
>> cause problems for the host even if no VMs from that storage are
>> running (fetching storage stats for example will cause agent threads
>> to hang if its NFS), so you'd need to put ceph in its own cluster and
>> NFS in its own cluster.
>>
>> It's far more dangerous to leave a host in an unknown/bad state. If a
>> host loses contact with one of your storage nodes, with HA, cloudstack
>> will want to start the affected VMs elsewhere. If it does so, and your
>> original host wakes up from it's NFS hang, you suddenly have a VM
>> running in two locations, corruption ensues. You might think we could
>> just stop the affected VMs, but NFS tends to make things that touch it
>> go into D state, even with 'intr' and other parameters, which affects
>> libvirt and the agent.
>>
>> We could perhaps open a feature request to disable all HA and just
>> leave things as-is, disallowing operations when there are outages. If
>> that sounds useful you can create the feature request on
>> https://issues.apache.org/jira.
>>
>>
>> On Mon, Mar 3, 2014 at 5:37 AM, Andrei Mikhailovsky <an...@arhont.com>
>> wrote:
>>>
>>> Koushik, I understand that and I will put the storage into the
>>> maintenance mode next time. However, things happen and servers crash from
>>> time to time, which is not the reason to reboot all host servers, even those
>>> which do not have any running vms with volumes on the nfs storage. The
>>> bloody agent just rebooted every single host server regardless if they were
>>> running vms with volumes on the rebooted nfs server. 95% of my vms are
>>> running from ceph and those should have never been effected in the first
>>> place.
>>> ----- Original Message -----
>>>
>>> From: "Koushik Das" <ko...@citrix.com>
>>> To: "<us...@cloudstack.apache.org>" <us...@cloudstack.apache.org>
>>> Cc: dev@cloudstack.apache.org
>>> Sent: Monday, 3 March, 2014 5:55:34 AM
>>> Subject: Re: ALARM - ACS reboots host servers!!!
>>>
>>> The primary storage needs to be put in maintenance before doing any
>>> upgrade/reboot as mentioned in the previous mails.
>>>
>>> -Koushik
>>>
>>> On 03-Mar-2014, at 6:07 AM, Marcus <sh...@gmail.com> wrote:
>>>
>>>> Also, please note that in the bug you referenced it doesn't have a
>>>> problem with the reboot being triggered, but with the fact that reboot
>>>> never completes due to hanging NFS mount (which is why the reboot
>>>> occurs, inaccessible primary storage).
>>>>
>>>> On Sun, Mar 2, 2014 at 5:26 PM, Marcus <sh...@gmail.com> wrote:
>>>>>
>>>>> Or do you mean you have multiple primary storages and this one was not
>>>>> in use and put into maintenance?
>>>>>
>>>>> On Sun, Mar 2, 2014 at 5:25 PM, Marcus <sh...@gmail.com> wrote:
>>>>>>
>>>>>> I'm not sure I understand. How do you expect to reboot your primary
>>>>>> storage while vms are running? It sounds like the host is being
>>>>>> fenced since it cannot contact the resources it depends on.
>>>>>>
>>>>>> On Sun, Mar 2, 2014 at 3:24 PM, Nux! <nu...@li.nux.ro> wrote:
>>>>>>>
>>>>>>> On 02.03.2014 21:17, Andrei Mikhailovsky wrote:
>>>>>>>>
>>>>>>>> Hello guys,
>>>>>>>>
>>>>>>>>
>>>>>>>> I've recently came across the bug CLOUDSTACK-5429 which has rebooted
>>>>>>>> all of my host servers without properly shutting down the guest vms.
>>>>>>>> I've simply upgraded and rebooted one of the nfs primary storage
>>>>>>>> servers and a few minutes later, to my horror, i've found out that
>>>>>>>> all
>>>>>>>> of my host servers have been rebooted. Is it just me thinking so, or
>>>>>>>> is this bug should be fixed ASAP and should be a blocker for any new
>>>>>>>> ACS release. I mean not only does it cause downtime, but also
>>>>>>>> possible
>>>>>>>> data loss and server corruption.
>>>>>>>
>>>>>>>
>>>>>>> Hi Andrei,
>>>>>>>
>>>>>>> Do you have HA enabled and did you put that primary storage in
>>>>>>> maintenance
>>>>>>> mode before rebooting it?
>>>>>>> It's my understanding that ACS relies on the shared storage to
>>>>>>> perform HA so
>>>>>>> if the storage goes it's expected to go berserk. I've noticed similar
>>>>>>> behaviour in Xenserver pools without ACS.
>>>>>>> I'd imagine a "cure" for this would be to use network distributed
>>>>>>> "filesystems" like GlusterFS or CEPH.
>>>>>>>
>>>>>>> Lucian
>>>>>>>
>>>>>>> --
>>>>>>> Sent from the Delta quadrant using Borg technology!
>>>>>>>
>>>>>>> Nux!
>>>>>>> www.nux.ro
>>>
>>>
>

Re: ALARM - ACS reboots host servers!!!

Posted by Marcus <sh...@gmail.com>.

On Tue, Mar 4, 2014 at 3:34 AM, France <ma...@isg.si> wrote:
> Hi Marcus and others.
>
> There is no need to kill of the entire hypervisor, if one of the primary
> storages fail.
> You just need to kill the VMs and probably disable SR on XenServer, because
> all other SRs and VMs have no problems.
> if you kill those, then you can safely start them elsewhere. On XenServer
> 6.2 you call destroy the VMs which lost access to NFS without any problems.

That's a great idea, but as already mentioned, it doesn't work in
practice. You can't kill a VM that is hanging in D state, waiting on
storage. I also mentioned that it causes problems for libvirt and much
of the other system not using the storage.

>
> If you really want to still kill the entire host and it's VMs in one go, I
> would suggest live migrating the VMs which have had not lost their storage
> off first, and then kill those VMs on a stale NFS by doing hard reboot.
> Additional time, while migrating working VMs, would even give some grace
> time for NFS to maybe recover. :-)

You won't be able to live migrate a VM that is stuck in D state, or
use libvirt to do so if one of its storage pools is unresponsive,
anyway.

>
> Hard reboot to recover from D state of NFS client can also be avoided by
> using soft mount options.

As mentioned, soft and intr very rarely actually work, in my
experience. I wish they did as I truly have come to loathe NFS for it.

>
> I run a bunch of Pacemaker/Corosync/Cman/Heartbeat/etc clusters and we don't
> just kill whole nodes but fence services from specific nodes. STONITH is
> implemented only when the node looses the quorum.

Sure, but how do you fence a KVM host from an NFS server? I don't
think we've written a firewall plugin that works to fence hosts from
any NFS server. Regardless, what CloudStack does is more of a poor
man's clustering, the mgmt server is the locking in the sense that it
is managing what's going on, but it's not a real clustering service.
Heck, it doesn't even STONITH, it tries to clean shutdown, which fails
as well due to hanging NFS (per the mentioned bug, to fix it they'll
need IPMI fencing or something like that).

I didn't write the code, I'm just saying that I can completely
understand why it kills nodes when it deems that their storage has
gone belly-up. It's dangerous to leave that D state VM hanging around,
and it will until the NFS storage comes back. In a perfect world you'd
just stop the VMs that were having the issue, or if there were no VMs
you'd just de-register the storage from libvirt, I agree.

>
> Regards,
> F.
>
>
> On 3/3/14 5:35 PM, Marcus wrote:
>>
>> It's the standard clustering problem. Any software that does any sort
>> of avtive clustering is going to fence nodes that have problems, or
>> should if it cares about your data. If the risk of losing a host due
>> to a storage pool outage is too great, you could perhaps look at
>> rearranging your pool-to-host correlations (certain hosts run vms from
>> certain pools) via clusters. Note that if you register a storage pool
>> with a cluster, it will register the pool with libvirt when the pool
>> is not in maintenance, which, when the storage pool goes down will
>> cause problems for the host even if no VMs from that storage are
>> running (fetching storage stats for example will cause agent threads
>> to hang if its NFS), so you'd need to put ceph in its own cluster and
>> NFS in its own cluster.
>>
>> It's far more dangerous to leave a host in an unknown/bad state. If a
>> host loses contact with one of your storage nodes, with HA, cloudstack
>> will want to start the affected VMs elsewhere. If it does so, and your
>> original host wakes up from it's NFS hang, you suddenly have a VM
>> running in two locations, corruption ensues. You might think we could
>> just stop the affected VMs, but NFS tends to make things that touch it
>> go into D state, even with 'intr' and other parameters, which affects
>> libvirt and the agent.
>>
>> We could perhaps open a feature request to disable all HA and just
>> leave things as-is, disallowing operations when there are outages. If
>> that sounds useful you can create the feature request on
>> https://issues.apache.org/jira.
>>
>>
>> On Mon, Mar 3, 2014 at 5:37 AM, Andrei Mikhailovsky <an...@arhont.com>
>> wrote:
>>>
>>> Koushik, I understand that and I will put the storage into the
>>> maintenance mode next time. However, things happen and servers crash from
>>> time to time, which is not the reason to reboot all host servers, even those
>>> which do not have any running vms with volumes on the nfs storage. The
>>> bloody agent just rebooted every single host server regardless if they were
>>> running vms with volumes on the rebooted nfs server. 95% of my vms are
>>> running from ceph and those should have never been effected in the first
>>> place.
>>> ----- Original Message -----
>>>
>>> From: "Koushik Das" <ko...@citrix.com>
>>> To: "<us...@cloudstack.apache.org>" <us...@cloudstack.apache.org>
>>> Cc: dev@cloudstack.apache.org
>>> Sent: Monday, 3 March, 2014 5:55:34 AM
>>> Subject: Re: ALARM - ACS reboots host servers!!!
>>>
>>> The primary storage needs to be put in maintenance before doing any
>>> upgrade/reboot as mentioned in the previous mails.
>>>
>>> -Koushik
>>>
>>> On 03-Mar-2014, at 6:07 AM, Marcus <sh...@gmail.com> wrote:
>>>
>>>> Also, please note that in the bug you referenced it doesn't have a
>>>> problem with the reboot being triggered, but with the fact that reboot
>>>> never completes due to hanging NFS mount (which is why the reboot
>>>> occurs, inaccessible primary storage).
>>>>
>>>> On Sun, Mar 2, 2014 at 5:26 PM, Marcus <sh...@gmail.com> wrote:
>>>>>
>>>>> Or do you mean you have multiple primary storages and this one was not
>>>>> in use and put into maintenance?
>>>>>
>>>>> On Sun, Mar 2, 2014 at 5:25 PM, Marcus <sh...@gmail.com> wrote:
>>>>>>
>>>>>> I'm not sure I understand. How do you expect to reboot your primary
>>>>>> storage while vms are running? It sounds like the host is being
>>>>>> fenced since it cannot contact the resources it depends on.
>>>>>>
>>>>>> On Sun, Mar 2, 2014 at 3:24 PM, Nux! <nu...@li.nux.ro> wrote:
>>>>>>>
>>>>>>> On 02.03.2014 21:17, Andrei Mikhailovsky wrote:
>>>>>>>>
>>>>>>>> Hello guys,
>>>>>>>>
>>>>>>>>
>>>>>>>> I've recently came across the bug CLOUDSTACK-5429 which has rebooted
>>>>>>>> all of my host servers without properly shutting down the guest vms.
>>>>>>>> I've simply upgraded and rebooted one of the nfs primary storage
>>>>>>>> servers and a few minutes later, to my horror, i've found out that
>>>>>>>> all
>>>>>>>> of my host servers have been rebooted. Is it just me thinking so, or
>>>>>>>> is this bug should be fixed ASAP and should be a blocker for any new
>>>>>>>> ACS release. I mean not only does it cause downtime, but also
>>>>>>>> possible
>>>>>>>> data loss and server corruption.
>>>>>>>
>>>>>>>
>>>>>>> Hi Andrei,
>>>>>>>
>>>>>>> Do you have HA enabled and did you put that primary storage in
>>>>>>> maintenance
>>>>>>> mode before rebooting it?
>>>>>>> It's my understanding that ACS relies on the shared storage to
>>>>>>> perform HA so
>>>>>>> if the storage goes it's expected to go berserk. I've noticed similar
>>>>>>> behaviour in Xenserver pools without ACS.
>>>>>>> I'd imagine a "cure" for this would be to use network distributed
>>>>>>> "filesystems" like GlusterFS or CEPH.
>>>>>>>
>>>>>>> Lucian
>>>>>>>
>>>>>>> --
>>>>>>> Sent from the Delta quadrant using Borg technology!
>>>>>>>
>>>>>>> Nux!
>>>>>>> www.nux.ro
>>>
>>>
>

Re: ALARM - ACS reboots host servers!!!

Posted by Nux! <nu...@li.nux.ro>.

On 04.03.2014 12:55, Andrei Mikhailovsky wrote:

> Regarding having nfs and ceph storage in different clusters - sounds
> like a good idea for majority of cases, however, my setup will not
> allow me to do that just yet. I am using ceph for my root and data
> volumes and NFS for backup volumes.

Having tiered storage is one of the stronger features that have drawn 
me towards Cloudstack, it should work better.
I do plan to have a second, slower tier for backups and other more 
passive applications.

> I do currently need the backup
> volumes as snapshotting with KVM is somewhat broken / not fully
> working in 4.2.1. It has been improved from version 4.2.0 as it was
> completely broken. I am waiting for 4.3.0 where, hopefully, I would be
> able to keep snapshots on the primary storage (currently this feature
> is broken) which will make the snapshots with KVM usable.

KVM volume snapshots worked well in 4.2.1 AFAIK and they still work 
well in 4.3, but VM snapshots are still not supported and I don't think 
they will be any time soon. We might get somewhere with it if we opt for 
LVM thin storage and snapshots, that'd be cool.

Lucian

-- 
Sent from the Delta quadrant using Borg technology!

Nux!
www.nux.ro

Re: ALARM - ACS reboots host servers!!!

Posted by Nux! <nu...@li.nux.ro>.

On 04.03.2014 12:55, Andrei Mikhailovsky wrote:

> Regarding having nfs and ceph storage in different clusters - sounds
> like a good idea for majority of cases, however, my setup will not
> allow me to do that just yet. I am using ceph for my root and data
> volumes and NFS for backup volumes.

Having tiered storage is one of the stronger features that have drawn 
me towards Cloudstack, it should work better.
I do plan to have a second, slower tier for backups and other more 
passive applications.

> I do currently need the backup
> volumes as snapshotting with KVM is somewhat broken / not fully
> working in 4.2.1. It has been improved from version 4.2.0 as it was
> completely broken. I am waiting for 4.3.0 where, hopefully, I would be
> able to keep snapshots on the primary storage (currently this feature
> is broken) which will make the snapshots with KVM usable.

KVM volume snapshots worked well in 4.2.1 AFAIK and they still work 
well in 4.3, but VM snapshots are still not supported and I don't think 
they will be any time soon. We might get somewhere with it if we opt for 
LVM thin storage and snapshots, that'd be cool.

Lucian

-- 
Sent from the Delta quadrant using Borg technology!

Nux!
www.nux.ro

Re: ALARM - ACS reboots host servers!!!

Posted by Andrei Mikhailovsky <an...@arhont.com>.

I agree with France, sounds like a more sensible idea and killing hosts left, right and centre with live vms. I now understand the reasons behind killing the troubled host server, however, this should be done without killing live vms with fully working volumes. 


Regarding having nfs and ceph storage in different clusters - sounds like a good idea for majority of cases, however, my setup will not allow me to do that just yet. I am using ceph for my root and data volumes and NFS for backup volumes. I do currently need the backup volumes as snapshotting with KVM is somewhat broken / not fully working in 4.2.1. It has been improved from version 4.2.0 as it was completely broken. I am waiting for 4.3.0 where, hopefully, I would be able to keep snapshots on the primary storage (currently this feature is broken) which will make the snapshots with KVM usable. 


Cheers for your help guys 
----- Original Message -----

From: "France" <ma...@isg.si> 
To: users@cloudstack.apache.org, dev@cloudstack.apache.org 
Sent: Tuesday, 4 March, 2014 10:34:36 AM 
Subject: Re: ALARM - ACS reboots host servers!!! 

Hi Marcus and others. 

There is no need to kill of the entire hypervisor, if one of the primary 
storages fail. 
You just need to kill the VMs and probably disable SR on XenServer, 
because all other SRs and VMs have no problems. 
if you kill those, then you can safely start them elsewhere. On 
XenServer 6.2 you call destroy the VMs which lost access to NFS without 
any problems. 

If you really want to still kill the entire host and it's VMs in one go, 
I would suggest live migrating the VMs which have had not lost their 
storage off first, and then kill those VMs on a stale NFS by doing hard 
reboot. Additional time, while migrating working VMs, would even give 
some grace time for NFS to maybe recover. :-) 

Hard reboot to recover from D state of NFS client can also be avoided by 
using soft mount options. 

I run a bunch of Pacemaker/Corosync/Cman/Heartbeat/etc clusters and we 
don't just kill whole nodes but fence services from specific nodes. 
STONITH is implemented only when the node looses the quorum. 

Regards, 
F. 

On 3/3/14 5:35 PM, Marcus wrote: 
> It's the standard clustering problem. Any software that does any sort 
> of avtive clustering is going to fence nodes that have problems, or 
> should if it cares about your data. If the risk of losing a host due 
> to a storage pool outage is too great, you could perhaps look at 
> rearranging your pool-to-host correlations (certain hosts run vms from 
> certain pools) via clusters. Note that if you register a storage pool 
> with a cluster, it will register the pool with libvirt when the pool 
> is not in maintenance, which, when the storage pool goes down will 
> cause problems for the host even if no VMs from that storage are 
> running (fetching storage stats for example will cause agent threads 
> to hang if its NFS), so you'd need to put ceph in its own cluster and 
> NFS in its own cluster. 
> 
> It's far more dangerous to leave a host in an unknown/bad state. If a 
> host loses contact with one of your storage nodes, with HA, cloudstack 
> will want to start the affected VMs elsewhere. If it does so, and your 
> original host wakes up from it's NFS hang, you suddenly have a VM 
> running in two locations, corruption ensues. You might think we could 
> just stop the affected VMs, but NFS tends to make things that touch it 
> go into D state, even with 'intr' and other parameters, which affects 
> libvirt and the agent. 
> 
> We could perhaps open a feature request to disable all HA and just 
> leave things as-is, disallowing operations when there are outages. If 
> that sounds useful you can create the feature request on 
> https://issues.apache.org/jira. 
> 
> 
> On Mon, Mar 3, 2014 at 5:37 AM, Andrei Mikhailovsky <an...@arhont.com> wrote: 
>> Koushik, I understand that and I will put the storage into the maintenance mode next time. However, things happen and servers crash from time to time, which is not the reason to reboot all host servers, even those which do not have any running vms with volumes on the nfs storage. The bloody agent just rebooted every single host server regardless if they were running vms with volumes on the rebooted nfs server. 95% of my vms are running from ceph and those should have never been effected in the first place. 
>> ----- Original Message ----- 
>> 
>> From: "Koushik Das" <ko...@citrix.com> 
>> To: "<us...@cloudstack.apache.org>" <us...@cloudstack.apache.org> 
>> Cc: dev@cloudstack.apache.org 
>> Sent: Monday, 3 March, 2014 5:55:34 AM 
>> Subject: Re: ALARM - ACS reboots host servers!!! 
>> 
>> The primary storage needs to be put in maintenance before doing any upgrade/reboot as mentioned in the previous mails. 
>> 
>> -Koushik 
>> 
>> On 03-Mar-2014, at 6:07 AM, Marcus <sh...@gmail.com> wrote: 
>> 
>>> Also, please note that in the bug you referenced it doesn't have a 
>>> problem with the reboot being triggered, but with the fact that reboot 
>>> never completes due to hanging NFS mount (which is why the reboot 
>>> occurs, inaccessible primary storage). 
>>> 
>>> On Sun, Mar 2, 2014 at 5:26 PM, Marcus <sh...@gmail.com> wrote: 
>>>> Or do you mean you have multiple primary storages and this one was not 
>>>> in use and put into maintenance? 
>>>> 
>>>> On Sun, Mar 2, 2014 at 5:25 PM, Marcus <sh...@gmail.com> wrote: 
>>>>> I'm not sure I understand. How do you expect to reboot your primary 
>>>>> storage while vms are running? It sounds like the host is being 
>>>>> fenced since it cannot contact the resources it depends on. 
>>>>> 
>>>>> On Sun, Mar 2, 2014 at 3:24 PM, Nux! <nu...@li.nux.ro> wrote: 
>>>>>> On 02.03.2014 21:17, Andrei Mikhailovsky wrote: 
>>>>>>> Hello guys, 
>>>>>>> 
>>>>>>> 
>>>>>>> I've recently came across the bug CLOUDSTACK-5429 which has rebooted 
>>>>>>> all of my host servers without properly shutting down the guest vms. 
>>>>>>> I've simply upgraded and rebooted one of the nfs primary storage 
>>>>>>> servers and a few minutes later, to my horror, i've found out that all 
>>>>>>> of my host servers have been rebooted. Is it just me thinking so, or 
>>>>>>> is this bug should be fixed ASAP and should be a blocker for any new 
>>>>>>> ACS release. I mean not only does it cause downtime, but also possible 
>>>>>>> data loss and server corruption. 
>>>>>> 
>>>>>> Hi Andrei, 
>>>>>> 
>>>>>> Do you have HA enabled and did you put that primary storage in maintenance 
>>>>>> mode before rebooting it? 
>>>>>> It's my understanding that ACS relies on the shared storage to perform HA so 
>>>>>> if the storage goes it's expected to go berserk. I've noticed similar 
>>>>>> behaviour in Xenserver pools without ACS. 
>>>>>> I'd imagine a "cure" for this would be to use network distributed 
>>>>>> "filesystems" like GlusterFS or CEPH. 
>>>>>> 
>>>>>> Lucian 
>>>>>> 
>>>>>> -- 
>>>>>> Sent from the Delta quadrant using Borg technology! 
>>>>>> 
>>>>>> Nux! 
>>>>>> www.nux.ro 
>>

Re: ALARM - ACS reboots host servers!!!

Posted by Andrei Mikhailovsky <an...@arhont.com>.

I agree with France, sounds like a more sensible idea and killing hosts left, right and centre with live vms. I now understand the reasons behind killing the troubled host server, however, this should be done without killing live vms with fully working volumes. 


Regarding having nfs and ceph storage in different clusters - sounds like a good idea for majority of cases, however, my setup will not allow me to do that just yet. I am using ceph for my root and data volumes and NFS for backup volumes. I do currently need the backup volumes as snapshotting with KVM is somewhat broken / not fully working in 4.2.1. It has been improved from version 4.2.0 as it was completely broken. I am waiting for 4.3.0 where, hopefully, I would be able to keep snapshots on the primary storage (currently this feature is broken) which will make the snapshots with KVM usable. 


Cheers for your help guys 
----- Original Message -----

From: "France" <ma...@isg.si> 
To: users@cloudstack.apache.org, dev@cloudstack.apache.org 
Sent: Tuesday, 4 March, 2014 10:34:36 AM 
Subject: Re: ALARM - ACS reboots host servers!!! 

Hi Marcus and others. 

There is no need to kill of the entire hypervisor, if one of the primary 
storages fail. 
You just need to kill the VMs and probably disable SR on XenServer, 
because all other SRs and VMs have no problems. 
if you kill those, then you can safely start them elsewhere. On 
XenServer 6.2 you call destroy the VMs which lost access to NFS without 
any problems. 

If you really want to still kill the entire host and it's VMs in one go, 
I would suggest live migrating the VMs which have had not lost their 
storage off first, and then kill those VMs on a stale NFS by doing hard 
reboot. Additional time, while migrating working VMs, would even give 
some grace time for NFS to maybe recover. :-) 

Hard reboot to recover from D state of NFS client can also be avoided by 
using soft mount options. 

I run a bunch of Pacemaker/Corosync/Cman/Heartbeat/etc clusters and we 
don't just kill whole nodes but fence services from specific nodes. 
STONITH is implemented only when the node looses the quorum. 

Regards, 
F. 

On 3/3/14 5:35 PM, Marcus wrote: 
> It's the standard clustering problem. Any software that does any sort 
> of avtive clustering is going to fence nodes that have problems, or 
> should if it cares about your data. If the risk of losing a host due 
> to a storage pool outage is too great, you could perhaps look at 
> rearranging your pool-to-host correlations (certain hosts run vms from 
> certain pools) via clusters. Note that if you register a storage pool 
> with a cluster, it will register the pool with libvirt when the pool 
> is not in maintenance, which, when the storage pool goes down will 
> cause problems for the host even if no VMs from that storage are 
> running (fetching storage stats for example will cause agent threads 
> to hang if its NFS), so you'd need to put ceph in its own cluster and 
> NFS in its own cluster. 
> 
> It's far more dangerous to leave a host in an unknown/bad state. If a 
> host loses contact with one of your storage nodes, with HA, cloudstack 
> will want to start the affected VMs elsewhere. If it does so, and your 
> original host wakes up from it's NFS hang, you suddenly have a VM 
> running in two locations, corruption ensues. You might think we could 
> just stop the affected VMs, but NFS tends to make things that touch it 
> go into D state, even with 'intr' and other parameters, which affects 
> libvirt and the agent. 
> 
> We could perhaps open a feature request to disable all HA and just 
> leave things as-is, disallowing operations when there are outages. If 
> that sounds useful you can create the feature request on 
> https://issues.apache.org/jira. 
> 
> 
> On Mon, Mar 3, 2014 at 5:37 AM, Andrei Mikhailovsky <an...@arhont.com> wrote: 
>> Koushik, I understand that and I will put the storage into the maintenance mode next time. However, things happen and servers crash from time to time, which is not the reason to reboot all host servers, even those which do not have any running vms with volumes on the nfs storage. The bloody agent just rebooted every single host server regardless if they were running vms with volumes on the rebooted nfs server. 95% of my vms are running from ceph and those should have never been effected in the first place. 
>> ----- Original Message ----- 
>> 
>> From: "Koushik Das" <ko...@citrix.com> 
>> To: "<us...@cloudstack.apache.org>" <us...@cloudstack.apache.org> 
>> Cc: dev@cloudstack.apache.org 
>> Sent: Monday, 3 March, 2014 5:55:34 AM 
>> Subject: Re: ALARM - ACS reboots host servers!!! 
>> 
>> The primary storage needs to be put in maintenance before doing any upgrade/reboot as mentioned in the previous mails. 
>> 
>> -Koushik 
>> 
>> On 03-Mar-2014, at 6:07 AM, Marcus <sh...@gmail.com> wrote: 
>> 
>>> Also, please note that in the bug you referenced it doesn't have a 
>>> problem with the reboot being triggered, but with the fact that reboot 
>>> never completes due to hanging NFS mount (which is why the reboot 
>>> occurs, inaccessible primary storage). 
>>> 
>>> On Sun, Mar 2, 2014 at 5:26 PM, Marcus <sh...@gmail.com> wrote: 
>>>> Or do you mean you have multiple primary storages and this one was not 
>>>> in use and put into maintenance? 
>>>> 
>>>> On Sun, Mar 2, 2014 at 5:25 PM, Marcus <sh...@gmail.com> wrote: 
>>>>> I'm not sure I understand. How do you expect to reboot your primary 
>>>>> storage while vms are running? It sounds like the host is being 
>>>>> fenced since it cannot contact the resources it depends on. 
>>>>> 
>>>>> On Sun, Mar 2, 2014 at 3:24 PM, Nux! <nu...@li.nux.ro> wrote: 
>>>>>> On 02.03.2014 21:17, Andrei Mikhailovsky wrote: 
>>>>>>> Hello guys, 
>>>>>>> 
>>>>>>> 
>>>>>>> I've recently came across the bug CLOUDSTACK-5429 which has rebooted 
>>>>>>> all of my host servers without properly shutting down the guest vms. 
>>>>>>> I've simply upgraded and rebooted one of the nfs primary storage 
>>>>>>> servers and a few minutes later, to my horror, i've found out that all 
>>>>>>> of my host servers have been rebooted. Is it just me thinking so, or 
>>>>>>> is this bug should be fixed ASAP and should be a blocker for any new 
>>>>>>> ACS release. I mean not only does it cause downtime, but also possible 
>>>>>>> data loss and server corruption. 
>>>>>> 
>>>>>> Hi Andrei, 
>>>>>> 
>>>>>> Do you have HA enabled and did you put that primary storage in maintenance 
>>>>>> mode before rebooting it? 
>>>>>> It's my understanding that ACS relies on the shared storage to perform HA so 
>>>>>> if the storage goes it's expected to go berserk. I've noticed similar 
>>>>>> behaviour in Xenserver pools without ACS. 
>>>>>> I'd imagine a "cure" for this would be to use network distributed 
>>>>>> "filesystems" like GlusterFS or CEPH. 
>>>>>> 
>>>>>> Lucian 
>>>>>> 
>>>>>> -- 
>>>>>> Sent from the Delta quadrant using Borg technology! 
>>>>>> 
>>>>>> Nux! 
>>>>>> www.nux.ro 
>>

Re: ALARM - ACS reboots host servers!!!

Posted by France <ma...@isg.si>.

Hi Marcus and others.

There is no need to kill of the entire hypervisor, if one of the primary 
storages fail.
You just need to kill the VMs and probably disable SR on XenServer, 
because all other SRs and VMs have no problems.
if you kill those, then you can safely start them elsewhere. On 
XenServer 6.2 you call destroy the VMs which lost access to NFS without 
any problems.

If you really want to still kill the entire host and it's VMs in one go, 
I would suggest live migrating the VMs which have had not lost their 
storage off first, and then kill those VMs on a stale NFS by doing hard 
reboot. Additional time, while migrating working VMs, would even give 
some grace time for NFS to maybe recover. :-)

Hard reboot to recover from D state of NFS client can also be avoided by 
using soft mount options.

I run a bunch of Pacemaker/Corosync/Cman/Heartbeat/etc clusters and we 
don't just kill whole nodes but fence services from specific nodes. 
STONITH is implemented only when the node looses the quorum.

Regards,
F.

On 3/3/14 5:35 PM, Marcus wrote:
> It's the standard clustering problem. Any software that does any sort
> of avtive clustering is going to fence nodes that have problems, or
> should if it cares about your data. If the risk of losing a host due
> to a storage pool outage is too great, you could perhaps look at
> rearranging your pool-to-host correlations (certain hosts run vms from
> certain pools) via clusters. Note that if you register a storage pool
> with a cluster, it will register the pool with libvirt when the pool
> is not in maintenance, which, when the storage pool goes down will
> cause problems for the host even if no VMs from that storage are
> running (fetching storage stats for example will cause agent threads
> to hang if its NFS), so you'd need to put ceph in its own cluster and
> NFS in its own cluster.
>
> It's far more dangerous to leave a host in an unknown/bad state. If a
> host loses contact with one of your storage nodes, with HA, cloudstack
> will want to start the affected VMs elsewhere. If it does so, and your
> original host wakes up from it's NFS hang, you suddenly have a VM
> running in two locations, corruption ensues. You might think we could
> just stop the affected VMs, but NFS tends to make things that touch it
> go into D state, even with 'intr' and other parameters, which affects
> libvirt and the agent.
>
> We could perhaps open a feature request to disable all HA and just
> leave things as-is, disallowing operations when there are outages. If
> that sounds useful you can create the feature request on
> https://issues.apache.org/jira.
>
>
> On Mon, Mar 3, 2014 at 5:37 AM, Andrei Mikhailovsky <an...@arhont.com> wrote:
>> Koushik, I understand that and I will put the storage into the maintenance mode next time. However, things happen and servers crash from time to time, which is not the reason to reboot all host servers, even those which do not have any running vms with volumes on the nfs storage. The bloody agent just rebooted every single host server regardless if they were running vms with volumes on the rebooted nfs server. 95% of my vms are running from ceph and those should have never been effected in the first place.
>> ----- Original Message -----
>>
>> From: "Koushik Das" <ko...@citrix.com>
>> To: "<us...@cloudstack.apache.org>" <us...@cloudstack.apache.org>
>> Cc: dev@cloudstack.apache.org
>> Sent: Monday, 3 March, 2014 5:55:34 AM
>> Subject: Re: ALARM - ACS reboots host servers!!!
>>
>> The primary storage needs to be put in maintenance before doing any upgrade/reboot as mentioned in the previous mails.
>>
>> -Koushik
>>
>> On 03-Mar-2014, at 6:07 AM, Marcus <sh...@gmail.com> wrote:
>>
>>> Also, please note that in the bug you referenced it doesn't have a
>>> problem with the reboot being triggered, but with the fact that reboot
>>> never completes due to hanging NFS mount (which is why the reboot
>>> occurs, inaccessible primary storage).
>>>
>>> On Sun, Mar 2, 2014 at 5:26 PM, Marcus <sh...@gmail.com> wrote:
>>>> Or do you mean you have multiple primary storages and this one was not
>>>> in use and put into maintenance?
>>>>
>>>> On Sun, Mar 2, 2014 at 5:25 PM, Marcus <sh...@gmail.com> wrote:
>>>>> I'm not sure I understand. How do you expect to reboot your primary
>>>>> storage while vms are running? It sounds like the host is being
>>>>> fenced since it cannot contact the resources it depends on.
>>>>>
>>>>> On Sun, Mar 2, 2014 at 3:24 PM, Nux! <nu...@li.nux.ro> wrote:
>>>>>> On 02.03.2014 21:17, Andrei Mikhailovsky wrote:
>>>>>>> Hello guys,
>>>>>>>
>>>>>>>
>>>>>>> I've recently came across the bug CLOUDSTACK-5429 which has rebooted
>>>>>>> all of my host servers without properly shutting down the guest vms.
>>>>>>> I've simply upgraded and rebooted one of the nfs primary storage
>>>>>>> servers and a few minutes later, to my horror, i've found out that all
>>>>>>> of my host servers have been rebooted. Is it just me thinking so, or
>>>>>>> is this bug should be fixed ASAP and should be a blocker for any new
>>>>>>> ACS release. I mean not only does it cause downtime, but also possible
>>>>>>> data loss and server corruption.
>>>>>>
>>>>>> Hi Andrei,
>>>>>>
>>>>>> Do you have HA enabled and did you put that primary storage in maintenance
>>>>>> mode before rebooting it?
>>>>>> It's my understanding that ACS relies on the shared storage to perform HA so
>>>>>> if the storage goes it's expected to go berserk. I've noticed similar
>>>>>> behaviour in Xenserver pools without ACS.
>>>>>> I'd imagine a "cure" for this would be to use network distributed
>>>>>> "filesystems" like GlusterFS or CEPH.
>>>>>>
>>>>>> Lucian
>>>>>>
>>>>>> --
>>>>>> Sent from the Delta quadrant using Borg technology!
>>>>>>
>>>>>> Nux!
>>>>>> www.nux.ro
>>

Re: ALARM - ACS reboots host servers!!!

Posted by France <ma...@isg.si>.

Hi Marcus and others.

There is no need to kill of the entire hypervisor, if one of the primary 
storages fail.
You just need to kill the VMs and probably disable SR on XenServer, 
because all other SRs and VMs have no problems.
if you kill those, then you can safely start them elsewhere. On 
XenServer 6.2 you call destroy the VMs which lost access to NFS without 
any problems.

If you really want to still kill the entire host and it's VMs in one go, 
I would suggest live migrating the VMs which have had not lost their 
storage off first, and then kill those VMs on a stale NFS by doing hard 
reboot. Additional time, while migrating working VMs, would even give 
some grace time for NFS to maybe recover. :-)

Hard reboot to recover from D state of NFS client can also be avoided by 
using soft mount options.

I run a bunch of Pacemaker/Corosync/Cman/Heartbeat/etc clusters and we 
don't just kill whole nodes but fence services from specific nodes. 
STONITH is implemented only when the node looses the quorum.

Regards,
F.

On 3/3/14 5:35 PM, Marcus wrote:
> It's the standard clustering problem. Any software that does any sort
> of avtive clustering is going to fence nodes that have problems, or
> should if it cares about your data. If the risk of losing a host due
> to a storage pool outage is too great, you could perhaps look at
> rearranging your pool-to-host correlations (certain hosts run vms from
> certain pools) via clusters. Note that if you register a storage pool
> with a cluster, it will register the pool with libvirt when the pool
> is not in maintenance, which, when the storage pool goes down will
> cause problems for the host even if no VMs from that storage are
> running (fetching storage stats for example will cause agent threads
> to hang if its NFS), so you'd need to put ceph in its own cluster and
> NFS in its own cluster.
>
> It's far more dangerous to leave a host in an unknown/bad state. If a
> host loses contact with one of your storage nodes, with HA, cloudstack
> will want to start the affected VMs elsewhere. If it does so, and your
> original host wakes up from it's NFS hang, you suddenly have a VM
> running in two locations, corruption ensues. You might think we could
> just stop the affected VMs, but NFS tends to make things that touch it
> go into D state, even with 'intr' and other parameters, which affects
> libvirt and the agent.
>
> We could perhaps open a feature request to disable all HA and just
> leave things as-is, disallowing operations when there are outages. If
> that sounds useful you can create the feature request on
> https://issues.apache.org/jira.
>
>
> On Mon, Mar 3, 2014 at 5:37 AM, Andrei Mikhailovsky <an...@arhont.com> wrote:
>> Koushik, I understand that and I will put the storage into the maintenance mode next time. However, things happen and servers crash from time to time, which is not the reason to reboot all host servers, even those which do not have any running vms with volumes on the nfs storage. The bloody agent just rebooted every single host server regardless if they were running vms with volumes on the rebooted nfs server. 95% of my vms are running from ceph and those should have never been effected in the first place.
>> ----- Original Message -----
>>
>> From: "Koushik Das" <ko...@citrix.com>
>> To: "<us...@cloudstack.apache.org>" <us...@cloudstack.apache.org>
>> Cc: dev@cloudstack.apache.org
>> Sent: Monday, 3 March, 2014 5:55:34 AM
>> Subject: Re: ALARM - ACS reboots host servers!!!
>>
>> The primary storage needs to be put in maintenance before doing any upgrade/reboot as mentioned in the previous mails.
>>
>> -Koushik
>>
>> On 03-Mar-2014, at 6:07 AM, Marcus <sh...@gmail.com> wrote:
>>
>>> Also, please note that in the bug you referenced it doesn't have a
>>> problem with the reboot being triggered, but with the fact that reboot
>>> never completes due to hanging NFS mount (which is why the reboot
>>> occurs, inaccessible primary storage).
>>>
>>> On Sun, Mar 2, 2014 at 5:26 PM, Marcus <sh...@gmail.com> wrote:
>>>> Or do you mean you have multiple primary storages and this one was not
>>>> in use and put into maintenance?
>>>>
>>>> On Sun, Mar 2, 2014 at 5:25 PM, Marcus <sh...@gmail.com> wrote:
>>>>> I'm not sure I understand. How do you expect to reboot your primary
>>>>> storage while vms are running? It sounds like the host is being
>>>>> fenced since it cannot contact the resources it depends on.
>>>>>
>>>>> On Sun, Mar 2, 2014 at 3:24 PM, Nux! <nu...@li.nux.ro> wrote:
>>>>>> On 02.03.2014 21:17, Andrei Mikhailovsky wrote:
>>>>>>> Hello guys,
>>>>>>>
>>>>>>>
>>>>>>> I've recently came across the bug CLOUDSTACK-5429 which has rebooted
>>>>>>> all of my host servers without properly shutting down the guest vms.
>>>>>>> I've simply upgraded and rebooted one of the nfs primary storage
>>>>>>> servers and a few minutes later, to my horror, i've found out that all
>>>>>>> of my host servers have been rebooted. Is it just me thinking so, or
>>>>>>> is this bug should be fixed ASAP and should be a blocker for any new
>>>>>>> ACS release. I mean not only does it cause downtime, but also possible
>>>>>>> data loss and server corruption.
>>>>>>
>>>>>> Hi Andrei,
>>>>>>
>>>>>> Do you have HA enabled and did you put that primary storage in maintenance
>>>>>> mode before rebooting it?
>>>>>> It's my understanding that ACS relies on the shared storage to perform HA so
>>>>>> if the storage goes it's expected to go berserk. I've noticed similar
>>>>>> behaviour in Xenserver pools without ACS.
>>>>>> I'd imagine a "cure" for this would be to use network distributed
>>>>>> "filesystems" like GlusterFS or CEPH.
>>>>>>
>>>>>> Lucian
>>>>>>
>>>>>> --
>>>>>> Sent from the Delta quadrant using Borg technology!
>>>>>>
>>>>>> Nux!
>>>>>> www.nux.ro
>>

Re: ALARM - ACS reboots host servers!!!

Posted by Marcus <sh...@gmail.com>.

It's the standard clustering problem. Any software that does any sort
of avtive clustering is going to fence nodes that have problems, or
should if it cares about your data. If the risk of losing a host due
to a storage pool outage is too great, you could perhaps look at
rearranging your pool-to-host correlations (certain hosts run vms from
certain pools) via clusters. Note that if you register a storage pool
with a cluster, it will register the pool with libvirt when the pool
is not in maintenance, which, when the storage pool goes down will
cause problems for the host even if no VMs from that storage are
running (fetching storage stats for example will cause agent threads
to hang if its NFS), so you'd need to put ceph in its own cluster and
NFS in its own cluster.

It's far more dangerous to leave a host in an unknown/bad state. If a
host loses contact with one of your storage nodes, with HA, cloudstack
will want to start the affected VMs elsewhere. If it does so, and your
original host wakes up from it's NFS hang, you suddenly have a VM
running in two locations, corruption ensues. You might think we could
just stop the affected VMs, but NFS tends to make things that touch it
go into D state, even with 'intr' and other parameters, which affects
libvirt and the agent.

We could perhaps open a feature request to disable all HA and just
leave things as-is, disallowing operations when there are outages. If
that sounds useful you can create the feature request on
https://issues.apache.org/jira.

On Mon, Mar 3, 2014 at 5:37 AM, Andrei Mikhailovsky <an...@arhont.com> wrote:
>
> Koushik, I understand that and I will put the storage into the maintenance mode next time. However, things happen and servers crash from time to time, which is not the reason to reboot all host servers, even those which do not have any running vms with volumes on the nfs storage. The bloody agent just rebooted every single host server regardless if they were running vms with volumes on the rebooted nfs server. 95% of my vms are running from ceph and those should have never been effected in the first place.
> ----- Original Message -----
>
> From: "Koushik Das" <ko...@citrix.com>
> To: "<us...@cloudstack.apache.org>" <us...@cloudstack.apache.org>
> Cc: dev@cloudstack.apache.org
> Sent: Monday, 3 March, 2014 5:55:34 AM
> Subject: Re: ALARM - ACS reboots host servers!!!
>
> The primary storage needs to be put in maintenance before doing any upgrade/reboot as mentioned in the previous mails.
>
> -Koushik
>
> On 03-Mar-2014, at 6:07 AM, Marcus <sh...@gmail.com> wrote:
>
>> Also, please note that in the bug you referenced it doesn't have a
>> problem with the reboot being triggered, but with the fact that reboot
>> never completes due to hanging NFS mount (which is why the reboot
>> occurs, inaccessible primary storage).
>>
>> On Sun, Mar 2, 2014 at 5:26 PM, Marcus <sh...@gmail.com> wrote:
>>> Or do you mean you have multiple primary storages and this one was not
>>> in use and put into maintenance?
>>>
>>> On Sun, Mar 2, 2014 at 5:25 PM, Marcus <sh...@gmail.com> wrote:
>>>> I'm not sure I understand. How do you expect to reboot your primary
>>>> storage while vms are running? It sounds like the host is being
>>>> fenced since it cannot contact the resources it depends on.
>>>>
>>>> On Sun, Mar 2, 2014 at 3:24 PM, Nux! <nu...@li.nux.ro> wrote:
>>>>> On 02.03.2014 21:17, Andrei Mikhailovsky wrote:
>>>>>>
>>>>>> Hello guys,
>>>>>>
>>>>>>
>>>>>> I've recently came across the bug CLOUDSTACK-5429 which has rebooted
>>>>>> all of my host servers without properly shutting down the guest vms.
>>>>>> I've simply upgraded and rebooted one of the nfs primary storage
>>>>>> servers and a few minutes later, to my horror, i've found out that all
>>>>>> of my host servers have been rebooted. Is it just me thinking so, or
>>>>>> is this bug should be fixed ASAP and should be a blocker for any new
>>>>>> ACS release. I mean not only does it cause downtime, but also possible
>>>>>> data loss and server corruption.
>>>>>
>>>>>
>>>>> Hi Andrei,
>>>>>
>>>>> Do you have HA enabled and did you put that primary storage in maintenance
>>>>> mode before rebooting it?
>>>>> It's my understanding that ACS relies on the shared storage to perform HA so
>>>>> if the storage goes it's expected to go berserk. I've noticed similar
>>>>> behaviour in Xenserver pools without ACS.
>>>>> I'd imagine a "cure" for this would be to use network distributed
>>>>> "filesystems" like GlusterFS or CEPH.
>>>>>
>>>>> Lucian
>>>>>
>>>>> --
>>>>> Sent from the Delta quadrant using Borg technology!
>>>>>
>>>>> Nux!
>>>>> www.nux.ro
>
>

Re: ALARM - ACS reboots host servers!!!

Posted by Marcus <sh...@gmail.com>.

It's the standard clustering problem. Any software that does any sort
of avtive clustering is going to fence nodes that have problems, or
should if it cares about your data. If the risk of losing a host due
to a storage pool outage is too great, you could perhaps look at
rearranging your pool-to-host correlations (certain hosts run vms from
certain pools) via clusters. Note that if you register a storage pool
with a cluster, it will register the pool with libvirt when the pool
is not in maintenance, which, when the storage pool goes down will
cause problems for the host even if no VMs from that storage are
running (fetching storage stats for example will cause agent threads
to hang if its NFS), so you'd need to put ceph in its own cluster and
NFS in its own cluster.

It's far more dangerous to leave a host in an unknown/bad state. If a
host loses contact with one of your storage nodes, with HA, cloudstack
will want to start the affected VMs elsewhere. If it does so, and your
original host wakes up from it's NFS hang, you suddenly have a VM
running in two locations, corruption ensues. You might think we could
just stop the affected VMs, but NFS tends to make things that touch it
go into D state, even with 'intr' and other parameters, which affects
libvirt and the agent.

We could perhaps open a feature request to disable all HA and just
leave things as-is, disallowing operations when there are outages. If
that sounds useful you can create the feature request on
https://issues.apache.org/jira.

On Mon, Mar 3, 2014 at 5:37 AM, Andrei Mikhailovsky <an...@arhont.com> wrote:
>
> Koushik, I understand that and I will put the storage into the maintenance mode next time. However, things happen and servers crash from time to time, which is not the reason to reboot all host servers, even those which do not have any running vms with volumes on the nfs storage. The bloody agent just rebooted every single host server regardless if they were running vms with volumes on the rebooted nfs server. 95% of my vms are running from ceph and those should have never been effected in the first place.
> ----- Original Message -----
>
> From: "Koushik Das" <ko...@citrix.com>
> To: "<us...@cloudstack.apache.org>" <us...@cloudstack.apache.org>
> Cc: dev@cloudstack.apache.org
> Sent: Monday, 3 March, 2014 5:55:34 AM
> Subject: Re: ALARM - ACS reboots host servers!!!
>
> The primary storage needs to be put in maintenance before doing any upgrade/reboot as mentioned in the previous mails.
>
> -Koushik
>
> On 03-Mar-2014, at 6:07 AM, Marcus <sh...@gmail.com> wrote:
>
>> Also, please note that in the bug you referenced it doesn't have a
>> problem with the reboot being triggered, but with the fact that reboot
>> never completes due to hanging NFS mount (which is why the reboot
>> occurs, inaccessible primary storage).
>>
>> On Sun, Mar 2, 2014 at 5:26 PM, Marcus <sh...@gmail.com> wrote:
>>> Or do you mean you have multiple primary storages and this one was not
>>> in use and put into maintenance?
>>>
>>> On Sun, Mar 2, 2014 at 5:25 PM, Marcus <sh...@gmail.com> wrote:
>>>> I'm not sure I understand. How do you expect to reboot your primary
>>>> storage while vms are running? It sounds like the host is being
>>>> fenced since it cannot contact the resources it depends on.
>>>>
>>>> On Sun, Mar 2, 2014 at 3:24 PM, Nux! <nu...@li.nux.ro> wrote:
>>>>> On 02.03.2014 21:17, Andrei Mikhailovsky wrote:
>>>>>>
>>>>>> Hello guys,
>>>>>>
>>>>>>
>>>>>> I've recently came across the bug CLOUDSTACK-5429 which has rebooted
>>>>>> all of my host servers without properly shutting down the guest vms.
>>>>>> I've simply upgraded and rebooted one of the nfs primary storage
>>>>>> servers and a few minutes later, to my horror, i've found out that all
>>>>>> of my host servers have been rebooted. Is it just me thinking so, or
>>>>>> is this bug should be fixed ASAP and should be a blocker for any new
>>>>>> ACS release. I mean not only does it cause downtime, but also possible
>>>>>> data loss and server corruption.
>>>>>
>>>>>
>>>>> Hi Andrei,
>>>>>
>>>>> Do you have HA enabled and did you put that primary storage in maintenance
>>>>> mode before rebooting it?
>>>>> It's my understanding that ACS relies on the shared storage to perform HA so
>>>>> if the storage goes it's expected to go berserk. I've noticed similar
>>>>> behaviour in Xenserver pools without ACS.
>>>>> I'd imagine a "cure" for this would be to use network distributed
>>>>> "filesystems" like GlusterFS or CEPH.
>>>>>
>>>>> Lucian
>>>>>
>>>>> --
>>>>> Sent from the Delta quadrant using Borg technology!
>>>>>
>>>>> Nux!
>>>>> www.nux.ro
>
>

Re: ALARM - ACS reboots host servers!!!

Posted by Amin Samir <am...@hotmail.com>.

Hello,

This link addresses your issue.
https://issues.apache.org/jira/browse/CLOUDSTACK-3367

Amin

Sent from my iPad

> On Mar 3, 2014, at 1:56 PM, "Koushik Das" <ko...@citrix.com> wrote:
> 
> The primary storage needs to be put in maintenance before doing any upgrade/reboot as mentioned in the previous mails.
> 
> -Koushik
> 
>> On 03-Mar-2014, at 6:07 AM, Marcus <sh...@gmail.com> wrote:
>> 
>> Also, please note that in the bug you referenced it doesn't have a
>> problem with the reboot being triggered, but with the fact that reboot
>> never completes due to hanging NFS mount (which is why the reboot
>> occurs, inaccessible primary storage).
>> 
>>> On Sun, Mar 2, 2014 at 5:26 PM, Marcus <sh...@gmail.com> wrote:
>>> Or do you mean you have multiple primary storages and this one was not
>>> in use and put into maintenance?
>>> 
>>>> On Sun, Mar 2, 2014 at 5:25 PM, Marcus <sh...@gmail.com> wrote:
>>>> I'm not sure I understand. How do you expect to reboot your primary
>>>> storage while vms are running?  It sounds like the host is being
>>>> fenced since it cannot contact the resources it depends on.
>>>> 
>>>>> On Sun, Mar 2, 2014 at 3:24 PM, Nux! <nu...@li.nux.ro> wrote:
>>>>>> On 02.03.2014 21:17, Andrei Mikhailovsky wrote:
>>>>>> 
>>>>>> Hello guys,
>>>>>> 
>>>>>> 
>>>>>> I've recently came across the bug CLOUDSTACK-5429 which has rebooted
>>>>>> all of my host servers without properly shutting down the guest vms.
>>>>>> I've simply upgraded and rebooted one of the nfs primary storage
>>>>>> servers and a few minutes later, to my horror, i've found out that all
>>>>>> of my host servers have been rebooted. Is it just me thinking so, or
>>>>>> is this bug should be fixed ASAP and should be a blocker for any new
>>>>>> ACS release. I mean not only does it cause downtime, but also possible
>>>>>> data loss and server corruption.
>>>>> 
>>>>> 
>>>>> Hi Andrei,
>>>>> 
>>>>> Do you have HA enabled and did you put that primary storage in maintenance
>>>>> mode before rebooting it?
>>>>> It's my understanding that ACS relies on the shared storage to perform HA so
>>>>> if the storage goes it's expected to go berserk. I've noticed similar
>>>>> behaviour in Xenserver pools without ACS.
>>>>> I'd imagine a "cure" for this would be to use network distributed
>>>>> "filesystems" like GlusterFS or CEPH.
>>>>> 
>>>>> Lucian
>>>>> 
>>>>> --
>>>>> Sent from the Delta quadrant using Borg technology!
>>>>> 
>>>>> Nux!
>>>>> www.nux.ro
>

Re: ALARM - ACS reboots host servers!!!

Posted by Andrei Mikhailovsky <an...@arhont.com>.

Pretty poor, I agree. 


IMHO the ACS agent should not be allowed to reboot the host server. This is not the type of things you would like to automate as you will eventually end up with broken volumes and data loss. 


And you are right of course, like what happened in my case. I currently have two vms which used that NFS server for volumes and the rest 50+ vms use ceph. As a result of the nfs server reboot all host servers have rebooted causing 50+ vms to reset without being properly shutdown. 


I am using ACS 4.2.1 with KVM, so this issue seems to be present on KVM + XenServer. 


Andrei 
----- Original Message -----

From: "France" <ma...@isg.si> 
To: users@cloudstack.apache.org 
Cc: dev@cloudstack.apache.org 
Sent: Monday, 3 March, 2014 8:49:28 AM 
Subject: Re: ALARM - ACS reboots host servers!!! 

I believe this is a bug too, because VMs not running on the storage, get 
destroyed too: 

Issue has been around for a long time, like with all others I reported. 
They do not get fixed: 
https://issues.apache.org/jira/browse/CLOUDSTACK-3367 

We even lost assignee today. 

Regards, 
F. 

On 3/3/14 6:55 AM, Koushik Das wrote: 
> The primary storage needs to be put in maintenance before doing any upgrade/reboot as mentioned in the previous mails. 
> 
> -Koushik 
> 
> On 03-Mar-2014, at 6:07 AM, Marcus <sh...@gmail.com> wrote: 
> 
>> Also, please note that in the bug you referenced it doesn't have a 
>> problem with the reboot being triggered, but with the fact that reboot 
>> never completes due to hanging NFS mount (which is why the reboot 
>> occurs, inaccessible primary storage). 
>> 
>> On Sun, Mar 2, 2014 at 5:26 PM, Marcus <sh...@gmail.com> wrote: 
>>> Or do you mean you have multiple primary storages and this one was not 
>>> in use and put into maintenance? 
>>> 
>>> On Sun, Mar 2, 2014 at 5:25 PM, Marcus <sh...@gmail.com> wrote: 
>>>> I'm not sure I understand. How do you expect to reboot your primary 
>>>> storage while vms are running? It sounds like the host is being 
>>>> fenced since it cannot contact the resources it depends on. 
>>>> 
>>>> On Sun, Mar 2, 2014 at 3:24 PM, Nux! <nu...@li.nux.ro> wrote: 
>>>>> On 02.03.2014 21:17, Andrei Mikhailovsky wrote: 
>>>>>> Hello guys, 
>>>>>> 
>>>>>> 
>>>>>> I've recently came across the bug CLOUDSTACK-5429 which has rebooted 
>>>>>> all of my host servers without properly shutting down the guest vms. 
>>>>>> I've simply upgraded and rebooted one of the nfs primary storage 
>>>>>> servers and a few minutes later, to my horror, i've found out that all 
>>>>>> of my host servers have been rebooted. Is it just me thinking so, or 
>>>>>> is this bug should be fixed ASAP and should be a blocker for any new 
>>>>>> ACS release. I mean not only does it cause downtime, but also possible 
>>>>>> data loss and server corruption. 
>>>>> 
>>>>> Hi Andrei, 
>>>>> 
>>>>> Do you have HA enabled and did you put that primary storage in maintenance 
>>>>> mode before rebooting it? 
>>>>> It's my understanding that ACS relies on the shared storage to perform HA so 
>>>>> if the storage goes it's expected to go berserk. I've noticed similar 
>>>>> behaviour in Xenserver pools without ACS. 
>>>>> I'd imagine a "cure" for this would be to use network distributed 
>>>>> "filesystems" like GlusterFS or CEPH. 
>>>>> 
>>>>> Lucian 
>>>>> 
>>>>> -- 
>>>>> Sent from the Delta quadrant using Borg technology! 
>>>>> 
>>>>> Nux! 
>>>>> www.nux.ro

Re: ALARM - ACS reboots host servers!!!

Posted by Andrei Mikhailovsky <an...@arhont.com>.

Pretty poor, I agree. 


IMHO the ACS agent should not be allowed to reboot the host server. This is not the type of things you would like to automate as you will eventually end up with broken volumes and data loss. 


And you are right of course, like what happened in my case. I currently have two vms which used that NFS server for volumes and the rest 50+ vms use ceph. As a result of the nfs server reboot all host servers have rebooted causing 50+ vms to reset without being properly shutdown. 


I am using ACS 4.2.1 with KVM, so this issue seems to be present on KVM + XenServer. 


Andrei 
----- Original Message -----

From: "France" <ma...@isg.si> 
To: users@cloudstack.apache.org 
Cc: dev@cloudstack.apache.org 
Sent: Monday, 3 March, 2014 8:49:28 AM 
Subject: Re: ALARM - ACS reboots host servers!!! 

I believe this is a bug too, because VMs not running on the storage, get 
destroyed too: 

Issue has been around for a long time, like with all others I reported. 
They do not get fixed: 
https://issues.apache.org/jira/browse/CLOUDSTACK-3367 

We even lost assignee today. 

Regards, 
F. 

On 3/3/14 6:55 AM, Koushik Das wrote: 
> The primary storage needs to be put in maintenance before doing any upgrade/reboot as mentioned in the previous mails. 
> 
> -Koushik 
> 
> On 03-Mar-2014, at 6:07 AM, Marcus <sh...@gmail.com> wrote: 
> 
>> Also, please note that in the bug you referenced it doesn't have a 
>> problem with the reboot being triggered, but with the fact that reboot 
>> never completes due to hanging NFS mount (which is why the reboot 
>> occurs, inaccessible primary storage). 
>> 
>> On Sun, Mar 2, 2014 at 5:26 PM, Marcus <sh...@gmail.com> wrote: 
>>> Or do you mean you have multiple primary storages and this one was not 
>>> in use and put into maintenance? 
>>> 
>>> On Sun, Mar 2, 2014 at 5:25 PM, Marcus <sh...@gmail.com> wrote: 
>>>> I'm not sure I understand. How do you expect to reboot your primary 
>>>> storage while vms are running? It sounds like the host is being 
>>>> fenced since it cannot contact the resources it depends on. 
>>>> 
>>>> On Sun, Mar 2, 2014 at 3:24 PM, Nux! <nu...@li.nux.ro> wrote: 
>>>>> On 02.03.2014 21:17, Andrei Mikhailovsky wrote: 
>>>>>> Hello guys, 
>>>>>> 
>>>>>> 
>>>>>> I've recently came across the bug CLOUDSTACK-5429 which has rebooted 
>>>>>> all of my host servers without properly shutting down the guest vms. 
>>>>>> I've simply upgraded and rebooted one of the nfs primary storage 
>>>>>> servers and a few minutes later, to my horror, i've found out that all 
>>>>>> of my host servers have been rebooted. Is it just me thinking so, or 
>>>>>> is this bug should be fixed ASAP and should be a blocker for any new 
>>>>>> ACS release. I mean not only does it cause downtime, but also possible 
>>>>>> data loss and server corruption. 
>>>>> 
>>>>> Hi Andrei, 
>>>>> 
>>>>> Do you have HA enabled and did you put that primary storage in maintenance 
>>>>> mode before rebooting it? 
>>>>> It's my understanding that ACS relies on the shared storage to perform HA so 
>>>>> if the storage goes it's expected to go berserk. I've noticed similar 
>>>>> behaviour in Xenserver pools without ACS. 
>>>>> I'd imagine a "cure" for this would be to use network distributed 
>>>>> "filesystems" like GlusterFS or CEPH. 
>>>>> 
>>>>> Lucian 
>>>>> 
>>>>> -- 
>>>>> Sent from the Delta quadrant using Borg technology! 
>>>>> 
>>>>> Nux! 
>>>>> www.nux.ro

Re: ALARM - ACS reboots host servers!!!

Posted by France <ma...@isg.si>.

I think this problem might only exist on KVM.
Can any1 with primary NFS test it on XenServer?


On 3/4/14 9:48 PM, Andrei Mikhailovsky wrote:
> +1
>
>
> ----- Original Message -----
> From: "Alex Huang" <Al...@citrix.com>
> To: dev@cloudstack.apache.org
> Sent: Thursday, 3 April, 2014 6:47:22 PM
> Subject: RE: ALARM - ACS reboots host servers!!!
>
> This is a severe bug if that's the case.  It's supposed to stop the heartbeat script when a primary storage is placed in maintenance.
>
> --Alex
>
>> -----Original Message-----
>> From: France [mailto:mailinglists@isg.si]
>> Sent: Thursday, April 3, 2014 1:06 AM
>> To: dev@cloudstack.apache.org
>> Subject: Re: ALARM - ACS reboots host servers!!!
>>
>> I'm also interested in this issue.
>> Can any1 from developers confirm this is expected behavior?
>>
>> On 2/4/14 2:32 PM, Andrei Mikhailovsky wrote:
>>> Coming back to this issue.
>>>
>>> This time to perform the maintenance of the nfs primary storage I've
>> plated the storage in question in the Maintenance mode. After about 20
>> minutes ACS showed the nfs storage is in Maintenance. However, none of
>> the virtual machines with volumes on that storage were stopped. I've
>> manually stopped the virtual machines and went to upgrade and restart the
>> nfs server.
>>> A few minutes after the nfs server shutdown all of my host servers went
>> into reboot killing all vms!
>>> Thus, it seems that putting nfs server in Maintenance mode does not stop
>> ACS agent from restarting the host servers.
>>> Does anyone know a way to stop this behaviour?
>>>
>>> Thanks
>>>
>>> Andrei
>>>
>>>
>>> ----- Original Message -----
>>> From: "France" <ma...@isg.si>
>>> To: users@cloudstack.apache.org
>>> Cc: dev@cloudstack.apache.org
>>> Sent: Monday, 3 March, 2014 9:49:28 AM
>>> Subject: Re: ALARM - ACS reboots host servers!!!
>>>
>>> I believe this is a bug too, because VMs not running on the storage,
>>> get destroyed too:
>>>
>>> Issue has been around for a long time, like with all others I reported.
>>> They do not get fixed:
>>> https://issues.apache.org/jira/browse/CLOUDSTACK-3367
>>>
>>> We even lost assignee today.
>>>
>>> Regards,
>>> F.
>>>
>>> On 3/3/14 6:55 AM, Koushik Das wrote:
>>>> The primary storage needs to be put in maintenance before doing any
>> upgrade/reboot as mentioned in the previous mails.
>>>> -Koushik
>>>>
>>>> On 03-Mar-2014, at 6:07 AM, Marcus <sh...@gmail.com> wrote:
>>>>
>>>>> Also, please note that in the bug you referenced it doesn't have a
>>>>> problem with the reboot being triggered, but with the fact that
>>>>> reboot never completes due to hanging NFS mount (which is why the
>>>>> reboot occurs, inaccessible primary storage).
>>>>>
>>>>> On Sun, Mar 2, 2014 at 5:26 PM, Marcus <sh...@gmail.com> wrote:
>>>>>> Or do you mean you have multiple primary storages and this one was
>>>>>> not in use and put into maintenance?
>>>>>>
>>>>>> On Sun, Mar 2, 2014 at 5:25 PM, Marcus <sh...@gmail.com>
>> wrote:
>>>>>>> I'm not sure I understand. How do you expect to reboot your
>>>>>>> primary storage while vms are running?  It sounds like the host is
>>>>>>> being fenced since it cannot contact the resources it depends on.
>>>>>>>
>>>>>>> On Sun, Mar 2, 2014 at 3:24 PM, Nux! <nu...@li.nux.ro> wrote:
>>>>>>>> On 02.03.2014 21:17, Andrei Mikhailovsky wrote:
>>>>>>>>> Hello guys,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I've recently came across the bug CLOUDSTACK-5429 which has
>>>>>>>>> rebooted all of my host servers without properly shutting down the
>> guest vms.
>>>>>>>>> I've simply upgraded and rebooted one of the nfs primary storage
>>>>>>>>> servers and a few minutes later, to my horror, i've found out
>>>>>>>>> that all of my host servers have been rebooted. Is it just me
>>>>>>>>> thinking so, or is this bug should be fixed ASAP and should be a
>>>>>>>>> blocker for any new ACS release. I mean not only does it cause
>>>>>>>>> downtime, but also possible data loss and server corruption.
>>>>>>>> Hi Andrei,
>>>>>>>>
>>>>>>>> Do you have HA enabled and did you put that primary storage in
>>>>>>>> maintenance mode before rebooting it?
>>>>>>>> It's my understanding that ACS relies on the shared storage to
>>>>>>>> perform HA so if the storage goes it's expected to go berserk.
>>>>>>>> I've noticed similar behaviour in Xenserver pools without ACS.
>>>>>>>> I'd imagine a "cure" for this would be to use network distributed
>>>>>>>> "filesystems" like GlusterFS or CEPH.
>>>>>>>>
>>>>>>>> Lucian
>>>>>>>>
>>>>>>>> --
>>>>>>>> Sent from the Delta quadrant using Borg technology!
>>>>>>>>
>>>>>>>> Nux!
>>>>>>>> www.nux.ro

Re: ALARM - ACS reboots host servers!!!

Posted by Andrei Mikhailovsky <an...@arhont.com>.

+1


----- Original Message -----
From: "Alex Huang" <Al...@citrix.com>
To: dev@cloudstack.apache.org
Sent: Thursday, 3 April, 2014 6:47:22 PM
Subject: RE: ALARM - ACS reboots host servers!!!

This is a severe bug if that's the case.  It's supposed to stop the heartbeat script when a primary storage is placed in maintenance.

--Alex

> -----Original Message-----
> From: France [mailto:mailinglists@isg.si]
> Sent: Thursday, April 3, 2014 1:06 AM
> To: dev@cloudstack.apache.org
> Subject: Re: ALARM - ACS reboots host servers!!!
> 
> I'm also interested in this issue.
> Can any1 from developers confirm this is expected behavior?
> 
> On 2/4/14 2:32 PM, Andrei Mikhailovsky wrote:
> > Coming back to this issue.
> >
> > This time to perform the maintenance of the nfs primary storage I've
> plated the storage in question in the Maintenance mode. After about 20
> minutes ACS showed the nfs storage is in Maintenance. However, none of
> the virtual machines with volumes on that storage were stopped. I've
> manually stopped the virtual machines and went to upgrade and restart the
> nfs server.
> >
> > A few minutes after the nfs server shutdown all of my host servers went
> into reboot killing all vms!
> >
> > Thus, it seems that putting nfs server in Maintenance mode does not stop
> ACS agent from restarting the host servers.
> >
> > Does anyone know a way to stop this behaviour?
> >
> > Thanks
> >
> > Andrei
> >
> >
> > ----- Original Message -----
> > From: "France" <ma...@isg.si>
> > To: users@cloudstack.apache.org
> > Cc: dev@cloudstack.apache.org
> > Sent: Monday, 3 March, 2014 9:49:28 AM
> > Subject: Re: ALARM - ACS reboots host servers!!!
> >
> > I believe this is a bug too, because VMs not running on the storage,
> > get destroyed too:
> >
> > Issue has been around for a long time, like with all others I reported.
> > They do not get fixed:
> > https://issues.apache.org/jira/browse/CLOUDSTACK-3367
> >
> > We even lost assignee today.
> >
> > Regards,
> > F.
> >
> > On 3/3/14 6:55 AM, Koushik Das wrote:
> >> The primary storage needs to be put in maintenance before doing any
> upgrade/reboot as mentioned in the previous mails.
> >>
> >> -Koushik
> >>
> >> On 03-Mar-2014, at 6:07 AM, Marcus <sh...@gmail.com> wrote:
> >>
> >>> Also, please note that in the bug you referenced it doesn't have a
> >>> problem with the reboot being triggered, but with the fact that
> >>> reboot never completes due to hanging NFS mount (which is why the
> >>> reboot occurs, inaccessible primary storage).
> >>>
> >>> On Sun, Mar 2, 2014 at 5:26 PM, Marcus <sh...@gmail.com> wrote:
> >>>> Or do you mean you have multiple primary storages and this one was
> >>>> not in use and put into maintenance?
> >>>>
> >>>> On Sun, Mar 2, 2014 at 5:25 PM, Marcus <sh...@gmail.com>
> wrote:
> >>>>> I'm not sure I understand. How do you expect to reboot your
> >>>>> primary storage while vms are running?  It sounds like the host is
> >>>>> being fenced since it cannot contact the resources it depends on.
> >>>>>
> >>>>> On Sun, Mar 2, 2014 at 3:24 PM, Nux! <nu...@li.nux.ro> wrote:
> >>>>>> On 02.03.2014 21:17, Andrei Mikhailovsky wrote:
> >>>>>>> Hello guys,
> >>>>>>>
> >>>>>>>
> >>>>>>> I've recently came across the bug CLOUDSTACK-5429 which has
> >>>>>>> rebooted all of my host servers without properly shutting down the
> guest vms.
> >>>>>>> I've simply upgraded and rebooted one of the nfs primary storage
> >>>>>>> servers and a few minutes later, to my horror, i've found out
> >>>>>>> that all of my host servers have been rebooted. Is it just me
> >>>>>>> thinking so, or is this bug should be fixed ASAP and should be a
> >>>>>>> blocker for any new ACS release. I mean not only does it cause
> >>>>>>> downtime, but also possible data loss and server corruption.
> >>>>>> Hi Andrei,
> >>>>>>
> >>>>>> Do you have HA enabled and did you put that primary storage in
> >>>>>> maintenance mode before rebooting it?
> >>>>>> It's my understanding that ACS relies on the shared storage to
> >>>>>> perform HA so if the storage goes it's expected to go berserk.
> >>>>>> I've noticed similar behaviour in Xenserver pools without ACS.
> >>>>>> I'd imagine a "cure" for this would be to use network distributed
> >>>>>> "filesystems" like GlusterFS or CEPH.
> >>>>>>
> >>>>>> Lucian
> >>>>>>
> >>>>>> --
> >>>>>> Sent from the Delta quadrant using Borg technology!
> >>>>>>
> >>>>>> Nux!
> >>>>>> www.nux.ro

RE: ALARM - ACS reboots host servers!!!

Posted by Alex Huang <Al...@citrix.com>.

This is a severe bug if that's the case.  It's supposed to stop the heartbeat script when a primary storage is placed in maintenance.

--Alex

> -----Original Message-----
> From: France [mailto:mailinglists@isg.si]
> Sent: Thursday, April 3, 2014 1:06 AM
> To: dev@cloudstack.apache.org
> Subject: Re: ALARM - ACS reboots host servers!!!
> 
> I'm also interested in this issue.
> Can any1 from developers confirm this is expected behavior?
> 
> On 2/4/14 2:32 PM, Andrei Mikhailovsky wrote:
> > Coming back to this issue.
> >
> > This time to perform the maintenance of the nfs primary storage I've
> plated the storage in question in the Maintenance mode. After about 20
> minutes ACS showed the nfs storage is in Maintenance. However, none of
> the virtual machines with volumes on that storage were stopped. I've
> manually stopped the virtual machines and went to upgrade and restart the
> nfs server.
> >
> > A few minutes after the nfs server shutdown all of my host servers went
> into reboot killing all vms!
> >
> > Thus, it seems that putting nfs server in Maintenance mode does not stop
> ACS agent from restarting the host servers.
> >
> > Does anyone know a way to stop this behaviour?
> >
> > Thanks
> >
> > Andrei
> >
> >
> > ----- Original Message -----
> > From: "France" <ma...@isg.si>
> > To: users@cloudstack.apache.org
> > Cc: dev@cloudstack.apache.org
> > Sent: Monday, 3 March, 2014 9:49:28 AM
> > Subject: Re: ALARM - ACS reboots host servers!!!
> >
> > I believe this is a bug too, because VMs not running on the storage,
> > get destroyed too:
> >
> > Issue has been around for a long time, like with all others I reported.
> > They do not get fixed:
> > https://issues.apache.org/jira/browse/CLOUDSTACK-3367
> >
> > We even lost assignee today.
> >
> > Regards,
> > F.
> >
> > On 3/3/14 6:55 AM, Koushik Das wrote:
> >> The primary storage needs to be put in maintenance before doing any
> upgrade/reboot as mentioned in the previous mails.
> >>
> >> -Koushik
> >>
> >> On 03-Mar-2014, at 6:07 AM, Marcus <sh...@gmail.com> wrote:
> >>
> >>> Also, please note that in the bug you referenced it doesn't have a
> >>> problem with the reboot being triggered, but with the fact that
> >>> reboot never completes due to hanging NFS mount (which is why the
> >>> reboot occurs, inaccessible primary storage).
> >>>
> >>> On Sun, Mar 2, 2014 at 5:26 PM, Marcus <sh...@gmail.com> wrote:
> >>>> Or do you mean you have multiple primary storages and this one was
> >>>> not in use and put into maintenance?
> >>>>
> >>>> On Sun, Mar 2, 2014 at 5:25 PM, Marcus <sh...@gmail.com>
> wrote:
> >>>>> I'm not sure I understand. How do you expect to reboot your
> >>>>> primary storage while vms are running?  It sounds like the host is
> >>>>> being fenced since it cannot contact the resources it depends on.
> >>>>>
> >>>>> On Sun, Mar 2, 2014 at 3:24 PM, Nux! <nu...@li.nux.ro> wrote:
> >>>>>> On 02.03.2014 21:17, Andrei Mikhailovsky wrote:
> >>>>>>> Hello guys,
> >>>>>>>
> >>>>>>>
> >>>>>>> I've recently came across the bug CLOUDSTACK-5429 which has
> >>>>>>> rebooted all of my host servers without properly shutting down the
> guest vms.
> >>>>>>> I've simply upgraded and rebooted one of the nfs primary storage
> >>>>>>> servers and a few minutes later, to my horror, i've found out
> >>>>>>> that all of my host servers have been rebooted. Is it just me
> >>>>>>> thinking so, or is this bug should be fixed ASAP and should be a
> >>>>>>> blocker for any new ACS release. I mean not only does it cause
> >>>>>>> downtime, but also possible data loss and server corruption.
> >>>>>> Hi Andrei,
> >>>>>>
> >>>>>> Do you have HA enabled and did you put that primary storage in
> >>>>>> maintenance mode before rebooting it?
> >>>>>> It's my understanding that ACS relies on the shared storage to
> >>>>>> perform HA so if the storage goes it's expected to go berserk.
> >>>>>> I've noticed similar behaviour in Xenserver pools without ACS.
> >>>>>> I'd imagine a "cure" for this would be to use network distributed
> >>>>>> "filesystems" like GlusterFS or CEPH.
> >>>>>>
> >>>>>> Lucian
> >>>>>>
> >>>>>> --
> >>>>>> Sent from the Delta quadrant using Borg technology!
> >>>>>>
> >>>>>> Nux!
> >>>>>> www.nux.ro

Re: ALARM - ACS reboots host servers!!!

Posted by Andrei Mikhailovsky <an...@arhont.com>.

I am on KVM.  thanks

----- Original Message -----
From: "France" <ma...@isg.si>
To: dev@cloudstack.apache.org
Sent: Thursday, 3 April, 2014 2:34:53 PM
Subject: Re: ALARM - ACS reboots host servers!!!

Andrei,

is your hypervisor KVM?
I'm using XenServer.

Re: ALARM - ACS reboots host servers!!!

Posted by France <ma...@isg.si>.

Andrei,

is your hypervisor KVM?
I'm using XenServer.

Re: ALARM - ACS reboots host servers!!!

Posted by Wido den Hollander <wi...@widodh.nl>.


On 04/03/2014 10:06 AM, France wrote:
> I'm also interested in this issue.
> Can any1 from developers confirm this is expected behavior?
>

Yes, this still happens due to the kvmheartbeat.sh script which runs.

On some clusters I disabled this by simply overwriting that script with 
a version where "reboot" is removed.

I have some ideas on how to fix this, but I don't have the time at the 
moment.

Short version: The hosts shouldn't reboot themselves as long as they can 
reach other nodes or it should at least be configurable.

The management server should also do further inspection during HA by 
using a helper on the KVM Agent.

Wido

> On 2/4/14 2:32 PM, Andrei Mikhailovsky wrote:
>> Coming back to this issue.
>>
>> This time to perform the maintenance of the nfs primary storage I've
>> plated the storage in question in the Maintenance mode. After about 20
>> minutes ACS showed the nfs storage is in Maintenance. However, none of
>> the virtual machines with volumes on that storage were stopped. I've
>> manually stopped the virtual machines and went to upgrade and restart
>> the nfs server.
>>
>> A few minutes after the nfs server shutdown all of my host servers
>> went into reboot killing all vms!
>>
>> Thus, it seems that putting nfs server in Maintenance mode does not
>> stop ACS agent from restarting the host servers.
>>
>> Does anyone know a way to stop this behaviour?
>>
>> Thanks
>>
>> Andrei
>>
>>
>> ----- Original Message -----
>> From: "France" <ma...@isg.si>
>> To: users@cloudstack.apache.org
>> Cc: dev@cloudstack.apache.org
>> Sent: Monday, 3 March, 2014 9:49:28 AM
>> Subject: Re: ALARM - ACS reboots host servers!!!
>>
>> I believe this is a bug too, because VMs not running on the storage, get
>> destroyed too:
>>
>> Issue has been around for a long time, like with all others I reported.
>> They do not get fixed:
>> https://issues.apache.org/jira/browse/CLOUDSTACK-3367
>>
>> We even lost assignee today.
>>
>> Regards,
>> F.
>>
>> On 3/3/14 6:55 AM, Koushik Das wrote:
>>> The primary storage needs to be put in maintenance before doing any
>>> upgrade/reboot as mentioned in the previous mails.
>>>
>>> -Koushik
>>>
>>> On 03-Mar-2014, at 6:07 AM, Marcus <sh...@gmail.com> wrote:
>>>
>>>> Also, please note that in the bug you referenced it doesn't have a
>>>> problem with the reboot being triggered, but with the fact that reboot
>>>> never completes due to hanging NFS mount (which is why the reboot
>>>> occurs, inaccessible primary storage).
>>>>
>>>> On Sun, Mar 2, 2014 at 5:26 PM, Marcus <sh...@gmail.com> wrote:
>>>>> Or do you mean you have multiple primary storages and this one was not
>>>>> in use and put into maintenance?
>>>>>
>>>>> On Sun, Mar 2, 2014 at 5:25 PM, Marcus <sh...@gmail.com> wrote:
>>>>>> I'm not sure I understand. How do you expect to reboot your primary
>>>>>> storage while vms are running?  It sounds like the host is being
>>>>>> fenced since it cannot contact the resources it depends on.
>>>>>>
>>>>>> On Sun, Mar 2, 2014 at 3:24 PM, Nux! <nu...@li.nux.ro> wrote:
>>>>>>> On 02.03.2014 21:17, Andrei Mikhailovsky wrote:
>>>>>>>> Hello guys,
>>>>>>>>
>>>>>>>>
>>>>>>>> I've recently came across the bug CLOUDSTACK-5429 which has
>>>>>>>> rebooted
>>>>>>>> all of my host servers without properly shutting down the guest
>>>>>>>> vms.
>>>>>>>> I've simply upgraded and rebooted one of the nfs primary storage
>>>>>>>> servers and a few minutes later, to my horror, i've found out
>>>>>>>> that all
>>>>>>>> of my host servers have been rebooted. Is it just me thinking
>>>>>>>> so, or
>>>>>>>> is this bug should be fixed ASAP and should be a blocker for any
>>>>>>>> new
>>>>>>>> ACS release. I mean not only does it cause downtime, but also
>>>>>>>> possible
>>>>>>>> data loss and server corruption.
>>>>>>> Hi Andrei,
>>>>>>>
>>>>>>> Do you have HA enabled and did you put that primary storage in
>>>>>>> maintenance
>>>>>>> mode before rebooting it?
>>>>>>> It's my understanding that ACS relies on the shared storage to
>>>>>>> perform HA so
>>>>>>> if the storage goes it's expected to go berserk. I've noticed
>>>>>>> similar
>>>>>>> behaviour in Xenserver pools without ACS.
>>>>>>> I'd imagine a "cure" for this would be to use network distributed
>>>>>>> "filesystems" like GlusterFS or CEPH.
>>>>>>>
>>>>>>> Lucian
>>>>>>>
>>>>>>> --
>>>>>>> Sent from the Delta quadrant using Borg technology!
>>>>>>>
>>>>>>> Nux!
>>>>>>> www.nux.ro
>

Re: ALARM - ACS reboots host servers!!!

Posted by France <ma...@isg.si>.

I'm also interested in this issue.
Can any1 from developers confirm this is expected behavior?

On 2/4/14 2:32 PM, Andrei Mikhailovsky wrote:
> Coming back to this issue.
>
> This time to perform the maintenance of the nfs primary storage I've plated the storage in question in the Maintenance mode. After about 20 minutes ACS showed the nfs storage is in Maintenance. However, none of the virtual machines with volumes on that storage were stopped. I've manually stopped the virtual machines and went to upgrade and restart the nfs server.
>
> A few minutes after the nfs server shutdown all of my host servers went into reboot killing all vms!
>
> Thus, it seems that putting nfs server in Maintenance mode does not stop ACS agent from restarting the host servers.
>
> Does anyone know a way to stop this behaviour?
>
> Thanks
>
> Andrei
>
>
> ----- Original Message -----
> From: "France" <ma...@isg.si>
> To: users@cloudstack.apache.org
> Cc: dev@cloudstack.apache.org
> Sent: Monday, 3 March, 2014 9:49:28 AM
> Subject: Re: ALARM - ACS reboots host servers!!!
>
> I believe this is a bug too, because VMs not running on the storage, get
> destroyed too:
>
> Issue has been around for a long time, like with all others I reported.
> They do not get fixed:
> https://issues.apache.org/jira/browse/CLOUDSTACK-3367
>
> We even lost assignee today.
>
> Regards,
> F.
>
> On 3/3/14 6:55 AM, Koushik Das wrote:
>> The primary storage needs to be put in maintenance before doing any upgrade/reboot as mentioned in the previous mails.
>>
>> -Koushik
>>
>> On 03-Mar-2014, at 6:07 AM, Marcus <sh...@gmail.com> wrote:
>>
>>> Also, please note that in the bug you referenced it doesn't have a
>>> problem with the reboot being triggered, but with the fact that reboot
>>> never completes due to hanging NFS mount (which is why the reboot
>>> occurs, inaccessible primary storage).
>>>
>>> On Sun, Mar 2, 2014 at 5:26 PM, Marcus <sh...@gmail.com> wrote:
>>>> Or do you mean you have multiple primary storages and this one was not
>>>> in use and put into maintenance?
>>>>
>>>> On Sun, Mar 2, 2014 at 5:25 PM, Marcus <sh...@gmail.com> wrote:
>>>>> I'm not sure I understand. How do you expect to reboot your primary
>>>>> storage while vms are running?  It sounds like the host is being
>>>>> fenced since it cannot contact the resources it depends on.
>>>>>
>>>>> On Sun, Mar 2, 2014 at 3:24 PM, Nux! <nu...@li.nux.ro> wrote:
>>>>>> On 02.03.2014 21:17, Andrei Mikhailovsky wrote:
>>>>>>> Hello guys,
>>>>>>>
>>>>>>>
>>>>>>> I've recently came across the bug CLOUDSTACK-5429 which has rebooted
>>>>>>> all of my host servers without properly shutting down the guest vms.
>>>>>>> I've simply upgraded and rebooted one of the nfs primary storage
>>>>>>> servers and a few minutes later, to my horror, i've found out that all
>>>>>>> of my host servers have been rebooted. Is it just me thinking so, or
>>>>>>> is this bug should be fixed ASAP and should be a blocker for any new
>>>>>>> ACS release. I mean not only does it cause downtime, but also possible
>>>>>>> data loss and server corruption.
>>>>>> Hi Andrei,
>>>>>>
>>>>>> Do you have HA enabled and did you put that primary storage in maintenance
>>>>>> mode before rebooting it?
>>>>>> It's my understanding that ACS relies on the shared storage to perform HA so
>>>>>> if the storage goes it's expected to go berserk. I've noticed similar
>>>>>> behaviour in Xenserver pools without ACS.
>>>>>> I'd imagine a "cure" for this would be to use network distributed
>>>>>> "filesystems" like GlusterFS or CEPH.
>>>>>>
>>>>>> Lucian
>>>>>>
>>>>>> --
>>>>>> Sent from the Delta quadrant using Borg technology!
>>>>>>
>>>>>> Nux!
>>>>>> www.nux.ro

Re: ALARM - ACS reboots host servers!!!

Posted by Andrei Mikhailovsky <an...@arhont.com>.

Coming back to this issue.

This time to perform the maintenance of the nfs primary storage I've plated the storage in question in the Maintenance mode. After about 20 minutes ACS showed the nfs storage is in Maintenance. However, none of the virtual machines with volumes on that storage were stopped. I've manually stopped the virtual machines and went to upgrade and restart the nfs server.

A few minutes after the nfs server shutdown all of my host servers went into reboot killing all vms!

Thus, it seems that putting nfs server in Maintenance mode does not stop ACS agent from restarting the host servers.

Does anyone know a way to stop this behaviour? 

Thanks

Andrei


----- Original Message -----
From: "France" <ma...@isg.si>
To: users@cloudstack.apache.org
Cc: dev@cloudstack.apache.org
Sent: Monday, 3 March, 2014 9:49:28 AM
Subject: Re: ALARM - ACS reboots host servers!!!

I believe this is a bug too, because VMs not running on the storage, get 
destroyed too:

Issue has been around for a long time, like with all others I reported. 
They do not get fixed:
https://issues.apache.org/jira/browse/CLOUDSTACK-3367

We even lost assignee today.

Regards,
F.

On 3/3/14 6:55 AM, Koushik Das wrote:
> The primary storage needs to be put in maintenance before doing any upgrade/reboot as mentioned in the previous mails.
>
> -Koushik
>
> On 03-Mar-2014, at 6:07 AM, Marcus <sh...@gmail.com> wrote:
>
>> Also, please note that in the bug you referenced it doesn't have a
>> problem with the reboot being triggered, but with the fact that reboot
>> never completes due to hanging NFS mount (which is why the reboot
>> occurs, inaccessible primary storage).
>>
>> On Sun, Mar 2, 2014 at 5:26 PM, Marcus <sh...@gmail.com> wrote:
>>> Or do you mean you have multiple primary storages and this one was not
>>> in use and put into maintenance?
>>>
>>> On Sun, Mar 2, 2014 at 5:25 PM, Marcus <sh...@gmail.com> wrote:
>>>> I'm not sure I understand. How do you expect to reboot your primary
>>>> storage while vms are running?  It sounds like the host is being
>>>> fenced since it cannot contact the resources it depends on.
>>>>
>>>> On Sun, Mar 2, 2014 at 3:24 PM, Nux! <nu...@li.nux.ro> wrote:
>>>>> On 02.03.2014 21:17, Andrei Mikhailovsky wrote:
>>>>>> Hello guys,
>>>>>>
>>>>>>
>>>>>> I've recently came across the bug CLOUDSTACK-5429 which has rebooted
>>>>>> all of my host servers without properly shutting down the guest vms.
>>>>>> I've simply upgraded and rebooted one of the nfs primary storage
>>>>>> servers and a few minutes later, to my horror, i've found out that all
>>>>>> of my host servers have been rebooted. Is it just me thinking so, or
>>>>>> is this bug should be fixed ASAP and should be a blocker for any new
>>>>>> ACS release. I mean not only does it cause downtime, but also possible
>>>>>> data loss and server corruption.
>>>>>
>>>>> Hi Andrei,
>>>>>
>>>>> Do you have HA enabled and did you put that primary storage in maintenance
>>>>> mode before rebooting it?
>>>>> It's my understanding that ACS relies on the shared storage to perform HA so
>>>>> if the storage goes it's expected to go berserk. I've noticed similar
>>>>> behaviour in Xenserver pools without ACS.
>>>>> I'd imagine a "cure" for this would be to use network distributed
>>>>> "filesystems" like GlusterFS or CEPH.
>>>>>
>>>>> Lucian
>>>>>
>>>>> --
>>>>> Sent from the Delta quadrant using Borg technology!
>>>>>
>>>>> Nux!
>>>>> www.nux.ro

Re: ALARM - ACS reboots host servers!!!

Posted by France <ma...@isg.si>.

I believe this is a bug too, because VMs not running on the storage, get 
destroyed too:

Issue has been around for a long time, like with all others I reported. 
They do not get fixed:
https://issues.apache.org/jira/browse/CLOUDSTACK-3367

We even lost assignee today.

Regards,
F.

On 3/3/14 6:55 AM, Koushik Das wrote:
> The primary storage needs to be put in maintenance before doing any upgrade/reboot as mentioned in the previous mails.
>
> -Koushik
>
> On 03-Mar-2014, at 6:07 AM, Marcus <sh...@gmail.com> wrote:
>
>> Also, please note that in the bug you referenced it doesn't have a
>> problem with the reboot being triggered, but with the fact that reboot
>> never completes due to hanging NFS mount (which is why the reboot
>> occurs, inaccessible primary storage).
>>
>> On Sun, Mar 2, 2014 at 5:26 PM, Marcus <sh...@gmail.com> wrote:
>>> Or do you mean you have multiple primary storages and this one was not
>>> in use and put into maintenance?
>>>
>>> On Sun, Mar 2, 2014 at 5:25 PM, Marcus <sh...@gmail.com> wrote:
>>>> I'm not sure I understand. How do you expect to reboot your primary
>>>> storage while vms are running?  It sounds like the host is being
>>>> fenced since it cannot contact the resources it depends on.
>>>>
>>>> On Sun, Mar 2, 2014 at 3:24 PM, Nux! <nu...@li.nux.ro> wrote:
>>>>> On 02.03.2014 21:17, Andrei Mikhailovsky wrote:
>>>>>> Hello guys,
>>>>>>
>>>>>>
>>>>>> I've recently came across the bug CLOUDSTACK-5429 which has rebooted
>>>>>> all of my host servers without properly shutting down the guest vms.
>>>>>> I've simply upgraded and rebooted one of the nfs primary storage
>>>>>> servers and a few minutes later, to my horror, i've found out that all
>>>>>> of my host servers have been rebooted. Is it just me thinking so, or
>>>>>> is this bug should be fixed ASAP and should be a blocker for any new
>>>>>> ACS release. I mean not only does it cause downtime, but also possible
>>>>>> data loss and server corruption.
>>>>>
>>>>> Hi Andrei,
>>>>>
>>>>> Do you have HA enabled and did you put that primary storage in maintenance
>>>>> mode before rebooting it?
>>>>> It's my understanding that ACS relies on the shared storage to perform HA so
>>>>> if the storage goes it's expected to go berserk. I've noticed similar
>>>>> behaviour in Xenserver pools without ACS.
>>>>> I'd imagine a "cure" for this would be to use network distributed
>>>>> "filesystems" like GlusterFS or CEPH.
>>>>>
>>>>> Lucian
>>>>>
>>>>> --
>>>>> Sent from the Delta quadrant using Borg technology!
>>>>>
>>>>> Nux!
>>>>> www.nux.ro

Re: ALARM - ACS reboots host servers!!!

Posted by France <ma...@isg.si>.

I believe this is a bug too, because VMs not running on the storage, get 
destroyed too:

Issue has been around for a long time, like with all others I reported. 
They do not get fixed:
https://issues.apache.org/jira/browse/CLOUDSTACK-3367

We even lost assignee today.

Regards,
F.

On 3/3/14 6:55 AM, Koushik Das wrote:
> The primary storage needs to be put in maintenance before doing any upgrade/reboot as mentioned in the previous mails.
>
> -Koushik
>
> On 03-Mar-2014, at 6:07 AM, Marcus <sh...@gmail.com> wrote:
>
>> Also, please note that in the bug you referenced it doesn't have a
>> problem with the reboot being triggered, but with the fact that reboot
>> never completes due to hanging NFS mount (which is why the reboot
>> occurs, inaccessible primary storage).
>>
>> On Sun, Mar 2, 2014 at 5:26 PM, Marcus <sh...@gmail.com> wrote:
>>> Or do you mean you have multiple primary storages and this one was not
>>> in use and put into maintenance?
>>>
>>> On Sun, Mar 2, 2014 at 5:25 PM, Marcus <sh...@gmail.com> wrote:
>>>> I'm not sure I understand. How do you expect to reboot your primary
>>>> storage while vms are running?  It sounds like the host is being
>>>> fenced since it cannot contact the resources it depends on.
>>>>
>>>> On Sun, Mar 2, 2014 at 3:24 PM, Nux! <nu...@li.nux.ro> wrote:
>>>>> On 02.03.2014 21:17, Andrei Mikhailovsky wrote:
>>>>>> Hello guys,
>>>>>>
>>>>>>
>>>>>> I've recently came across the bug CLOUDSTACK-5429 which has rebooted
>>>>>> all of my host servers without properly shutting down the guest vms.
>>>>>> I've simply upgraded and rebooted one of the nfs primary storage
>>>>>> servers and a few minutes later, to my horror, i've found out that all
>>>>>> of my host servers have been rebooted. Is it just me thinking so, or
>>>>>> is this bug should be fixed ASAP and should be a blocker for any new
>>>>>> ACS release. I mean not only does it cause downtime, but also possible
>>>>>> data loss and server corruption.
>>>>>
>>>>> Hi Andrei,
>>>>>
>>>>> Do you have HA enabled and did you put that primary storage in maintenance
>>>>> mode before rebooting it?
>>>>> It's my understanding that ACS relies on the shared storage to perform HA so
>>>>> if the storage goes it's expected to go berserk. I've noticed similar
>>>>> behaviour in Xenserver pools without ACS.
>>>>> I'd imagine a "cure" for this would be to use network distributed
>>>>> "filesystems" like GlusterFS or CEPH.
>>>>>
>>>>> Lucian
>>>>>
>>>>> --
>>>>> Sent from the Delta quadrant using Borg technology!
>>>>>
>>>>> Nux!
>>>>> www.nux.ro

Re: ALARM - ACS reboots host servers!!!

Posted by Nux! <nu...@li.nux.ro>.

On 03.03.2014 12:37, Andrei Mikhailovsky wrote:
> Koushik, I understand that and I will put the storage into the
> maintenance mode next time. However, things happen and servers crash
> from time to time, which is not the reason to reboot all host servers,
> even those which do not have any running vms with volumes on the nfs
> storage. The bloody agent just rebooted every single host server
> regardless if they were running vms with volumes on the rebooted nfs
> server. 95% of my vms are running from ceph and those should have
> never been effected in the first place.

It sounds like ACS need to become more aware of multiple primary 
storages..

-- 
Sent from the Delta quadrant using Borg technology!

Nux!
www.nux.ro

Re: ALARM - ACS reboots host servers!!!

Posted by Andrei Mikhailovsky <an...@arhont.com>.

Koushik, I understand that and I will put the storage into the maintenance mode next time. However, things happen and servers crash from time to time, which is not the reason to reboot all host servers, even those which do not have any running vms with volumes on the nfs storage. The bloody agent just rebooted every single host server regardless if they were running vms with volumes on the rebooted nfs server. 95% of my vms are running from ceph and those should have never been effected in the first place. 
----- Original Message -----

From: "Koushik Das" <ko...@citrix.com> 
To: "<us...@cloudstack.apache.org>" <us...@cloudstack.apache.org> 
Cc: dev@cloudstack.apache.org 
Sent: Monday, 3 March, 2014 5:55:34 AM 
Subject: Re: ALARM - ACS reboots host servers!!! 

The primary storage needs to be put in maintenance before doing any upgrade/reboot as mentioned in the previous mails. 

-Koushik 

On 03-Mar-2014, at 6:07 AM, Marcus <sh...@gmail.com> wrote: 

> Also, please note that in the bug you referenced it doesn't have a 
> problem with the reboot being triggered, but with the fact that reboot 
> never completes due to hanging NFS mount (which is why the reboot 
> occurs, inaccessible primary storage). 
> 
> On Sun, Mar 2, 2014 at 5:26 PM, Marcus <sh...@gmail.com> wrote: 
>> Or do you mean you have multiple primary storages and this one was not 
>> in use and put into maintenance? 
>> 
>> On Sun, Mar 2, 2014 at 5:25 PM, Marcus <sh...@gmail.com> wrote: 
>>> I'm not sure I understand. How do you expect to reboot your primary 
>>> storage while vms are running? It sounds like the host is being 
>>> fenced since it cannot contact the resources it depends on. 
>>> 
>>> On Sun, Mar 2, 2014 at 3:24 PM, Nux! <nu...@li.nux.ro> wrote: 
>>>> On 02.03.2014 21:17, Andrei Mikhailovsky wrote: 
>>>>> 
>>>>> Hello guys, 
>>>>> 
>>>>> 
>>>>> I've recently came across the bug CLOUDSTACK-5429 which has rebooted 
>>>>> all of my host servers without properly shutting down the guest vms. 
>>>>> I've simply upgraded and rebooted one of the nfs primary storage 
>>>>> servers and a few minutes later, to my horror, i've found out that all 
>>>>> of my host servers have been rebooted. Is it just me thinking so, or 
>>>>> is this bug should be fixed ASAP and should be a blocker for any new 
>>>>> ACS release. I mean not only does it cause downtime, but also possible 
>>>>> data loss and server corruption. 
>>>> 
>>>> 
>>>> Hi Andrei, 
>>>> 
>>>> Do you have HA enabled and did you put that primary storage in maintenance 
>>>> mode before rebooting it? 
>>>> It's my understanding that ACS relies on the shared storage to perform HA so 
>>>> if the storage goes it's expected to go berserk. I've noticed similar 
>>>> behaviour in Xenserver pools without ACS. 
>>>> I'd imagine a "cure" for this would be to use network distributed 
>>>> "filesystems" like GlusterFS or CEPH. 
>>>> 
>>>> Lucian 
>>>> 
>>>> -- 
>>>> Sent from the Delta quadrant using Borg technology! 
>>>> 
>>>> Nux! 
>>>> www.nux.ro

Re: ALARM - ACS reboots host servers!!!

Posted by Andrei Mikhailovsky <an...@arhont.com>.

Koushik, I understand that and I will put the storage into the maintenance mode next time. However, things happen and servers crash from time to time, which is not the reason to reboot all host servers, even those which do not have any running vms with volumes on the nfs storage. The bloody agent just rebooted every single host server regardless if they were running vms with volumes on the rebooted nfs server. 95% of my vms are running from ceph and those should have never been effected in the first place. 
----- Original Message -----

From: "Koushik Das" <ko...@citrix.com> 
To: "<us...@cloudstack.apache.org>" <us...@cloudstack.apache.org> 
Cc: dev@cloudstack.apache.org 
Sent: Monday, 3 March, 2014 5:55:34 AM 
Subject: Re: ALARM - ACS reboots host servers!!! 

The primary storage needs to be put in maintenance before doing any upgrade/reboot as mentioned in the previous mails. 

-Koushik 

On 03-Mar-2014, at 6:07 AM, Marcus <sh...@gmail.com> wrote: 

> Also, please note that in the bug you referenced it doesn't have a 
> problem with the reboot being triggered, but with the fact that reboot 
> never completes due to hanging NFS mount (which is why the reboot 
> occurs, inaccessible primary storage). 
> 
> On Sun, Mar 2, 2014 at 5:26 PM, Marcus <sh...@gmail.com> wrote: 
>> Or do you mean you have multiple primary storages and this one was not 
>> in use and put into maintenance? 
>> 
>> On Sun, Mar 2, 2014 at 5:25 PM, Marcus <sh...@gmail.com> wrote: 
>>> I'm not sure I understand. How do you expect to reboot your primary 
>>> storage while vms are running? It sounds like the host is being 
>>> fenced since it cannot contact the resources it depends on. 
>>> 
>>> On Sun, Mar 2, 2014 at 3:24 PM, Nux! <nu...@li.nux.ro> wrote: 
>>>> On 02.03.2014 21:17, Andrei Mikhailovsky wrote: 
>>>>> 
>>>>> Hello guys, 
>>>>> 
>>>>> 
>>>>> I've recently came across the bug CLOUDSTACK-5429 which has rebooted 
>>>>> all of my host servers without properly shutting down the guest vms. 
>>>>> I've simply upgraded and rebooted one of the nfs primary storage 
>>>>> servers and a few minutes later, to my horror, i've found out that all 
>>>>> of my host servers have been rebooted. Is it just me thinking so, or 
>>>>> is this bug should be fixed ASAP and should be a blocker for any new 
>>>>> ACS release. I mean not only does it cause downtime, but also possible 
>>>>> data loss and server corruption. 
>>>> 
>>>> 
>>>> Hi Andrei, 
>>>> 
>>>> Do you have HA enabled and did you put that primary storage in maintenance 
>>>> mode before rebooting it? 
>>>> It's my understanding that ACS relies on the shared storage to perform HA so 
>>>> if the storage goes it's expected to go berserk. I've noticed similar 
>>>> behaviour in Xenserver pools without ACS. 
>>>> I'd imagine a "cure" for this would be to use network distributed 
>>>> "filesystems" like GlusterFS or CEPH. 
>>>> 
>>>> Lucian 
>>>> 
>>>> -- 
>>>> Sent from the Delta quadrant using Borg technology! 
>>>> 
>>>> Nux! 
>>>> www.nux.ro

Re: ALARM - ACS reboots host servers!!!

Posted by Koushik Das <ko...@citrix.com>.

The primary storage needs to be put in maintenance before doing any upgrade/reboot as mentioned in the previous mails.

-Koushik

On 03-Mar-2014, at 6:07 AM, Marcus <sh...@gmail.com> wrote:

> Also, please note that in the bug you referenced it doesn't have a
> problem with the reboot being triggered, but with the fact that reboot
> never completes due to hanging NFS mount (which is why the reboot
> occurs, inaccessible primary storage).
> 
> On Sun, Mar 2, 2014 at 5:26 PM, Marcus <sh...@gmail.com> wrote:
>> Or do you mean you have multiple primary storages and this one was not
>> in use and put into maintenance?
>> 
>> On Sun, Mar 2, 2014 at 5:25 PM, Marcus <sh...@gmail.com> wrote:
>>> I'm not sure I understand. How do you expect to reboot your primary
>>> storage while vms are running?  It sounds like the host is being
>>> fenced since it cannot contact the resources it depends on.
>>> 
>>> On Sun, Mar 2, 2014 at 3:24 PM, Nux! <nu...@li.nux.ro> wrote:
>>>> On 02.03.2014 21:17, Andrei Mikhailovsky wrote:
>>>>> 
>>>>> Hello guys,
>>>>> 
>>>>> 
>>>>> I've recently came across the bug CLOUDSTACK-5429 which has rebooted
>>>>> all of my host servers without properly shutting down the guest vms.
>>>>> I've simply upgraded and rebooted one of the nfs primary storage
>>>>> servers and a few minutes later, to my horror, i've found out that all
>>>>> of my host servers have been rebooted. Is it just me thinking so, or
>>>>> is this bug should be fixed ASAP and should be a blocker for any new
>>>>> ACS release. I mean not only does it cause downtime, but also possible
>>>>> data loss and server corruption.
>>>> 
>>>> 
>>>> Hi Andrei,
>>>> 
>>>> Do you have HA enabled and did you put that primary storage in maintenance
>>>> mode before rebooting it?
>>>> It's my understanding that ACS relies on the shared storage to perform HA so
>>>> if the storage goes it's expected to go berserk. I've noticed similar
>>>> behaviour in Xenserver pools without ACS.
>>>> I'd imagine a "cure" for this would be to use network distributed
>>>> "filesystems" like GlusterFS or CEPH.
>>>> 
>>>> Lucian
>>>> 
>>>> --
>>>> Sent from the Delta quadrant using Borg technology!
>>>> 
>>>> Nux!
>>>> www.nux.ro

Re: ALARM - ACS reboots host servers!!!

Posted by Koushik Das <ko...@citrix.com>.

The primary storage needs to be put in maintenance before doing any upgrade/reboot as mentioned in the previous mails.

-Koushik

On 03-Mar-2014, at 6:07 AM, Marcus <sh...@gmail.com> wrote:

> Also, please note that in the bug you referenced it doesn't have a
> problem with the reboot being triggered, but with the fact that reboot
> never completes due to hanging NFS mount (which is why the reboot
> occurs, inaccessible primary storage).
> 
> On Sun, Mar 2, 2014 at 5:26 PM, Marcus <sh...@gmail.com> wrote:
>> Or do you mean you have multiple primary storages and this one was not
>> in use and put into maintenance?
>> 
>> On Sun, Mar 2, 2014 at 5:25 PM, Marcus <sh...@gmail.com> wrote:
>>> I'm not sure I understand. How do you expect to reboot your primary
>>> storage while vms are running?  It sounds like the host is being
>>> fenced since it cannot contact the resources it depends on.
>>> 
>>> On Sun, Mar 2, 2014 at 3:24 PM, Nux! <nu...@li.nux.ro> wrote:
>>>> On 02.03.2014 21:17, Andrei Mikhailovsky wrote:
>>>>> 
>>>>> Hello guys,
>>>>> 
>>>>> 
>>>>> I've recently came across the bug CLOUDSTACK-5429 which has rebooted
>>>>> all of my host servers without properly shutting down the guest vms.
>>>>> I've simply upgraded and rebooted one of the nfs primary storage
>>>>> servers and a few minutes later, to my horror, i've found out that all
>>>>> of my host servers have been rebooted. Is it just me thinking so, or
>>>>> is this bug should be fixed ASAP and should be a blocker for any new
>>>>> ACS release. I mean not only does it cause downtime, but also possible
>>>>> data loss and server corruption.
>>>> 
>>>> 
>>>> Hi Andrei,
>>>> 
>>>> Do you have HA enabled and did you put that primary storage in maintenance
>>>> mode before rebooting it?
>>>> It's my understanding that ACS relies on the shared storage to perform HA so
>>>> if the storage goes it's expected to go berserk. I've noticed similar
>>>> behaviour in Xenserver pools without ACS.
>>>> I'd imagine a "cure" for this would be to use network distributed
>>>> "filesystems" like GlusterFS or CEPH.
>>>> 
>>>> Lucian
>>>> 
>>>> --
>>>> Sent from the Delta quadrant using Borg technology!
>>>> 
>>>> Nux!
>>>> www.nux.ro

Re: ALARM - ACS reboots host servers!!!

Posted by Marcus <sh...@gmail.com>.

Also, please note that in the bug you referenced it doesn't have a
problem with the reboot being triggered, but with the fact that reboot
never completes due to hanging NFS mount (which is why the reboot
occurs, inaccessible primary storage).

On Sun, Mar 2, 2014 at 5:26 PM, Marcus <sh...@gmail.com> wrote:
> Or do you mean you have multiple primary storages and this one was not
> in use and put into maintenance?
>
> On Sun, Mar 2, 2014 at 5:25 PM, Marcus <sh...@gmail.com> wrote:
>> I'm not sure I understand. How do you expect to reboot your primary
>> storage while vms are running?  It sounds like the host is being
>> fenced since it cannot contact the resources it depends on.
>>
>> On Sun, Mar 2, 2014 at 3:24 PM, Nux! <nu...@li.nux.ro> wrote:
>>> On 02.03.2014 21:17, Andrei Mikhailovsky wrote:
>>>>
>>>> Hello guys,
>>>>
>>>>
>>>> I've recently came across the bug CLOUDSTACK-5429 which has rebooted
>>>> all of my host servers without properly shutting down the guest vms.
>>>> I've simply upgraded and rebooted one of the nfs primary storage
>>>> servers and a few minutes later, to my horror, i've found out that all
>>>> of my host servers have been rebooted. Is it just me thinking so, or
>>>> is this bug should be fixed ASAP and should be a blocker for any new
>>>> ACS release. I mean not only does it cause downtime, but also possible
>>>> data loss and server corruption.
>>>
>>>
>>> Hi Andrei,
>>>
>>> Do you have HA enabled and did you put that primary storage in maintenance
>>> mode before rebooting it?
>>> It's my understanding that ACS relies on the shared storage to perform HA so
>>> if the storage goes it's expected to go berserk. I've noticed similar
>>> behaviour in Xenserver pools without ACS.
>>> I'd imagine a "cure" for this would be to use network distributed
>>> "filesystems" like GlusterFS or CEPH.
>>>
>>> Lucian
>>>
>>> --
>>> Sent from the Delta quadrant using Borg technology!
>>>
>>> Nux!
>>> www.nux.ro

Re: ALARM - ACS reboots host servers!!!

Posted by Marcus <sh...@gmail.com>.

Also, please note that in the bug you referenced it doesn't have a
problem with the reboot being triggered, but with the fact that reboot
never completes due to hanging NFS mount (which is why the reboot
occurs, inaccessible primary storage).

On Sun, Mar 2, 2014 at 5:26 PM, Marcus <sh...@gmail.com> wrote:
> Or do you mean you have multiple primary storages and this one was not
> in use and put into maintenance?
>
> On Sun, Mar 2, 2014 at 5:25 PM, Marcus <sh...@gmail.com> wrote:
>> I'm not sure I understand. How do you expect to reboot your primary
>> storage while vms are running?  It sounds like the host is being
>> fenced since it cannot contact the resources it depends on.
>>
>> On Sun, Mar 2, 2014 at 3:24 PM, Nux! <nu...@li.nux.ro> wrote:
>>> On 02.03.2014 21:17, Andrei Mikhailovsky wrote:
>>>>
>>>> Hello guys,
>>>>
>>>>
>>>> I've recently came across the bug CLOUDSTACK-5429 which has rebooted
>>>> all of my host servers without properly shutting down the guest vms.
>>>> I've simply upgraded and rebooted one of the nfs primary storage
>>>> servers and a few minutes later, to my horror, i've found out that all
>>>> of my host servers have been rebooted. Is it just me thinking so, or
>>>> is this bug should be fixed ASAP and should be a blocker for any new
>>>> ACS release. I mean not only does it cause downtime, but also possible
>>>> data loss and server corruption.
>>>
>>>
>>> Hi Andrei,
>>>
>>> Do you have HA enabled and did you put that primary storage in maintenance
>>> mode before rebooting it?
>>> It's my understanding that ACS relies on the shared storage to perform HA so
>>> if the storage goes it's expected to go berserk. I've noticed similar
>>> behaviour in Xenserver pools without ACS.
>>> I'd imagine a "cure" for this would be to use network distributed
>>> "filesystems" like GlusterFS or CEPH.
>>>
>>> Lucian
>>>
>>> --
>>> Sent from the Delta quadrant using Borg technology!
>>>
>>> Nux!
>>> www.nux.ro

Re: ALARM - ACS reboots host servers!!!

Posted by Marcus <sh...@gmail.com>.

Or do you mean you have multiple primary storages and this one was not
in use and put into maintenance?

On Sun, Mar 2, 2014 at 5:25 PM, Marcus <sh...@gmail.com> wrote:
> I'm not sure I understand. How do you expect to reboot your primary
> storage while vms are running?  It sounds like the host is being
> fenced since it cannot contact the resources it depends on.
>
> On Sun, Mar 2, 2014 at 3:24 PM, Nux! <nu...@li.nux.ro> wrote:
>> On 02.03.2014 21:17, Andrei Mikhailovsky wrote:
>>>
>>> Hello guys,
>>>
>>>
>>> I've recently came across the bug CLOUDSTACK-5429 which has rebooted
>>> all of my host servers without properly shutting down the guest vms.
>>> I've simply upgraded and rebooted one of the nfs primary storage
>>> servers and a few minutes later, to my horror, i've found out that all
>>> of my host servers have been rebooted. Is it just me thinking so, or
>>> is this bug should be fixed ASAP and should be a blocker for any new
>>> ACS release. I mean not only does it cause downtime, but also possible
>>> data loss and server corruption.
>>
>>
>> Hi Andrei,
>>
>> Do you have HA enabled and did you put that primary storage in maintenance
>> mode before rebooting it?
>> It's my understanding that ACS relies on the shared storage to perform HA so
>> if the storage goes it's expected to go berserk. I've noticed similar
>> behaviour in Xenserver pools without ACS.
>> I'd imagine a "cure" for this would be to use network distributed
>> "filesystems" like GlusterFS or CEPH.
>>
>> Lucian
>>
>> --
>> Sent from the Delta quadrant using Borg technology!
>>
>> Nux!
>> www.nux.ro

Re: ALARM - ACS reboots host servers!!!

Posted by Marcus <sh...@gmail.com>.

Or do you mean you have multiple primary storages and this one was not
in use and put into maintenance?

On Sun, Mar 2, 2014 at 5:25 PM, Marcus <sh...@gmail.com> wrote:
> I'm not sure I understand. How do you expect to reboot your primary
> storage while vms are running?  It sounds like the host is being
> fenced since it cannot contact the resources it depends on.
>
> On Sun, Mar 2, 2014 at 3:24 PM, Nux! <nu...@li.nux.ro> wrote:
>> On 02.03.2014 21:17, Andrei Mikhailovsky wrote:
>>>
>>> Hello guys,
>>>
>>>
>>> I've recently came across the bug CLOUDSTACK-5429 which has rebooted
>>> all of my host servers without properly shutting down the guest vms.
>>> I've simply upgraded and rebooted one of the nfs primary storage
>>> servers and a few minutes later, to my horror, i've found out that all
>>> of my host servers have been rebooted. Is it just me thinking so, or
>>> is this bug should be fixed ASAP and should be a blocker for any new
>>> ACS release. I mean not only does it cause downtime, but also possible
>>> data loss and server corruption.
>>
>>
>> Hi Andrei,
>>
>> Do you have HA enabled and did you put that primary storage in maintenance
>> mode before rebooting it?
>> It's my understanding that ACS relies on the shared storage to perform HA so
>> if the storage goes it's expected to go berserk. I've noticed similar
>> behaviour in Xenserver pools without ACS.
>> I'd imagine a "cure" for this would be to use network distributed
>> "filesystems" like GlusterFS or CEPH.
>>
>> Lucian
>>
>> --
>> Sent from the Delta quadrant using Borg technology!
>>
>> Nux!
>> www.nux.ro

Re: ALARM - ACS reboots host servers!!!

Posted by Marcus <sh...@gmail.com>.

I'm not sure I understand. How do you expect to reboot your primary
storage while vms are running?  It sounds like the host is being
fenced since it cannot contact the resources it depends on.

On Sun, Mar 2, 2014 at 3:24 PM, Nux! <nu...@li.nux.ro> wrote:
> On 02.03.2014 21:17, Andrei Mikhailovsky wrote:
>>
>> Hello guys,
>>
>>
>> I've recently came across the bug CLOUDSTACK-5429 which has rebooted
>> all of my host servers without properly shutting down the guest vms.
>> I've simply upgraded and rebooted one of the nfs primary storage
>> servers and a few minutes later, to my horror, i've found out that all
>> of my host servers have been rebooted. Is it just me thinking so, or
>> is this bug should be fixed ASAP and should be a blocker for any new
>> ACS release. I mean not only does it cause downtime, but also possible
>> data loss and server corruption.
>
>
> Hi Andrei,
>
> Do you have HA enabled and did you put that primary storage in maintenance
> mode before rebooting it?
> It's my understanding that ACS relies on the shared storage to perform HA so
> if the storage goes it's expected to go berserk. I've noticed similar
> behaviour in Xenserver pools without ACS.
> I'd imagine a "cure" for this would be to use network distributed
> "filesystems" like GlusterFS or CEPH.
>
> Lucian
>
> --
> Sent from the Delta quadrant using Borg technology!
>
> Nux!
> www.nux.ro

Re: ALARM - ACS reboots host servers!!!

Posted by Marcus <sh...@gmail.com>.

I'm not sure I understand. How do you expect to reboot your primary
storage while vms are running?  It sounds like the host is being
fenced since it cannot contact the resources it depends on.

On Sun, Mar 2, 2014 at 3:24 PM, Nux! <nu...@li.nux.ro> wrote:
> On 02.03.2014 21:17, Andrei Mikhailovsky wrote:
>>
>> Hello guys,
>>
>>
>> I've recently came across the bug CLOUDSTACK-5429 which has rebooted
>> all of my host servers without properly shutting down the guest vms.
>> I've simply upgraded and rebooted one of the nfs primary storage
>> servers and a few minutes later, to my horror, i've found out that all
>> of my host servers have been rebooted. Is it just me thinking so, or
>> is this bug should be fixed ASAP and should be a blocker for any new
>> ACS release. I mean not only does it cause downtime, but also possible
>> data loss and server corruption.
>
>
> Hi Andrei,
>
> Do you have HA enabled and did you put that primary storage in maintenance
> mode before rebooting it?
> It's my understanding that ACS relies on the shared storage to perform HA so
> if the storage goes it's expected to go berserk. I've noticed similar
> behaviour in Xenserver pools without ACS.
> I'd imagine a "cure" for this would be to use network distributed
> "filesystems" like GlusterFS or CEPH.
>
> Lucian
>
> --
> Sent from the Delta quadrant using Borg technology!
>
> Nux!
> www.nux.ro

Re: ALARM - ACS reboots host servers!!!

Posted by Nux! <nu...@li.nux.ro>.

On 03.03.2014 12:24, Andrei Mikhailovsky wrote:
> I am using HA for about 30% of the guest vms, but my testing showed
> that HA is not working reliably with KVM. It works pretty well if you
> initiate a vm shutdown inside a guest without using the ACS GUI.
> However, when the host goes down for whatever reason (power failure,
> init 6/0, network failure, etc.) the HA fails to kick in and restart
> the vms.

This shuld be submitted as a bug. Which version are you on?

> 
> 
> Regarding the nfs storage, I did not put the nfs server in the
> maintenance mode. Would this solve the problem with reboots? I will
> try it next time when I am doing maintenance on the nfs, but I do
> recall that i've previously restarted the nfs server in the past and
> I've not seen the hosts rebooting themselves. Is there a timeout which
> causes the hosts to reboot?

Not sure what the timeout is, I'd be interested in finding out as well.

To the best of my knowledge, when you put primary storage in m-mode ACS 
will shut down the VMs on it.
Otherwise the shared storage is used by ACS to maintain HA (so your HA 
is as good as your shared storage ...), if link to the shared storage is 
down the host assumes something is wrong and shuts down (fences itself), 
this is the correct and expected behaviour. Maybe your network has 
segmented etc.

> 
> 
> In any case, I think it is not safe to do an automated host server
> reboot and if it was up to me I would disable this feature from the
> agent. IMHO this should be down to system administrator and acs agent
> should send an alert email if something goes wrong instead of
> rebooting the host servers.

Not sure what to tell you, HA is a sensitive and complex subject. For 
now I'm ok with this behaviour and I see it implemented similarly in 
Xenserver, too.

> 
> 
> I am using ceph for my primary storage for guest vms data and root
> disks. The NFS is used as a backup disk offering for the guest.
> 
> 
> Andrei
> 

-- 
Sent from the Delta quadrant using Borg technology!

Nux!
www.nux.ro

Re: ALARM - ACS reboots host servers!!!

Posted by Andrei Mikhailovsky <an...@arhont.com>.

Nux, 

I am using HA for about 30% of the guest vms, but my testing showed that HA is not working reliably with KVM. It works pretty well if you initiate a vm shutdown inside a guest without using the ACS GUI. However, when the host goes down for whatever reason (power failure, init 6/0, network failure, etc.) the HA fails to kick in and restart the vms. 

Regarding the nfs storage, I did not put the nfs server in the maintenance mode. Would this solve the problem with reboots? I will try it next time when I am doing maintenance on the nfs, but I do recall that i've previously restarted the nfs server in the past and I've not seen the hosts rebooting themselves. Is there a timeout which causes the hosts to reboot? 

In any case, I think it is not safe to do an automated host server reboot and if it was up to me I would disable this feature from the agent. IMHO this should be down to system administrator and acs agent should send an alert email if something goes wrong instead of rebooting the host servers. 

I am using ceph for my primary storage for guest vms data and root disks. The NFS is used as a backup disk offering for the guest. 

Andrei 

----- Original Message -----

From: "Nux!" <nu...@li.nux.ro> 
To: users@cloudstack.apache.org 
Sent: Sunday, 2 March, 2014 10:24:07 PM 
Subject: Re: ALARM - ACS reboots host servers!!! 

On 02.03.2014 21:17, Andrei Mikhailovsky wrote: 
> Hello guys, 
> 
> 
> I've recently came across the bug CLOUDSTACK-5429 which has rebooted 
> all of my host servers without properly shutting down the guest vms. 
> I've simply upgraded and rebooted one of the nfs primary storage 
> servers and a few minutes later, to my horror, i've found out that all 
> of my host servers have been rebooted. Is it just me thinking so, or 
> is this bug should be fixed ASAP and should be a blocker for any new 
> ACS release. I mean not only does it cause downtime, but also possible 
> data loss and server corruption. 

Hi Andrei, 

Do you have HA enabled and did you put that primary storage in 
maintenance mode before rebooting it? 
It's my understanding that ACS relies on the shared storage to perform 
HA so if the storage goes it's expected to go berserk. I've noticed 
similar behaviour in Xenserver pools without ACS. 
I'd imagine a "cure" for this would be to use network distributed 
"filesystems" like GlusterFS or CEPH. 

Lucian 

-- 
Sent from the Delta quadrant using Borg technology! 

Nux! 
www.nux.ro

Re: ALARM - ACS reboots host servers!!!

Posted by Nux! <nu...@li.nux.ro>.

On 02.03.2014 21:17, Andrei Mikhailovsky wrote:
> Hello guys,
> 
> 
> I've recently came across the bug CLOUDSTACK-5429 which has rebooted
> all of my host servers without properly shutting down the guest vms.
> I've simply upgraded and rebooted one of the nfs primary storage
> servers and a few minutes later, to my horror, i've found out that all
> of my host servers have been rebooted. Is it just me thinking so, or
> is this bug should be fixed ASAP and should be a blocker for any new
> ACS release. I mean not only does it cause downtime, but also possible
> data loss and server corruption.

Hi Andrei,

Do you have HA enabled and did you put that primary storage in 
maintenance mode before rebooting it?
It's my understanding that ACS relies on the shared storage to perform 
HA so if the storage goes it's expected to go berserk. I've noticed 
similar behaviour in Xenserver pools without ACS.
I'd imagine a "cure" for this would be to use network distributed 
"filesystems" like GlusterFS or CEPH.

Lucian

-- 
Sent from the Delta quadrant using Borg technology!

Nux!
www.nux.ro

Re: ALARM - ACS reboots host servers!!!

Posted by Nux! <nu...@li.nux.ro>.

On 02.03.2014 21:17, Andrei Mikhailovsky wrote:
> Hello guys,
> 
> 
> I've recently came across the bug CLOUDSTACK-5429 which has rebooted
> all of my host servers without properly shutting down the guest vms.
> I've simply upgraded and rebooted one of the nfs primary storage
> servers and a few minutes later, to my horror, i've found out that all
> of my host servers have been rebooted. Is it just me thinking so, or
> is this bug should be fixed ASAP and should be a blocker for any new
> ACS release. I mean not only does it cause downtime, but also possible
> data loss and server corruption.

Hi Andrei,

Do you have HA enabled and did you put that primary storage in 
maintenance mode before rebooting it?
It's my understanding that ACS relies on the shared storage to perform 
HA so if the storage goes it's expected to go berserk. I've noticed 
similar behaviour in Xenserver pools without ACS.
I'd imagine a "cure" for this would be to use network distributed 
"filesystems" like GlusterFS or CEPH.

Lucian

-- 
Sent from the Delta quadrant using Borg technology!

Nux!
www.nux.ro