Posted to dev@cloudstack.apache.org by Andrija Panic <an...@gmail.com> on 2015/02/16 11:00:27 UTC

Disable HA temporary ?

Hi team,

I just had funny behaviour a few days ago - one of my hosts was under heavy
load (some disk/network load) and it became disconnected from the MGMT server.

Then the MGMT server started doing its HA thing, but without being able to
make sure that the VMs on the disconnected host were really shut down (and
they were NOT).

So MGMT started some VMs again on other hosts, resulting in 2 copies of the
same VM using shared storage - so corruption happened on the disk.

Is there a way to temporarily disable the HA feature at the global level, or
anything similar?
Thanks

-- 

Andrija Panić

Re: Disable HA temporary ?

Posted by Andrija Panic <an...@gmail.com>.

I agree...and understand :)

But would this mean that VMs will not be provisioned anywhere when HA
kicks in? I guess so...
That way I would avoid starting another copy of a VM that is already
running on the disconnected host - I need this as a temporary solution
during CEPH backfilling, so I'm not sure whether this heavy hack is good or
will cause me even more trouble...

cheers


On 16 February 2015 at 16:58, Logan Barfield <lb...@tqhosting.com>
wrote:

> Hi Andrija,
>
> The way I understand it (and have seen in practice) is that by default
> the MGMT server will use any available server for HA.  Setting the HA
> tag on a host just dedicates that host to HA, meaning that during
> normal provisioning no VMs will use that host; it will only be used
> for HA purposes.  In other words, the "HA" tag is not required for HA
> to work.
>
> Thank You,
>
> Logan Barfield
> Tranquil Hosting



-- 

Andrija Panić

Re: Disable HA temporary ?

Posted by Logan Barfield <lb...@tqhosting.com>.

Hi Andrija,

The way I understand it (and have seen in practice) is that by default
the MGMT server will use any available server for HA.  Setting the HA
tag on a host just dedicates that host to HA, meaning that during
normal provisioning no VMs will use that host; it will only be used
for HA purposes.  In other words, the "HA" tag is not required for HA
to work.
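
The behaviour described above can be pictured with a small sketch. This is only a toy model of the description in this message, not CloudStack's allocator code; the `Host` class, the function names, and the tag value "ha" are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Host:
    """Hypothetical stand-in for a hypervisor host record."""
    name: str
    tags: frozenset = field(default_factory=frozenset)
    available: bool = True


HA_TAG = "ha"  # assumed value of the global HA host tag setting


def provisioning_candidates(hosts, ha_tag=HA_TAG):
    """Normal provisioning skips hosts that are dedicated to HA."""
    return [h for h in hosts if h.available and ha_tag not in h.tags]


def ha_candidates(hosts):
    """HA restarts may land on any available host, tagged or not."""
    return [h for h in hosts if h.available]


hosts = [
    Host("kvm1"),
    Host("kvm2", frozenset({HA_TAG})),
    Host("kvm3", available=False),
]
print([h.name for h in provisioning_candidates(hosts)])  # ['kvm1']
print([h.name for h in ha_candidates(hosts)])            # ['kvm1', 'kvm2']
```

Under this model, defining the tag and then tagging no host at all would not stop HA: `ha_candidates` never looks at the tag, so untagged hosts remain valid restart targets.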

Thank You,

Logan Barfield
Tranquil Hosting


On Mon, Feb 16, 2015 at 10:43 AM, Andrija Panic <an...@gmail.com> wrote:
> Seems to me, that I'm about to issue something similar to:   update
> cloud.vm_instance set ha = 0 where ha =1...
>
> Now seriously, I'm wondering, per the manual - if you define an HA host tag
> at the global config level, and then have NO hosts with that tag - will MGMT
> be unable to start VMs on other hosts, since there are no hosts that are
> dedicated as HA destinations?
>
> Does this make sense? I guess the VMs will just be marked as Stopped in
> the GUI/database, and MGMT will be unable to start them...
> Stupid proposal, but... ?

Re: Disable HA temporary ?

Posted by Andrija Panic <an...@gmail.com>.

Seems to me, that I'm about to issue something similar to:   update
cloud.vm_instance set ha = 0 where ha =1...

Now seriously, I'm wondering, per the manual - if you define an HA host tag
at the global config level, and then have NO hosts with that tag - will MGMT
be unable to start VMs on other hosts, since there are no hosts that are
dedicated as HA destinations?

Does this make sense? I guess the VMs will just be marked as Stopped in
the GUI/database, and MGMT will be unable to start them...
Stupid proposal, but... ?
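
For anyone tempted by the one-line update at the top of this message, here is a rough sketch of a reversible version of the same idea. It runs against a throwaway in-memory SQLite table, not a real CloudStack database: the table and column names (`vm_instance`, `ha`) are simply borrowed from the statement above and are not checked against the actual schema, and the `ha_backup` table is invented for the example. Whether the management server would even honour such a change while running is not known from this thread.

```python
import sqlite3

# Toy table standing in for the one named in the message (assumed layout).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE vm_instance (id INTEGER PRIMARY KEY, name TEXT, ha INTEGER)"
)
conn.executemany(
    "INSERT INTO vm_instance (name, ha) VALUES (?, ?)",
    [("vm-a", 1), ("vm-b", 0), ("vm-c", 1)],
)


def disable_ha(conn):
    """Remember which VMs had the flag set, then clear it everywhere."""
    with conn:  # commit on success, roll back on error
        conn.execute("CREATE TABLE IF NOT EXISTS ha_backup (id INTEGER PRIMARY KEY)")
        conn.execute(
            "INSERT OR IGNORE INTO ha_backup SELECT id FROM vm_instance WHERE ha = 1"
        )
        conn.execute("UPDATE vm_instance SET ha = 0 WHERE ha = 1")


def restore_ha(conn):
    """Set the flag again only on the VMs recorded by disable_ha()."""
    with conn:
        conn.execute(
            "UPDATE vm_instance SET ha = 1 WHERE id IN (SELECT id FROM ha_backup)"
        )
        conn.execute("DROP TABLE ha_backup")


def ha_flags(conn):
    return [row[0] for row in conn.execute("SELECT ha FROM vm_instance ORDER BY id")]


disable_ha(conn)
print(ha_flags(conn))  # [0, 0, 0]
restore_ha(conn)
print(ha_flags(conn))  # [1, 0, 1]
```

The point of the backup table is that a blanket `set ha = 0 where ha = 1` loses the information about which VMs had HA enabled, so it cannot be undone afterwards without a record like this.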

On 16 February 2015 at 16:22, Logan Barfield <lb...@tqhosting.com>
wrote:

> Some sort of fencing independent of the management server is
> definitely needed.  HA in general (particularly on KVM) is all kinds
> of unpredictable/buggy right now.
>
> I like the idea of having a switch that an admin can flip to stop HA.
> In fact I think a better job control system in general (e.g., being
> able to stop/restart/manually start tasks) would be awesome, if it's
> feasible.
>
> Thank You,
>
> Logan Barfield
> Tranquil Hosting



-- 

Andrija Panić

Re: Disable HA temporary ?

Posted by Logan Barfield <lb...@tqhosting.com>.

Some sort of fencing independent of the management server is
definitely needed.  HA in general (particularly on KVM) is all kinds
of unpredictable/buggy right now.

I like the idea of having a switch that an admin can flip to stop HA.
In fact I think a better job control system in general (e.g., being
able to stop/restart/manually start tasks) would be awesome, if it's
feasible.

Thank You,

Logan Barfield
Tranquil Hosting


On Mon, Feb 16, 2015 at 10:05 AM, Wido den Hollander <wi...@widodh.nl> wrote:
>
>
> On 16-02-15 13:16, Andrei Mikhailovsky wrote:
>> I had similar issues at least two or three times. The host agent would disconnect from the management server. The agent would not connect back to the management server without manual intervention; however, it would happily continue running the vms. The management server would initiate the HA and fire up vms which were already running on the disconnected host. I ended up with a handful of vms and virtual routers being run on two hypervisors, thus corrupting the disk and having all sorts of issues ((( .
>>
>> I think there has to be a better way of dealing with this case, at least on an image level. Perhaps a host should keep some sort of lock file, or a file for every image where it would record a time stamp. Something like:
>>
>> f5ffa8b0-d852-41c8-a386-6efb8241e2e7 and
>> f5ffa8b0-d852-41c8-a386-6efb8241e2e7-timestamp
>>
>> Thus, the f5ffa8b0-d852-41c8-a386-6efb8241e2e7 is the name of the disk image and f5ffa8b0-d852-41c8-a386-6efb8241e2e7-timestamp is the image's time stamp.
>>
>> The hypervisor should record the time stamp in this file while the vm is running. Let's say every 5-10 seconds. If the timestamp is old, we can assume that the volume is no longer used by the hypervisor.
>>
>> When a vm is started, the timestamp file should be checked and if the timestamp is recent, the vm should not start, otherwise, the vm should start and the timestamp file should be regularly updated.
>>
>> I am sure there are better ways of doing this, but at least this method should not allow two vms running on different hosts to use the same volume and corrupt the data.
>>
>> In ceph, as far as I remember, a new feature is being developed to provide a locking mechanism of an rbd image. Not sure if this will do the job?
>>
>
> Something like this is still on my wishlist for Ceph/RBD, something like
> you propose.
>
> For NFS we currently have this in place, but for Ceph/RBD we don't. It's
> a matter of code in the Agent and the investigators inside the
> Management Server which decide if HA should kick in.
>
> Wido

Re: Disable HA temporary ?

Posted by Wido den Hollander <wi...@widodh.nl>.


On 16-02-15 13:16, Andrei Mikhailovsky wrote:
> I had similar issues at least two or three times. The host agent would disconnect from the management server. The agent would not connect back to the management server without manual intervention; however, it would happily continue running the vms. The management server would initiate the HA and fire up vms which were already running on the disconnected host. I ended up with a handful of vms and virtual routers being run on two hypervisors, thus corrupting the disk and having all sorts of issues ((( .
>
> I think there has to be a better way of dealing with this case, at least on an image level. Perhaps a host should keep some sort of lock file, or a file for every image where it would record a time stamp. Something like:
> 
> f5ffa8b0-d852-41c8-a386-6efb8241e2e7 and 
> f5ffa8b0-d852-41c8-a386-6efb8241e2e7-timestamp 
> 
> Thus, the f5ffa8b0-d852-41c8-a386-6efb8241e2e7 is the name of the disk image and f5ffa8b0-d852-41c8-a386-6efb8241e2e7-timestamp is the image's time stamp. 
> 
> The hypervisor should record the time stamp in this file while the vm is running. Let's say every 5-10 seconds. If the timestamp is old, we can assume that the volume is no longer used by the hypervisor. 
> 
> When a vm is started, the timestamp file should be checked and if the timestamp is recent, the vm should not start, otherwise, the vm should start and the timestamp file should be regularly updated. 
> 
> I am sure there are better ways of doing this, but at least this method should not allow two vms running on different hosts to use the same volume and corrupt the data. 
> 
> In ceph, as far as I remember, a new feature is being developed to provide a locking mechanism of an rbd image. Not sure if this will do the job? 
>

Something like what you propose is still on my wishlist for Ceph/RBD.

For NFS we currently have this in place, but for Ceph/RBD we don't. It's
a matter of code in the Agent and the investigators inside the
Management Server which decide if HA should kick in.
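
As a rough illustration of that decision, here is a minimal sketch. It is not the actual investigator code in the Management Server; `HostState`, `investigate`, `should_restart_vms`, and the 60-second threshold are all hypothetical. The idea it shows is that a lost agent connection alone should leave the host in an "unknown" state, and VMs are only restarted elsewhere when some independent signal (here, a heartbeat on shared storage) says the host is really down.

```python
from enum import Enum


class HostState(Enum):
    UP = "up"
    DOWN = "down"
    UNKNOWN = "unknown"


def investigate(agent_connected, storage_heartbeat_age, max_age=60.0):
    """Classify a host.

    agent_connected: True if the agent still talks to the management server.
    storage_heartbeat_age: seconds since the host last wrote a heartbeat to
        shared storage, or None if the storage offers no such heartbeat.
    """
    if agent_connected:
        return HostState.UP
    if storage_heartbeat_age is None:
        # No independent evidence: we cannot tell a dead host from a busy one.
        return HostState.UNKNOWN
    return HostState.UP if storage_heartbeat_age <= max_age else HostState.DOWN


def should_restart_vms(state):
    """Only a confirmed DOWN host justifies starting its VMs elsewhere."""
    return state is HostState.DOWN


# Agent lost, storage without heartbeat: stay put rather than risk two copies.
print(should_restart_vms(investigate(False, None)))   # False
# Agent lost, heartbeat five minutes old: treat the host as down.
print(should_restart_vms(investigate(False, 300.0)))  # True
```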

Wido


Re: Disable HA temporary ?

Posted by Andrei Mikhailovsky <an...@arhont.com>.

I had similar issues at least two or three times. The host agent would disconnect from the management server. The agent would not connect back to the management server without manual intervention; however, it would happily continue running the vms. The management server would initiate the HA and fire up vms which were already running on the disconnected host. I ended up with a handful of vms and virtual routers being run on two hypervisors, thus corrupting the disk and having all sorts of issues ((( .

I think there has to be a better way of dealing with this case, at least on an image level. Perhaps a host should keep some sort of lock file, or a file for every image where it would record a time stamp. Something like:

f5ffa8b0-d852-41c8-a386-6efb8241e2e7 and 
f5ffa8b0-d852-41c8-a386-6efb8241e2e7-timestamp 

Thus, the f5ffa8b0-d852-41c8-a386-6efb8241e2e7 is the name of the disk image and f5ffa8b0-d852-41c8-a386-6efb8241e2e7-timestamp is the image's time stamp. 

The hypervisor should record the time stamp in this file while the vm is running. Let's say every 5-10 seconds. If the timestamp is old, we can assume that the volume is no longer used by the hypervisor. 

When a vm is started, the timestamp file should be checked and if the timestamp is recent, the vm should not start, otherwise, the vm should start and the timestamp file should be regularly updated. 

I am sure there are better ways of doing this, but at least this method should not allow two vms running on different hosts to use the same volume and corrupt the data. 
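
A minimal sketch of the timestamp-file idea, assuming the image and its "-timestamp" file sit side by side on storage that every host can read. The function names and the 30-second staleness threshold are invented for the example; nothing here is existing agent code.

```python
import os
import tempfile
import time

STALE_AFTER = 30.0  # seconds without an update before the image counts as unused (assumed)


def timestamp_path(image_path):
    return image_path + "-timestamp"


def touch_heartbeat(image_path, now=None):
    """Called every few seconds by the host that is running the vm."""
    now = time.time() if now is None else now
    tmp = timestamp_path(image_path) + ".tmp"
    with open(tmp, "w") as f:
        f.write(repr(now))
    os.replace(tmp, timestamp_path(image_path))  # swap in the new file in one step


def image_in_use(image_path, now=None, stale_after=STALE_AFTER):
    """True if some host updated the timestamp file recently."""
    now = time.time() if now is None else now
    try:
        with open(timestamp_path(image_path)) as f:
            last = float(f.read().strip())
    except (FileNotFoundError, ValueError):
        return False  # no usable timestamp: nobody claims the image
    return (now - last) < stale_after


def may_start_vm(image_path, now=None):
    """Check to run before starting a vm on this image."""
    return not image_in_use(image_path, now=now)


if __name__ == "__main__":
    image = os.path.join(tempfile.mkdtemp(), "f5ffa8b0-d852-41c8-a386-6efb8241e2e7")
    touch_heartbeat(image, now=1000.0)
    print(may_start_vm(image, now=1005.0))  # False: updated 5 seconds ago
    print(may_start_vm(image, now=1100.0))  # True: timestamp is 100 seconds old
```

The check and the start are two separate steps, so two hosts could still both pass the check at the same moment, and the comparison relies on the hosts' clocks roughly agreeing. It narrows the window for a double start but is not a substitute for real fencing.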

In ceph, as far as I remember, a new feature is being developed to provide a locking mechanism of an rbd image. Not sure if this will do the job? 

Andrei 

----- Original Message -----

> From: "Wido den Hollander" <wi...@widodh.nl>
> To: dev@cloudstack.apache.org
> Sent: Monday, 16 February, 2015 11:32:13 AM
> Subject: Re: Disable HA temporary ?

> On 16-02-15 11:00, Andrija Panic wrote:
> > Hi team,
> >
> > I just had funny behaviour a few days ago - one of my hosts was under
> > heavy load (some disk/network load) and it became disconnected from
> > the MGMT server.
> >
> > Then the MGMT server started doing its HA thing, but without being able
> > to make sure that the VMs on the disconnected host were really shut
> > down (and they were NOT).
> >
> > So MGMT started some VMs again on other hosts, resulting in 2 copies
> > of the same VM using shared storage - so corruption happened on the
> > disk.
> >
> > Is there a way to temporarily disable the HA feature at the global
> > level, or anything similar?

> Not that I'm aware of, but this is something I have also run into a
> couple of times.

> It would indeed be nice if there could be a way to stop the HA
> process completely as an Admin.

> Wido

> > Thanks
> >

Re: Disable HA temporary ?

Posted by Wido den Hollander <wi...@widodh.nl>.


On 16-02-15 11:00, Andrija Panic wrote:
> Hi team,
> 
> I just had funny behaviour a few days ago - one of my hosts was under heavy
> load (some disk/network load) and it became disconnected from the MGMT server.
>
> Then the MGMT server started doing its HA thing, but without being able to
> make sure that the VMs on the disconnected host were really shut down (and
> they were NOT).
>
> So MGMT started some VMs again on other hosts, resulting in 2 copies of the
> same VM using shared storage - so corruption happened on the disk.
>
> Is there a way to temporarily disable the HA feature at the global level, or
> anything similar?

Not that I'm aware of, but this is something I have also run into a couple
of times.

It would indeed be nice if there could be a way to stop the HA process
completely as an Admin.

Wido

> Thanks
>