Posted to dev@cloudstack.apache.org by Darren Shepherd <da...@gmail.com> on 2013/10/03 00:36:30 UTC

HA is broken on master

Alex,

In scheduleRestart(), the call to _itMgr.advanceStop() used to pass the
VO; now it passes a UUID.  As a result, the VO the HA manager holds is
out of sync with the DB, the previous state and update count it
recorded are wrong, and the HA worker just stops the VM instead of
restarting it.
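
To make that concrete, here is a stripped-down sketch of the mismatch.
These are not the real CloudStack classes, just stand-ins for the VO
snapshot and the work item, and the numbers are made up:

    // Stand-ins for illustration only -- not the actual VMInstanceVO / HA work classes.
    class VmSnapshot {
        final long id;
        final String state;      // state the HA manager saw when it scheduled the work
        final long updateCount;  // update count it recorded at that time

        VmSnapshot(long id, String state, long updateCount) {
            this.id = id;
            this.state = state;
            this.updateCount = updateCount;
        }
    }

    public class StaleVoDemo {
        public static void main(String[] args) {
            // What the HA manager captured before calling advanceStop().
            VmSnapshot recorded = new VmSnapshot(42L, "Running", 7);

            // advanceStop() is now driven by UUID and writes straight to the DB,
            // so the row moves on (Stopped, count bumped) behind the snapshot's back.
            String dbState = "Stopped";
            long dbUpdateCount = 8;

            // The worker's sanity check then fails and it bails out instead of restarting.
            if (recorded.updateCount != dbUpdateCount || !recorded.state.equals(dbState)) {
                System.out.println("state/update count mismatch -> just ensure the VM is stopped");
            } else {
                System.out.println("safe to proceed with the restart");
            }
        }
    }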

I really think the update-count approach is far too fragile.  For
example, currently if you try to start a VM and it fails, the update
count changes; the current code records the new count, so the next
attempt works against the updated value.  But I can see the following
issue (maybe there's a workaround for it).  Imagine you have a large
failure and the stuff really hits the fan: thousands of HA jobs are
trying to run and they keep failing, so to stop the churn you shut
down the mgmt stack to figure out what's wrong with the
infrastructure.  There's a really good chance you kill the mgmt stack
while a VM is in the Starting state.  Now the update count recorded on
the HA work item is out of sync with the DB, so when you bring the
mgmt stack back up, it won't try to restart that VM.
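
Roughly, the sequence I'm worried about looks like this (again just a
sketch, the counts are made up):

    // Illustrative timeline of the restart-during-Starting race; values are made up.
    public class HaRestartRaceDemo {
        public static void main(String[] args) {
            long countInHaWork = 12;   // recorded when the HA job was scheduled
            long countInDb = 12;

            // A start attempt moves the VM to Starting and bumps the row...
            countInDb++;               // now 13

            // ...and right here the mgmt stack gets shut down, so the HA work
            // item never learns about the new count.

            // After the mgmt stack comes back up, the worker compares its stale
            // count with the DB, concludes someone else owns the VM, and never
            // tries to restart it.
            if (countInHaWork != countInDb) {
                System.out.println("stale update count -> HA gives up on this VM");
            }
        }
    }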

Maybe that situation is taken care of somehow, but I could probably
dream up another one.  I think it is far simpler that when a user
starts a VM, you record in a new column in the vm_instance table that
it should be running; then whenever the HA worker processes the
record, it knows the VM is supposed to be running.  If the user does a
stop, you clear that column.  This has the added benefit that when
things are bad and a user starts clicking restart/start, they won't
interfere with HA.  Maybe things have changed, but what I used to see
was this: we'd have an issue, so VMs that should have been running
weren't.  HA kept trying and kept failing.  The user would log in, see
their VM was down, and click start, but that would fail too (for the
same reasons HA was failing).  The VM would stay stopped, but because
the user had touched it, the update count changed and HA wouldn't
start it back up once the infrastructure recovered.  So customers who
proactively tried to do something got penalized: their downtime was
longer because CloudStack wouldn't bring their VM back up like the
other VMs.
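
Something along these lines is what I have in mind.  The flag, the
column name, and these classes are all made up; it's just to show the
shape of it:

    // Sketch of the "should be running" idea -- hypothetical, nothing like
    // this exists in the schema today.
    enum DesiredState { RUNNING, STOPPED }

    class VmRecord {
        String currentState;        // what the VM is actually doing ("Running", "Stopped", ...)
        DesiredState desiredState;  // set when the user starts the VM, cleared when they stop it

        VmRecord(String currentState, DesiredState desiredState) {
            this.currentState = currentState;
            this.desiredState = desiredState;
        }
    }

    class HaWorker {
        // No update-count bookkeeping: keep retrying until reality matches intent.
        boolean shouldRestart(VmRecord vm) {
            return vm.desiredState == DesiredState.RUNNING
                    && !"Running".equals(vm.currentState);
        }
    }

    public class DesiredStateDemo {
        public static void main(String[] args) {
            HaWorker worker = new HaWorker();

            // A failed user start doesn't clear the intent, so HA keeps trying.
            VmRecord vm = new VmRecord("Stopped", DesiredState.RUNNING);
            System.out.println("restart? " + worker.shouldRestart(vm)); // true

            // A user-initiated stop clears the intent, so HA leaves it alone.
            vm.desiredState = DesiredState.STOPPED;
            System.out.println("restart? " + worker.shouldRestart(vm)); // false
        }
    }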

Darren

Re: HA is broken on master

Posted by Kelven Yang <ke...@citrix.com>.
I'm now rebasing the VMsync work onto master; I'll send a merge request
once I'm done.

Kelven



Re: HA is broken on master

Posted by Chiradeep Vittal <Ch...@citrix.com>.
My bad. I thought this was merged into master, but it isn't.



Re: HA is broken on master

Posted by David Nalley <da...@gnsa.us>.
Why is the work happening in master?



Re: HA is broken on master

Posted by Chiradeep Vittal <Ch...@citrix.com>.
Perhaps as a result of this work:
https://cwiki.apache.org/confluence/x/tYvlAQ
I think Kelven is trying to separate the job state (starting, stopping)
from the actual VM state.
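
As I read it, the shape would be roughly this (purely illustrative, not
Kelven's actual code):

    // Purely illustrative -- separate bookkeeping for the orchestration job vs. the VM.
    public class JobVsVmState {
        enum JobState { STARTING, STOPPING, NONE }
        enum VmState  { RUNNING, STOPPED }

        public static void main(String[] args) {
            VmState vmState = VmState.STOPPED;      // what the hypervisor last reported
            JobState jobState = JobState.STARTING;  // what the async job is currently doing

            // The in-flight job no longer has to be encoded in the VM's own state
            // column, so a crash mid-job leaves vmState meaningful on its own.
            System.out.println("vm=" + vmState + ", job=" + jobState);
        }
    }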
