Posted to dev@cloudstack.apache.org by Chiradeep Vittal <Ch...@citrix.com> on 2013/02/22 21:41:28 UTC

Re: Issues when vCenter becomes unavailable

CC'ing Kelven to see if he has any ideas.

On 2/22/13 12:22 PM, "Dave Dunaway" <da...@gmail.com> wrote:

>If I may suggest also testing a disconnect of a host (hypervisor) from
>vcenter, so that vcenter and CS can still talk, but vcenter cannot talk to
>the hosts (hypervisors). CS marks the host as down or failed or whatever.
>
>When the host comes back up, vCenter can see it just fine and all seems good.
>That, however, is not the case (I had this with CS 3.0.5 and VMware ESXi 5.0)
>when CS tries to talk to vCenter and the previously disconnected host
>(which has now recovered).
>
>What we experienced was that we had to migrate all guests off the
>recovered host, then destroy that host in CS and re-create it. Then we
>could migrate the previously migrated guests back onto it.
>
>The curious thing is that while CS would not send commands to the host
>(it kept saying host id=X has timed out, whatever command was sent to
>it), CS WAS polling the host for resources and getting the correct
>numbers, so CS could in some ways talk to the host (i.e., it knew the
>capabilities, number of VMs on it, etc.).
>
>Luckily for me this all happened in a test environment. In production,
>this would have been a real nightmare!
>
>
>dave
>
>
>On Fri, Feb 22, 2013 at 2:48 PM, Musayev, Ilya <im...@webmd.net> wrote:
>
>> Andi
>>
>> I'm on CS 4.0. I simulated a VMware vCenter 5 failure by adding a bogus
>> IP entry in /etc/hosts for the vCenter host for 10 minutes. That in
>> turn made VC unreachable by CS.
>>
>> I then began executing commands and sure enough commands failed or
>> backlogged. Once I restored VC connectivity, the backlogged commands
>> executed and I did not experience any abnormalities.
>>
>> I will redo this test and leave VC off for an hour - maybe I need a
>> longer outage.
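
During a test like this, a TCP probe from the management server can confirm that vCenter really is unreachable from the CS side. A minimal sketch, assuming vCenter answers on port 443 (the thread does not state the port, and the function name is made up for illustration):

```python
import socket


def is_reachable(host, port=443, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts alike.
        return False
```

Calling is_reachable() with the hostname CS uses for vCenter, before and after editing /etc/hosts, should show the result flip from True to False, provided the bogus IP does not answer on that port.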
>>
>> Regards
>> ilya
>>
>>
>>
>> -----Original Message-----
>> From: Musayev, Ilya
>> Sent: Thursday, February 21, 2013 2:43 PM
>> To: cloudstack-users@incubator.apache.org
>> Subject: RE: Issues when vCenter becomes unavailable
>>
>> This is definitely not the behavior we want with vcenter.
>>
>> I will test this out on my lab setup shortly.
>>
>> Thanks
>> ilya
>>
>> -----Original Message-----
>> From: Chip Childers [mailto:chip.childers@sungard.com]
>> Sent: Thursday, February 21, 2013 9:40 AM
>> To: cloudstack-users@incubator.apache.org
>> Subject: Re: Issues when vCenter becomes unavailable
>>
>> On Thu, Feb 21, 2013 at 08:59:14AM -0500, Mathias Mullins wrote:
>> > Andreas,
>> >
>> > The open source community doesn't support the Citrix version 3.0.6.
>> > You need to report this via your Citrix Support contract. Sounds like
>> > this could be a bug.
>> >
>> > Community - this could be a possible issue in 4.0.0 / 4.0.1. I don't
>> > know if this test case has been explored.
>>
>> Thx - I forwarded to cs-dev@i.a.o to get the test engineers in the
>> community to take a look.
>>
>> >
>> > Thanks,
>> > Matt Mullins
>> > CloudPlatform Implementation Engineer
>> > Worldwide Cloud Services  Citrix System, Inc.
>> > +1 (407) 920-1107  Office/Cell Phone
>> > matt.mullins@citrix.com
>> >
>> >
>> >
>> > On 2/21/13 5:35 AM, "Fuchs, Andreas (SwissTXT)"
>> > <An...@swisstxt.ch> wrote:
>> >
>> > >Hi CS Users
>> > >
>> > >We are running CS 3.0.6 on a vSphere platform and found a strange
>> > >behavior.
>> > >
>> > >When the vCenter becomes unavailable due to a reboot or some other
>> > >issue, it seems that CS is shutting down instances when vCenter
>> > >becomes available again.
>> > >
>> > >What we think happens:
>> > >1. vCenter becomes unreachable
>> > >2. CS marks the ESX servers as "down"
>> > >3. We think this leads to: CS marks the instances as down as well
>> > >4. When vCenter becomes available again, CS stops the "marked as down"
>> > >instances
>> > >
>> > >This is very bad, as the instances were running all the time and
>> > >the shutdown issued by CS forces a service interruption.
>> > >
>> > >My problem is that I cannot really reproduce it, as a lot of testing is
>> > >ongoing on the platform at the moment, so my questions:
>> > >
>> > >Does someone else see this issue as well and can maybe reproduce?
>> > >Is there a workaround to it, can I change some flag or something
>> > >which tells CS to never shut down an instance by itself?
>> > >Why are the ESX hosts getting marked as down and not unreachable or
>> > >something?
>> > >
>> > >Best regards
>> > >Andi
>> >
>> >
>>
>>
>>


RE: Issues when vCenter becomes unavailable

Posted by "Musayev, Ilya" <im...@webmd.net>.
FYI, please note this JIRA Issue, if there is something I left out, please chime in.

Thanks
ilya

https://issues.apache.org/jira/browse/CLOUDSTACK-1411



From: Musayev, Ilya
Sent: Saturday, February 23, 2013 6:22 PM
To: kelven.yang@citrix.com; cloudstack-dev@incubator.apache.org; cloudstack-users@incubator.apache.org
Subject: Re: Issues when vCenter becomes unavailable

Any chance of some sort of fix for 4.0 or 4.1?

I understand that the CS-669 (feature/enhancement) patch missed the commit deadline and will be in 4.2, but there is a real issue here that impacts production now.

Also, this is not a feature but a bug; I don't know if bugs are treated on the same schedule as features.

Technically, for testing - we don't need to fail hypervisors. vMotion would achieve the same effect and host ID will get out of sync. It's only a theory though.

I will open a bug request on JIRA and ask for some visibility.

Alternatively, we can probably have a hack that will query VC for hosts and vms, identify what's changed, and update db - I'm just trying to avoid hacks.
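
The comparison step of such a hack could look roughly like the sketch below. It is an illustration only, not CloudStack code: the function name, the dictionary shapes, and the host names are assumptions, and fetching the real vCenter inventory and database rows is left out.

```python
def find_stale_host_ids(db_rows, vc_inventory):
    """Return {vm_name: (db_host, vc_host)} for VMs whose recorded host differs.

    db_rows:      {vm_name: host} as recorded in the CS database
    vc_inventory: {vm_name: host} as currently reported by vCenter
    """
    stale = {}
    for vm_name, vc_host in vc_inventory.items():
        db_host = db_rows.get(vm_name)
        # Only VMs that CS already knows about are candidates for an update.
        if db_host is not None and db_host != vc_host:
            stale[vm_name] = (db_host, vc_host)
    return stale


# Illustrative data only: every VM has moved to the one surviving hypervisor.
db_rows = {"i-2-89": "esx1", "i-2-90": "esx2", "i-2-91": "esx3"}
vc_inventory = {"i-2-89": "esx3", "i-2-90": "esx3", "i-2-91": "esx3"}
stale = find_stale_host_ids(db_rows, vc_inventory)
print(stale)
```

Only the entries returned here would need a database update; VMs that vCenter reports but CS does not know are deliberately ignored.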

Kelven Yang <ke...@citrix.com> wrote:
This is an issue that we are targeting to solve to sync states between
vCenter/Cloudstack in a controllable way. Please track the status of this
ticket for further progress

https://issues.apache.org/jira/browse/CLOUDSTACK-669


Kelven


On 2/22/13 3:51 PM, "Musayev, Ilya" <im...@webmd.net> wrote:

>A bit of an incomplete email, as I was on the train and mistakenly pressed
>send; correction below. Sorry :)
>
>-----Original Message-----
>From: Musayev, Ilya [mailto:imusayev@webmd.net]
>Sent: Friday, February 22, 2013 6:49 PM
>To: cloudstack-dev@incubator.apache.org;
>cloudstack-users@incubator.apache.org
>Cc: Kelven Yang
>Subject: RE: Issues when vCenter becomes unavailable
>
>Summary:
>
>I have 3 hypervisors.
>Hypervisors 1 and 2 are down, hypervisor 3 is up. All VMs live on
>hypervisor 3; however, the host_id in the instance table for the VMs is
>not being updated to reflect the only hypervisor alive.
>
>Details:
>
>I physically powered off 2 hypervisors that had most of my VMs and left 1
>online.
>
>The VMs were brought back online by vcenter, however from then on, I
>experience what Dave and Andreas mentioned.
>
>That is, VMware VM instances are bound to a host ID (hypervisor) and not
>to vCenter, and operations executed on the VMs require that
>hypervisor to stay up. If the hypervisor goes offline, the VMs still
>come up in VC, but CS cannot comprehend that these VMs now live on another
>hypervisor.
>
>This is bad for production rollouts, because VMs are bound to a
>hypervisor ID and not to vCenter, and it appears it's not getting
>updated - though I do see in the log that CS is trying to find it.
>
>Did a little more digging; it looks like the host_ids don't get updated
>in MySQL for VMs in the instances table. I need to double-check this
>because I totally messed up 2 of my test CloudStack clusters.
>
>Can someone do the following test - if time allows - if not - I can try
>on monday:
>
>1) Pick a hypervisor for a test crash and note 1 VM (e.g. i-2-89)
>2) Navigate to the "host" table in MySQL and note the host_id for the
>hypervisor that is about to be powered off.
>3) In MySQL, go to the instances table and note the last_host_id and
>host_id for a VM on the test-crash hypervisor.
>4) Power off the hypervisor and let vCenter bring it back online
>5) Attempt to launch a console on the VM that was on the crashed
>hypervisor and was powered back on by VC
>6) If it fails - as it did in my case - alter the value of host_id to the
>hypervisor it's now living on (my test is not clean because I've ruined
>the cluster that hosts my console VM and don't have time to work on
>it ATM)
>7) Launch the console again to see if the issue is resolved
>
>I suspect the host_id does not get updated, as I witnessed by
>examining the MySQL instance table, but I need to fix my env issues to
>confirm.
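
Steps 3 and 6 of the test above can be sketched against a toy database. This is a sketch under stated assumptions, not the real CloudStack schema: it uses an in-memory SQLite database instead of MySQL, and the table name vm_instance and column instance_name are assumptions (the thread only mentions an "instances" table with host_id and last_host_id). The helper names are hypothetical.

```python
import sqlite3

# Toy stand-in for the CS database; table and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE vm_instance ("
    "instance_name TEXT PRIMARY KEY, host_id INTEGER, last_host_id INTEGER)"
)
conn.execute("INSERT INTO vm_instance VALUES ('i-2-89', 1, 1)")


def host_ids(conn, instance_name):
    """Step 3: return (host_id, last_host_id) recorded for a VM."""
    return conn.execute(
        "SELECT host_id, last_host_id FROM vm_instance WHERE instance_name = ?",
        (instance_name,),
    ).fetchone()


def set_host_id(conn, instance_name, new_host_id):
    """Step 6: point the VM at the hypervisor it is actually running on."""
    conn.execute(
        "UPDATE vm_instance SET host_id = ? WHERE instance_name = ?",
        (new_host_id, instance_name),
    )
    conn.commit()


before = host_ids(conn, "i-2-89")  # value noted before the crash
set_host_id(conn, "i-2-89", 3)     # VM now lives on hypervisor 3
after = host_ids(conn, "i-2-89")
print(before, after)
```

Comparing the before and after values is the whole point of the test: if CS updated host_id on its own after the crash, the manual step 6 would not be needed.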
>
>Regards
>ilya
>
>
>-----Original Message-----
>From: Chiradeep Vittal [mailto:Chiradeep.Vittal@citrix.com]
>Sent: Friday, February 22, 2013 3:41 PM
>To: cloudstack-users@incubator.apache.org
>Cc: Kelven Yang; CloudStack DeveloperList
>Subject: Re: Issues when vCenter becomes unavailable
>
>CC'ing Kelven to see if he has any ideas.
>
>On 2/22/13 12:22 PM, "Dave Dunaway" <da...@gmail.com> wrote:
>
>>If I may suggest also testing a disconnect of a host (hypervisor) from
>>vcenter, so that vcenter and CS can still talk, but vcenter cannot talk
>>to the hosts (hypervisors). CS marks the host as down or failed or
>>whatever.
>>
>>When the host comes back up, vCenter can see it just fine and all seems good.
>>That, however, is not the case (I had this with CS 3.0.5 and VMware ESXi 5.0)
>>when CS tries to talk to vCenter and the previously disconnected host
>>(which has now recovered).
>>
>>What we experienced was that we had to migrate all guests off the
>>recovered host, and then destroy that host in CS, and re-create it.
>>Then we could migrate back onto it the guests which had been previously
>>migrated.
>>
>>The curious thing is that while CS would not send commands to the host
>>(it kept saying host id=X has timed out, whatever command was sent to
>>it), CS WAS polling the host for resources and getting the correct
>>numbers, so CS could in some ways talk to the host (i.e., it knew the
>>capabilities, number of VMs on it, etc.).
>>
>>Luckily for me this all happened in a test environment. In production,
>>this would have been a real nightmare!
>>
>>
>>dave
>>
>>
>>On Fri, Feb 22, 2013 at 2:48 PM, Musayev, Ilya <im...@webmd.net>
>>wrote:
>>
>>> Andi
>>>
>>> I'm on CS 4.0. I simulated a VMware vCenter 5 failure by adding a
>>> bogus IP entry in /etc/hosts for the vCenter host for 10 minutes.
>>> That in turn made VC unreachable by CS.
>>>
>>> I then began executing commands and sure enough commands failed or
>>> backlogged. Once I restored VC connectivity, the backlogged commands
>>> executed and I did not experience any abnormalities.
>>>
>>> I will redo this test and leave VC off for an hour - maybe I need a
>>> longer outage.
>>>
>>> Regards
>>> ilya
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Musayev, Ilya
>>> Sent: Thursday, February 21, 2013 2:43 PM
>>> To: cloudstack-users@incubator.apache.org
>>> Subject: RE: Issues when vCenter becomes unavailable
>>>
>>> This is definitely not the behavior we want with vcenter.
>>>
>>> I will test this out on my lab setup shortly.
>>>
>>> Thanks
>>> ilya
>>>
>>> -----Original Message-----
>>> From: Chip Childers [mailto:chip.childers@sungard.com]
>>> Sent: Thursday, February 21, 2013 9:40 AM
>>> To: cloudstack-users@incubator.apache.org
>>> Subject: Re: Issues when vCenter becomes unavailable
>>>
>>> On Thu, Feb 21, 2013 at 08:59:14AM -0500, Mathias Mullins wrote:
>>> > Andreas,
>>> >
>>> > The open source community doesn't support the Citrix version 3.0.6.
>>> > You need to report this via your Citrix Support contract. Sounds
>>> > like this could be a bug.
>>> >
>>> > Community - this could be a possible issue in 4.0.0 / 4.0.1. I
>>> > don't know if this test case has been explored.
>>>
>>> Thx - I forwarded to cs-dev@i.a.o to get the test engineers in the
>>> community to take a look.
>>>
>>> >
>>> > Thanks,
>>> > Matt Mullins
>>> > CloudPlatform Implementation Engineer Worldwide Cloud Services
>>> > Citrix System, Inc.
>>> > +1 (407) 920-1107  Office/Cell Phone
>>> > matt.mullins@citrix.com
>>> >
>>> >
>>> >
>>> > On 2/21/13 5:35 AM, "Fuchs, Andreas (SwissTXT)"
>>> > <An...@swisstxt.ch> wrote:
>>> >
>>> > >Hi CS Users
>>> > >
>>> > >We are running CS 3.0.6 on a vSphere platform and found a strange
>>> > >behavior.
>>> > >
>>> > >When the vCenter becomes unavailable due to a reboot or some other
>>> > >issue, it seems that CS is shutting down instances when vCenter
>>> > >becomes available again.
>>> > >
>>> > >What we think happens:
>>> > >1. vCenter becomes unreachable
>>> > >2. CS marks the ESX servers as "down"
>>> > >3. We think this leads to: CS marks the instances as down as well
>>> > >4. When vCenter becomes available again, CS stops the "marked as down"
>>> > >instances
>>> > >
>>> > >This is very bad, as the instances were running all the time and
>>> > >the shutdown issued by CS forces a service interruption.
>>> > >
>>> > >My problem is that I cannot really reproduce it, as a lot of testing
>>> > >is ongoing on the platform at the moment, so my questions:
>>> > >
>>> > >Does someone else see this issue as well and can maybe reproduce?
>>> > >Is there a workaround to it, can I change some flag or something
>>> > >which tells CS to never shut down an instance by itself?
>>> > >Why are the ESX hosts getting marked as down and not unreachable
>>> > >or something?
>>> > >
>>> > >Best regards
>>> > >Andi
>>> >
>>> >
>>>
>>>
>>>
>
>
>
>
>


Re: Issues when vCenter becomes unavailable

Posted by "Musayev, Ilya" <im...@webmd.net>.
Any chance of some sort of fix for 4.0 or 4.1?

I understand that the CS-669 (feature/enhancement) patch missed the commit deadline and will be in 4.2, but there is a real issue here that impacts production now.

Also, this is not a feature but a bug; I don't know if bugs are treated on the same schedule as features.

Technically, for testing - we don't need to fail hypervisors. vMotion would achieve the same effect and host ID will get out of sync. It's only a theory though.

I will open a bug request on JIRA and ask for some visibility.

Alternatively, we can probably have a hack that will query VC for hosts and vms, identify what's changed, and update db - I'm just trying to avoid hacks.

Kelven Yang <ke...@citrix.com> wrote:
This is an issue that we are targeting to solve to sync states between
vCenter/Cloudstack in a controllable way. Please track the status of this
ticket for further progress

https://issues.apache.org/jira/browse/CLOUDSTACK-669


Kelven


On 2/22/13 3:51 PM, "Musayev, Ilya" <im...@webmd.net> wrote:

>A bit of an incomplete email, as I was on the train and mistakenly pressed
>send; correction below. Sorry :)
>
>-----Original Message-----
>From: Musayev, Ilya [mailto:imusayev@webmd.net]
>Sent: Friday, February 22, 2013 6:49 PM
>To: cloudstack-dev@incubator.apache.org;
>cloudstack-users@incubator.apache.org
>Cc: Kelven Yang
>Subject: RE: Issues when vCenter becomes unavailable
>
>Summary:
>
>I have 3 hypervisors.
>Hypervisors 1 and 2 are down, hypervisor 3 is up. All VMs live on
>hypervisor 3; however, the host_id in the instance table for the VMs is
>not being updated to reflect the only hypervisor alive.
>
>Details:
>
>I physically powered off 2 hypervisors that had most of my VMs and left 1
>online.
>
>The VMs were brought back online by vcenter, however from then on, I
>experience what Dave and Andreas mentioned.
>
>That is, VMware VM instances are bound to a host ID (hypervisor) and not
>to vCenter, and operations executed on the VMs require that
>hypervisor to stay up. If the hypervisor goes offline, the VMs still
>come up in VC, but CS cannot comprehend that these VMs now live on another
>hypervisor.
>
>This is bad for production rollouts, because VMs are bound to a
>hypervisor ID and not to vCenter, and it appears it's not getting
>updated - though I do see in the log that CS is trying to find it.
>
>Did a little more digging; it looks like the host_ids don't get updated
>in MySQL for VMs in the instances table. I need to double-check this
>because I totally messed up 2 of my test CloudStack clusters.
>
>Can someone do the following test - if time allows - if not - I can try
>on monday:
>
>1) Pick a hypervisor for a test crash and note 1 VM (e.g. i-2-89)
>2) Navigate to the "host" table in MySQL and note the host_id for the
>hypervisor that is about to be powered off.
>3) In MySQL, go to the instances table and note the last_host_id and
>host_id for a VM on the test-crash hypervisor.
>4) Power off the hypervisor and let vCenter bring it back online
>5) Attempt to launch a console on the VM that was on the crashed
>hypervisor and was powered back on by VC
>6) If it fails - as it did in my case - alter the value of host_id to the
>hypervisor it's now living on (my test is not clean because I've ruined
>the cluster that hosts my console VM and don't have time to work on
>it ATM)
>7) Launch the console again to see if the issue is resolved
>
>I suspect the host_id does not get updated, as I witnessed by
>examining the MySQL instance table, but I need to fix my env issues to
>confirm.
>
>Regards
>ilya
>
>
>-----Original Message-----
>From: Chiradeep Vittal [mailto:Chiradeep.Vittal@citrix.com]
>Sent: Friday, February 22, 2013 3:41 PM
>To: cloudstack-users@incubator.apache.org
>Cc: Kelven Yang; CloudStack DeveloperList
>Subject: Re: Issues when vCenter becomes unavailable
>
>CC'ing Kelven to see if he has any ideas.
>
>On 2/22/13 12:22 PM, "Dave Dunaway" <da...@gmail.com> wrote:
>
>>If I may suggest also testing a disconnect of a host (hypervisor) from
>>vcenter, so that vcenter and CS can still talk, but vcenter cannot talk
>>to the hosts (hypervisors). CS marks the host as down or failed or
>>whatever.
>>
>>When the host comes back up, vCenter can see it just fine and all seems good.
>>That, however, is not the case (I had this with CS 3.0.5 and VMware ESXi 5.0)
>>when CS tries to talk to vCenter and the previously disconnected host
>>(which has now recovered).
>>
>>What we experienced was that we had to migrate all guests off the
>>recovered host, and then destroy that host in CS, and re-create it.
>>Then we could migrate back onto it the guests which had been previously
>>migrated.
>>
>>The curious thing is that while CS would not send commands to the host
>>(it kept saying host id=X has timed out, whatever command was sent to
>>it), CS WAS polling the host for resources and getting the correct
>>numbers, so CS could in some ways talk to the host (i.e., it knew the
>>capabilities, number of VMs on it, etc.).
>>
>>Luckily for me this all happened in a test environment. In production,
>>this would have been a real nightmare!
>>
>>
>>dave
>>
>>
>>On Fri, Feb 22, 2013 at 2:48 PM, Musayev, Ilya <im...@webmd.net>
>>wrote:
>>
>>> Andi
>>>
>>> I'm on CS 4.0. I simulated a VMware vCenter 5 failure by adding a
>>> bogus IP entry in /etc/hosts for the vCenter host for 10 minutes.
>>> That in turn made VC unreachable by CS.
>>>
>>> I then began executing commands and sure enough commands failed or
>>> backlogged. Once I restored VC connectivity, the backlogged commands
>>> executed and I did not experience any abnormalities.
>>>
>>> I will redo this test and leave VC off for an hour - maybe I need a
>>> longer outage.
>>>
>>> Regards
>>> ilya
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Musayev, Ilya
>>> Sent: Thursday, February 21, 2013 2:43 PM
>>> To: cloudstack-users@incubator.apache.org
>>> Subject: RE: Issues when vCenter becomes unavailable
>>>
>>> This is definitely not the behavior we want with vcenter.
>>>
>>> I will test this out on my lab setup shortly.
>>>
>>> Thanks
>>> ilya
>>>
>>> -----Original Message-----
>>> From: Chip Childers [mailto:chip.childers@sungard.com]
>>> Sent: Thursday, February 21, 2013 9:40 AM
>>> To: cloudstack-users@incubator.apache.org
>>> Subject: Re: Issues when vCenter becomes unavailable
>>>
>>> On Thu, Feb 21, 2013 at 08:59:14AM -0500, Mathias Mullins wrote:
>>> > Andreas,
>>> >
>>> > The open source community doesn't support the Citrix version 3.0.6.
>>> > You need to report this via your Citrix Support contract. Sounds
>>> > like this could be a bug.
>>> >
>>> > Community - this could be a possible issue in 4.0.0 / 4.0.1. I
>>> > don't know if this test case has been explored.
>>>
>>> Thx - I forwarded to cs-dev@i.a.o to get the test engineers in the
>>> community to take a look.
>>>
>>> >
>>> > Thanks,
>>> > Matt Mullins
>>> > CloudPlatform Implementation Engineer Worldwide Cloud Services
>>> > Citrix System, Inc.
>>> > +1 (407) 920-1107  Office/Cell Phone
>>> > matt.mullins@citrix.com
>>> >
>>> >
>>> >
>>> > On 2/21/13 5:35 AM, "Fuchs, Andreas (SwissTXT)"
>>> > <An...@swisstxt.ch> wrote:
>>> >
>>> > >Hi CS Users
>>> > >
>>> > >We are running CS 3.0.6 on a vSphere platform and found a strange
>>> > >behavior.
>>> > >
>>> > >When the vCenter becomes unavailable due to a reboot or some other
>>> > >issue, it seems that CS is shutting down instances when vCenter
>>> > >becomes available again.
>>> > >
>>> > >What we think happens:
>>> > >1. vCenter becomes unreachable
>>> > >2. CS marks the ESX servers as "down"
>>> > >3. We think this leads to: CS marks the instances as down as well
>>> > >4. When vCenter becomes available again, CS stops the "marked as down"
>>> > >instances
>>> > >
>>> > >This is very bad, as the instances were running all the time and
>>> > >the shutdown issued by CS forces a service interruption.
>>> > >
>>> > >My problem is that I cannot really reproduce it, as a lot of testing
>>> > >is ongoing on the platform at the moment, so my questions:
>>> > >
>>> > >Does someone else see this issue as well and can maybe reproduce?
>>> > >Is there a workaround to it, can I change some flag or something
>>> > >which tells CS to never shut down an instance by itself?
>>> > >Why are the ESX hosts getting marked as down and not unreachable
>>> > >or something?
>>> > >
>>> > >Best regards
>>> > >Andi
>>> >
>>> >
>>>
>>>
>>>
>
>
>
>
>



Re: Issues when vCenter becomes unavailable

Posted by "Musayev, Ilya" <im...@webmd.net>.
Any chance of some sort of fix for 4.0 or 4.1?

I understand that CS-669 (feature/enhacement) patch missed the commit deadline and will be in 4.2, but there is a real issue here that impacts production now.

Also, this is not a feature but a bug, I don't know if bugs are also treated on the same schedule as features.

Technically, for testing - we don't need to fail hypervisors. vMotion would achieve the same effect and host ID will get out of sync. It's only a theory though.

I will open a bug request on JIRA and ask for some visibility.

Alternatively, we could probably write a hack that queries VC for hosts and VMs, identifies what has changed, and updates the DB, but I'm trying to avoid hacks.
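For illustration only, the comparison step of such a hack could look roughly like the Python sketch below. It assumes the two placement maps have already been fetched (one from vCenter, one from the CloudStack DB); the function name and the sample data are hypothetical and are not part of CloudStack or the vSphere API.

```python
# Hypothetical sketch: none of these names come from CloudStack or the vSphere API.

def find_host_drift(vcenter_placement, cloudstack_placement):
    """Return {vm_name: (cloudstack_host, vcenter_host)} for VMs whose host differs.

    Both arguments map a VM name to the name of the host it runs on.
    VMs that vCenter does not report are skipped, because we cannot tell
    where (or whether) they are running.
    """
    drift = {}
    for vm_name, cs_host in cloudstack_placement.items():
        vc_host = vcenter_placement.get(vm_name)
        if vc_host is not None and vc_host != cs_host:
            drift[vm_name] = (cs_host, vc_host)
    return drift


# Made-up example: vCenter restarted two VMs on esx3 after esx1 and esx2 failed,
# while CloudStack still records the old hosts.
vcenter_view = {"i-2-89": "esx3", "i-2-90": "esx3", "i-2-91": "esx3"}
cloudstack_view = {"i-2-89": "esx1", "i-2-90": "esx2", "i-2-91": "esx3"}

for vm, (old, new) in sorted(find_host_drift(vcenter_view, cloudstack_view).items()):
    print(f"{vm}: CloudStack says {old}, vCenter says {new}")
```

Only the reported differences would then need a DB update, which keeps the blast radius of the hack small.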

Kelven Yang <ke...@citrix.com> wrote:
This is an issue we are aiming to solve by syncing state between
vCenter and CloudStack in a controllable way. Please track the status of this
ticket for further progress:

https://issues.apache.org/jira/browse/CLOUDSTACK-669


Kelven


On 2/22/13 3:51 PM, "Musayev, Ilya" <im...@webmd.net> wrote:

>Abit Incomplete email as I was in train and mistakenly press send,
>correction below:.. sorry :)
>
>-----Original Message-----
>From: Musayev, Ilya [mailto:imusayev@webmd.net]
>Sent: Friday, February 22, 2013 6:49 PM
>To: cloudstack-dev@incubator.apache.org;
>cloudstack-users@incubator.apache.org
>Cc: Kelven Yang
>Subject: RE: Issues when vCenter becomes unavailable
>
>Summary:
>
>I have 3 hypervisors
>Hypervisor 1 and 2 are down, hypervisor 3 is up. All VMs live on
>hypervisor 3, however, the host_id in instance table for the VMs are not
>being updated to reflect the only hypervisor alive.
>
>Details:
>
>I physically powered off 2 hypervisors that had most of my VMs and left 1
>online.
>
>The VMs were brought back online by vcenter, however from then on, I
>experience what Dave and Andreas mentioned.
>
>That is, VMWare VMs instances are bound to host id (hypervisor) and not
>vcenter and operations that would be executed on the VMs require for the
>hypervisor to stay up. If the hypervisor goes off line, while VMs still
>come up in VC, CS cannot comprehend that these VMs now live on another
>hypervisor.
>
>This is bad for production roll outs - because VMs are bound to a
>hypervisor ID and not virtual center and it appears its not getting
>updated - though I do see in the log that CS is trying to find it.
>
>Did a little more digging, it looks like the host_ids don't get updated
>in mysql for vm in instances table. I need to double check on this
>because I totally messed 2 of test cloudstack clusters.
>
>Can someone do the following test - if time allows - if not - I can try
>on monday:
>
>1) Pick a hypervisor for a test crash and note 1 vm (I.e. i-2-89)
>2) Navigate to "host" table in mysql and note the host_id for hypervisor
>that is about to be powered off.
>3) In mysql goto instances table and note the last_host_id and host_id
>for a VM on test crash hypervisor.
>4) Power off the hypervisor and let VCenter bring it back online
>5) Attempt to launch a console on the VM was on crashed hypervisors and
>was powered back on by VC
>6) If it fails - as it did in my case, alter the value of host_id to a
>next hypervisor its living on (my test is not clean because I've ruined
>the cluster that hosts my console vm and don't have time now to work on
>it ATM)
>7) Launch console again to see if the issue resolved
>
>I'm under suspicion the host_id does not get updated as I witnessed by
>examining mysql instance table, but I need to fix my env issues to
>confirm.
>
>Regards
>ilya
>
>
>-----Original Message-----
>From: Chiradeep Vittal [mailto:Chiradeep.Vittal@citrix.com]
>Sent: Friday, February 22, 2013 3:41 PM
>To: cloudstack-users@incubator.apache.org
>Cc: Kelven Yang; CloudStack DeveloperList
>Subject: Re: Issues when vCenter becomes unavailable
>
>CC'ing Kelven to see if he has any ideas.
>
>On 2/22/13 12:22 PM, "Dave Dunaway" <da...@gmail.com> wrote:
>
>>If I may suggest also testing a disconnect of a host (hypervisor) from
>>vcenter, so that vcenter and CS can still talk, but vcenter cannot talk
>>to the hosts (hypervisors). CS marks the host as down or failed or
>>whatever.
>>
>>When the host comes back up vcenter can it just fine and all seems good.
>>That however is not the case (I had this with CS 3.0.5 and vmware esxi
>>5.0)
>>when CS tries to talk to vcenter and the previously disconnected host
>>(that is now recovered).
>>
>>What we experienced was that we had to migrate all guests off the
>>recovered host, and then destroy that host in CS, and re-create it.
>>Then we could migrate back onto it the guests which had been previously
>>migrated.
>>
>>The curious thing is that while CS did not want to send commands to the
>>host (it kept on saying host id=X has timedout when whatever command
>>was sent to it), CS WAS polling the host for resources and getting the
>>correct numbers.... so CS could in some ways talk to the host (ie: it
>>knew the capabilities, number of VMs on it, etc).
>>
>>Luckily for me this all happened in a test environment. In production,
>>this would have been a real nightmare!
>>
>>
>>dave
>>
>>
>>On Fri, Feb 22, 2013 at 2:48 PM, Musayev, Ilya <im...@webmd.net>
>>wrote:
>>
>>> Andi
>>>
>>> I'm on CS4.0. I simulated the VMWare VCenter 5 failure by adding a
>>>bogus  IP entry in /etc/hosts for 10 minutes for virtual center host.
>>>That in turn  made VC unreachable by CS.
>>>
>>> I then began executing commands and sure enough commands failed or
>>> backlogged. Once I restored VC connectivity, the backlogged commands
>>> executed and I did not experience any abnormalities.
>>>
>>> I will redo this test and leave VC off for an hour - maybe a need a
>>>longer  outage.
>>>
>>> Regards
>>> ilya
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Musayev, Ilya
>>> Sent: Thursday, February 21, 2013 2:43 PM
>>> To: cloudstack-users@incubator.apache.org
>>> Subject: RE: Issues when vCenter becomes unavailable
>>>
>>> This is definitely not the behavior we want with vcenter.
>>>
>>> I will test this out on my lab setup shortly.
>>>
>>> Thanks
>>> ilya
>>>
>>> -----Original Message-----
>>> From: Chip Childers [mailto:chip.childers@sungard.com]
>>> Sent: Thursday, February 21, 2013 9:40 AM
>>> To: cloudstack-users@incubator.apache.org
>>> Subject: Re: Issues when vCenter becomes unavailable
>>>
>>> On Thu, Feb 21, 2013 at 08:59:14AM -0500, Mathias Mullins wrote:
>>> > Andreas,
>>> >
>>> > The open source community doesn't support the Citrix version 3.0.6.
>>> > You need to report this via your Citrix Support contract. Sounds
>>> > like this could be a bug.
>>> >
>>> > Community - this could be a possible issue in 4.0.0 / 4.0.1. I
>>> > don't know if this test case has been explored.
>>>
>>> Thx - I forwarded to cs-dev@i.a.o to get the test engineers in the
>>> community to take a look.
>>>
>>> >
>>> > Thanks,
>>> > Matt Mullins
>>> > CloudPlatform Implementation Engineer Worldwide Cloud Services
>>> > Citrix System, Inc.
>>> > +1 (407) 920-1107  Office/Cell Phone
>>> > matt.mullins@citrix.com
>>> >
>>> >
>>> >
>>> > On 2/21/13 5:35 AM, "Fuchs, Andreas (SwissTXT)"
>>> > <An...@swisstxt.ch> wrote:
>>> >
>>> > >Hi CS Users
>>> > >
>>> > >We are running CS 3.0.6 on a vSphere platform and found a strange
>>> > >behavior.
>>> > >
>>> > >When the vCenter becomes unavailable due to a reboot or some other
>>> > >issue, it seems that CS is shutting down instances when vCenter
>>> > >becomes available again.
>>> > >
>>> > >What we think happens:
>>> > >1. vCenter becomes unreachable
>>> > >2. CS marks the ESX servers as "down"
>>> > >3. We think this leads to: CS marks the instances as down as well
>>> > >4. When vCenter becomes available again, CS stops the "marked as down"
>>> > >instances
>>> > >
>>> > >This is very bad, as the instances were running all the time and
>>> > >the shutdown issued by CS forces a service interruption.
>>> > >
>>> > >My problem is that I cannot really reproduce it, as a lot of testing
>>> > >is ongoing on the platform at the moment, so my questions:
>>> > >
>>> > >Does anyone else see this issue as well, and can maybe reproduce it?
>>> > >Is there a workaround to it, can I change some flag or something
>>> > >which tells CS to never shut down an instance by itself?
>>> > >Why are the ESX hosts getting marked as down and not unreachable
>>> > >or something?
>>> > >
>>> > >Best regards
>>> > >Andi
>>> >
>>> >
>>>
>>>
>>>
>
>
>
>
>



Re: Issues when vCenter becomes unavailable

Posted by Kelven Yang <ke...@citrix.com>.
This is an issue we are aiming to solve by syncing state between
vCenter and CloudStack in a controllable way. Please track the status of this
ticket for further progress:

https://issues.apache.org/jira/browse/CLOUDSTACK-669


Kelven


On 2/22/13 3:51 PM, "Musayev, Ilya" <im...@webmd.net> wrote:

>Abit Incomplete email as I was in train and mistakenly press send,
>correction below:.. sorry :)
>
>-----Original Message-----
>From: Musayev, Ilya [mailto:imusayev@webmd.net]
>Sent: Friday, February 22, 2013 6:49 PM
>To: cloudstack-dev@incubator.apache.org;
>cloudstack-users@incubator.apache.org
>Cc: Kelven Yang
>Subject: RE: Issues when vCenter becomes unavailable
>
>Summary:
>
>I have 3 hypervisors
>Hypervisor 1 and 2 are down, hypervisor 3 is up. All VMs live on
>hypervisor 3, however, the host_id in instance table for the VMs are not
>being updated to reflect the only hypervisor alive.
>
>Details:
>
>I physically powered off 2 hypervisors that had most of my VMs and left 1
>online.
>
>The VMs were brought back online by vcenter, however from then on, I
>experience what Dave and Andreas mentioned.
>
>That is, VMWare VMs instances are bound to host id (hypervisor) and not
>vcenter and operations that would be executed on the VMs require for the
>hypervisor to stay up. If the hypervisor goes off line, while VMs still
>come up in VC, CS cannot comprehend that these VMs now live on another
>hypervisor. 
>
>This is bad for production roll outs - because VMs are bound to a
>hypervisor ID and not virtual center and it appears its not getting
>updated - though I do see in the log that CS is trying to find it.
>
>Did a little more digging, it looks like the host_ids don't get updated
>in mysql for vm in instances table. I need to double check on this
>because I totally messed 2 of test cloudstack clusters.
>
>Can someone do the following test - if time allows - if not - I can try
>on monday:
>
>1) Pick a hypervisor for a test crash and note 1 vm (I.e. i-2-89)
>2) Navigate to "host" table in mysql and note the host_id for hypervisor
>that is about to be powered off.
>3) In mysql goto instances table and note the last_host_id and host_id
>for a VM on test crash hypervisor.
>4) Power off the hypervisor and let VCenter bring it back online
>5) Attempt to launch a console on the VM was on crashed hypervisors and
>was powered back on by VC
>6) If it fails - as it did in my case, alter the value of host_id to a
>next hypervisor its living on (my test is not clean because I've ruined
>the cluster that hosts my console vm and don't have time now to work on
>it ATM)
>7) Launch console again to see if the issue resolved
>
>I'm under suspicion the host_id does not get updated as I witnessed by
>examining mysql instance table, but I need to fix my env issues to
>confirm.
>
>Regards
>ilya
>
>
>-----Original Message-----
>From: Chiradeep Vittal [mailto:Chiradeep.Vittal@citrix.com]
>Sent: Friday, February 22, 2013 3:41 PM
>To: cloudstack-users@incubator.apache.org
>Cc: Kelven Yang; CloudStack DeveloperList
>Subject: Re: Issues when vCenter becomes unavailable
>
>CC'ing Kelven to see if he has any ideas.
>
>On 2/22/13 12:22 PM, "Dave Dunaway" <da...@gmail.com> wrote:
>
>>If I may suggest also testing a disconnect of a host (hypervisor) from
>>vcenter, so that vcenter and CS can still talk, but vcenter cannot talk
>>to the hosts (hypervisors). CS marks the host as down or failed or
>>whatever.
>>
>>When the host comes back up vcenter can it just fine and all seems good.
>>That however is not the case (I had this with CS 3.0.5 and vmware esxi
>>5.0)
>>when CS tries to talk to vcenter and the previously disconnected host
>>(that is now recovered).
>>
>>What we experienced was that we had to migrate all guests off the
>>recovered host, and then destroy that host in CS, and re-create it.
>>Then we could migrate back onto it the guests which had been previously
>>migrated.
>>
>>The curious thing is that while CS did not want to send commands to the
>>host (it kept on saying host id=X has timedout when whatever command
>>was sent to it), CS WAS polling the host for resources and getting the
>>correct numbers.... so CS could in some ways talk to the host (ie: it
>>knew the capabilities, number of VMs on it, etc).
>>
>>Luckily for me this all happened in a test environment. In production,
>>this would have been a real nightmare!
>>
>>
>>dave
>>
>>
>>On Fri, Feb 22, 2013 at 2:48 PM, Musayev, Ilya <im...@webmd.net>
>>wrote:
>>
>>> Andi
>>>
>>> I'm on CS4.0. I simulated the VMWare VCenter 5 failure by adding a
>>>bogus  IP entry in /etc/hosts for 10 minutes for virtual center host.
>>>That in turn  made VC unreachable by CS.
>>>
>>> I then began executing commands and sure enough commands failed or
>>> backlogged. Once I restored VC connectivity, the backlogged commands
>>> executed and I did not experience any abnormalities.
>>>
>>> I will redo this test and leave VC off for an hour - maybe a need a
>>>longer  outage.
>>>
>>> Regards
>>> ilya
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Musayev, Ilya
>>> Sent: Thursday, February 21, 2013 2:43 PM
>>> To: cloudstack-users@incubator.apache.org
>>> Subject: RE: Issues when vCenter becomes unavailable
>>>
>>> This is definitely not the behavior we want with vcenter.
>>>
>>> I will test this out on my lab setup shortly.
>>>
>>> Thanks
>>> ilya
>>>
>>> -----Original Message-----
>>> From: Chip Childers [mailto:chip.childers@sungard.com]
>>> Sent: Thursday, February 21, 2013 9:40 AM
>>> To: cloudstack-users@incubator.apache.org
>>> Subject: Re: Issues when vCenter becomes unavailable
>>>
>>> On Thu, Feb 21, 2013 at 08:59:14AM -0500, Mathias Mullins wrote:
>>> > Andreas,
>>> >
>>> > The open source community doesn't support the Citrix version 3.0.6.
>>> > You need to report this via your Citrix Support contract. Sounds
>>> > like this could be a bug.
>>> >
>>> > Community - this could be a possible issue in 4.0.0 / 4.0.1. I
>>> > don't know if this test case has been explored.
>>>
>>> Thx - I forwarded to cs-dev@i.a.o to get the test engineers in the
>>> community to take a look.
>>>
>>> >
>>> > Thanks,
>>> > Matt Mullins
>>> > CloudPlatform Implementation Engineer Worldwide Cloud Services
>>> > Citrix System, Inc.
>>> > +1 (407) 920-1107  Office/Cell Phone
>>> > matt.mullins@citrix.com
>>> >
>>> >
>>> >
>>> > On 2/21/13 5:35 AM, "Fuchs, Andreas (SwissTXT)"
>>> > <An...@swisstxt.ch> wrote:
>>> >
>>> > >Hi CS Users
>>> > >
>>> > >We are running CS 3.0.6 on a vSphere platform and found a strange
>>> > >behavior.
>>> > >
>>> > >When the vCenter becomes unavailable due to a reboot or some other
>>> > >issue, it seems that CS is shutting down instances when vCenter
>>> > >becomes available again.
>>> > >
>>> > >What we think happens:
>>> > >1. vCenter becomes unreachable
>>> > >2. CS marks the ESX servers as "down"
>>> > >3. We think this leads to: CS marks the instances as down as well
>>> > >4. When vCenter becomes available again, CS stops the "marked as down"
>>> > >instances
>>> > >
>>> > >This is very bad, as the instances were running all the time and
>>> > >the shutdown issued by CS forces a service interruption.
>>> > >
>>> > >My problem is that I cannot really reproduce it, as a lot of testing
>>> > >is ongoing on the platform at the moment, so my questions:
>>> > >
>>> > >Does anyone else see this issue as well, and can maybe reproduce it?
>>> > >Is there a workaround to it, can I change some flag or something
>>> > >which tells CS to never shut down an instance by itself?
>>> > >Why are the ESX hosts getting marked as down and not unreachable
>>> > >or something?
>>> > >
>>> > >Best regards
>>> > >Andi
>>> >
>>> >
>>>
>>>
>>>
>
>
>
>
>


RE: Issues when vCenter becomes unavailable

Posted by "Musayev, Ilya" <im...@webmd.net>.
My previous email was a bit incomplete: I was on the train and mistakenly pressed send. Correction below, sorry :)

-----Original Message-----
From: Musayev, Ilya [mailto:imusayev@webmd.net] 
Sent: Friday, February 22, 2013 6:49 PM
To: cloudstack-dev@incubator.apache.org; cloudstack-users@incubator.apache.org
Cc: Kelven Yang
Subject: RE: Issues when vCenter becomes unavailable

Summary:

I have 3 hypervisors.
Hypervisors 1 and 2 are down, hypervisor 3 is up. All VMs live on hypervisor 3; however, the host_id in the instance table for the VMs is not being updated to reflect the only hypervisor alive.

Details:

I physically powered off 2 hypervisors that had most of my VMs and left 1 online.

The VMs were brought back online by vCenter; however, from then on I experienced what Dave and Andreas mentioned.

That is, VMware VM instances are bound to a host ID (hypervisor) and not to vCenter, and operations executed on the VMs require that hypervisor to stay up. If the hypervisor goes offline, the VMs still come up in VC, but CS cannot comprehend that these VMs now live on another hypervisor.

This is bad for production rollouts, because VMs are bound to a hypervisor ID and not to vCenter, and it appears the ID is not getting updated, though I do see in the log that CS is trying to find it.

I did a little more digging, and it looks like the host_ids don't get updated in MySQL for VMs in the instances table. I need to double-check this because I totally messed up 2 of my test CloudStack clusters.
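One way to check this suspicion is to list the VMs that CS still records as running on a host it considers down. A rough Python sketch follows, using an in-memory SQLite database as a stand-in for the MySQL database. The table name `vm_instance`, the `state` and `status` columns, and their values are assumptions made for illustration; only `host_id` and `last_host_id` come from this thread, so check your own schema before adapting the query.

```python
import sqlite3

# In-memory stand-in for the CloudStack database. Table and column names are
# assumptions modelled on this thread (host_id, last_host_id), not a verified schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE host (id INTEGER PRIMARY KEY, name TEXT, status TEXT);
    CREATE TABLE vm_instance (
        id INTEGER PRIMARY KEY, instance_name TEXT, state TEXT,
        host_id INTEGER, last_host_id INTEGER
    );
    INSERT INTO host VALUES (1, 'esx1', 'Down'), (2, 'esx2', 'Down'), (3, 'esx3', 'Up');
    INSERT INTO vm_instance VALUES
        (89, 'i-2-89', 'Running', 1, 1),
        (90, 'i-2-90', 'Running', 2, 2),
        (91, 'i-2-91', 'Running', 3, 3);
""")


def running_vms_on_down_hosts(connection):
    """Return (instance_name, host_name, host_status) for suspicious rows."""
    return connection.execute("""
        SELECT v.instance_name, h.name, h.status
        FROM vm_instance AS v
        JOIN host AS h ON h.id = v.host_id
        WHERE v.state = 'Running' AND h.status <> 'Up'
        ORDER BY v.instance_name
    """).fetchall()


for row in running_vms_on_down_hosts(conn):
    print(row)
```

Any row returned here is a VM whose recorded host_id points at a host that is not up, which is the symptom described above.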

Can someone do the following test, if time allows? If not, I can try on Monday:

1) Pick a hypervisor for a test crash and note 1 VM on it (e.g. i-2-89)
2) Navigate to the "host" table in MySQL and note the host_id for the hypervisor that is about to be powered off.
3) In MySQL, go to the instances table and note the last_host_id and host_id for a VM on the test-crash hypervisor.
4) Power off the hypervisor and let vCenter bring the VM back online
5) Attempt to launch a console on the VM that was on the crashed hypervisor and was powered back on by VC
6) If it fails, as it did in my case, alter the value of host_id to the hypervisor the VM is now living on (my test is not clean because I've ruined the cluster that hosts my console VM and don't have time to work on it ATM)
7) Launch the console again to see if the issue is resolved
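The database side of steps 3 and 6 could be sketched as below, again in Python against an in-memory SQLite stand-in so that it runs anywhere. The table name `vm_instance`, the `instance_name` column, and both helper functions are hypothetical; only `host_id`, `last_host_id` and the example VM name i-2-89 come from this thread. Editing the real database by hand is a test-environment experiment, not a recommended fix.

```python
import sqlite3

# Stand-in database; the schema is an assumption made for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE vm_instance (
        id INTEGER PRIMARY KEY, instance_name TEXT,
        host_id INTEGER, last_host_id INTEGER
    );
    INSERT INTO vm_instance VALUES (89, 'i-2-89', 1, 1);
""")


def placement(connection, instance_name):
    """Step 3: return (host_id, last_host_id) for one VM, or None if unknown."""
    return connection.execute(
        "SELECT host_id, last_host_id FROM vm_instance WHERE instance_name = ?",
        (instance_name,),
    ).fetchone()


def repoint_vm(connection, instance_name, new_host_id):
    """Step 6: set host_id to the host the VM now lives on; return rows changed."""
    with connection:  # commits on success, rolls back on error
        cursor = connection.execute(
            "UPDATE vm_instance SET host_id = ? WHERE instance_name = ?",
            (new_host_id, instance_name),
        )
    return cursor.rowcount


before = placement(conn, "i-2-89")   # noted before the crash
repoint_vm(conn, "i-2-89", 3)        # host 3 is the surviving hypervisor here
after = placement(conn, "i-2-89")
print(before, "->", after)
```

Recording the before and after values makes it easy to revert the change if the console still fails in step 7.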

I suspect the host_id does not get updated, based on what I saw when examining the MySQL instance table, but I need to fix my env issues to confirm.

Regards
ilya


-----Original Message-----
From: Chiradeep Vittal [mailto:Chiradeep.Vittal@citrix.com]
Sent: Friday, February 22, 2013 3:41 PM
To: cloudstack-users@incubator.apache.org
Cc: Kelven Yang; CloudStack DeveloperList
Subject: Re: Issues when vCenter becomes unavailable

CC'ing Kelven to see if he has any ideas.

On 2/22/13 12:22 PM, "Dave Dunaway" <da...@gmail.com> wrote:

>If I may suggest also testing a disconnect of a host (hypervisor) from 
>vcenter, so that vcenter and CS can still talk, but vcenter cannot talk 
>to the hosts (hypervisors). CS marks the host as down or failed or whatever.
>
>When the host comes back up vcenter can it just fine and all seems good.
>That however is not the case (I had this with CS 3.0.5 and vmware esxi
>5.0)
>when CS tries to talk to vcenter and the previously disconnected host 
>(that is now recovered).
>
>What we experienced was that we had to migrate all guests off the 
>recovered host, and then destroy that host in CS, and re-create it.
>Then we could migrate back onto it the guests which had been previously 
>migrated.
>
>The curious thing is that while CS did not want to send commands to the 
>host (it kept on saying host id=X has timedout when whatever command 
>was sent to it), CS WAS polling the host for resources and getting the 
>correct numbers.... so CS could in some ways talk to the host (ie: it 
>knew the capabilities, number of VMs on it, etc).
>
>Luckily for me this all happened in a test environment. In production, 
>this would have been a real nightmare!
>
>
>dave
>
>
>On Fri, Feb 22, 2013 at 2:48 PM, Musayev, Ilya <im...@webmd.net> wrote:
>
>> Andi
>>
>> I'm on CS4.0. I simulated the VMWare VCenter 5 failure by adding a 
>>bogus  IP entry in /etc/hosts for 10 minutes for virtual center host.
>>That in turn  made VC unreachable by CS.
>>
>> I then began executing commands and sure enough commands failed or 
>> backlogged. Once I restored VC connectivity, the backlogged commands 
>> executed and I did not experience any abnormalities.
>>
>> I will redo this test and leave VC off for an hour - maybe a need a 
>>longer  outage.
>>
>> Regards
>> ilya
>>
>>
>>
>> -----Original Message-----
>> From: Musayev, Ilya
>> Sent: Thursday, February 21, 2013 2:43 PM
>> To: cloudstack-users@incubator.apache.org
>> Subject: RE: Issues when vCenter becomes unavailable
>>
>> This is definitely not the behavior we want with vcenter.
>>
>> I will test this out on my lab setup shortly.
>>
>> Thanks
>> ilya
>>
>> -----Original Message-----
>> From: Chip Childers [mailto:chip.childers@sungard.com]
>> Sent: Thursday, February 21, 2013 9:40 AM
>> To: cloudstack-users@incubator.apache.org
>> Subject: Re: Issues when vCenter becomes unavailable
>>
>> On Thu, Feb 21, 2013 at 08:59:14AM -0500, Mathias Mullins wrote:
>> > Andreas,
>> >
>> > The open source community doesn't support the Citrix version 3.0.6.
>> > You need to report this via your Citrix Support contract. Sounds 
>> > like this could be a bug.
>> >
>> > Community - this could be a possible issue in 4.0.0 / 4.0.1. I 
>> > don't know if this test case has been explored.
>>
>> Thx - I forwarded to cs-dev@i.a.o to get the test engineers in the 
>> community to take a look.
>>
>> >
>> > Thanks,
>> > Matt Mullins
>> > CloudPlatform Implementation Engineer Worldwide Cloud Services 
>> > Citrix System, Inc.
>> > +1 (407) 920-1107  Office/Cell Phone
>> > matt.mullins@citrix.com
>> >
>> >
>> >
>> > On 2/21/13 5:35 AM, "Fuchs, Andreas (SwissTXT)"
>> > <An...@swisstxt.ch> wrote:
>> >
>> > >Hi CS Users
>> > >
>> > >We are running CS 3.0.6 on a vSphere platform and found a strange 
>> > >behavior.
>> > >
>> > >When the vCenter becomes unavailable due to a reboot or some other 
>> > >issue, it seems that CS is shutting down instances when vCenter 
>> > >becomes available again.
>> > >
>> > >What we think happens:
>> > >1. vCenter becomes unreachable
>> > >2. CS marks the ESX servers as "down"
>> > >3. We think this leads to: CS marks the instances as down as well
>> > >4. When vCenter becomes available again, CS stops the "marked as down"
>> > >instances
>> > >
>> > >This is very bad, as the instances were running all the time and
>> > >the shutdown issued by CS forces a service interruption.
>> > >
>> > >My problem is that I cannot really reproduce it, as a lot of testing
>> > >is ongoing on the platform at the moment, so my questions:
>> > >
>> > >Does anyone else see this issue as well, and can maybe reproduce it?
>> > >Is there a workaround to it, can I change some flag or something
>> > >which tells CS to never shut down an instance by itself?
>> > >Why are the ESX hosts getting marked as down and not unreachable
>> > >or something?
>> > >
>> > >Best regards
>> > >Andi
>> >
>> >
>>
>>
>>






RE: Issues when vCenter becomes unavailable

Posted by "Musayev, Ilya" <im...@webmd.net>.
My previous email was a bit incomplete: I was on the train and mistakenly pressed send. Correction below, sorry :)

-----Original Message-----
From: Musayev, Ilya [mailto:imusayev@webmd.net] 
Sent: Friday, February 22, 2013 6:49 PM
To: cloudstack-dev@incubator.apache.org; cloudstack-users@incubator.apache.org
Cc: Kelven Yang
Subject: RE: Issues when vCenter becomes unavailable

Summary:

I have 3 hypervisors
Hypervisor 1 and 2 are down, hypervisor 3 is up. All VMs live on hypervisor 3, however, the host_id in instance table for the VMs are not being updated to reflect the only hypervisor alive.

Details:

I physically powered off 2 hypervisors that had most of my VMs and left 1 online.

The VMs were brought back online by vcenter; however, from then on, I experienced what Dave and Andreas mentioned.

That is, VMware VM instances are bound to a host id (hypervisor) and not to vcenter, and operations executed on the VMs require that hypervisor to stay up. If the hypervisor goes offline, while the VMs still come up in VC, CS cannot comprehend that these VMs now live on another hypervisor.

This is bad for production rollouts, because VMs are bound to a hypervisor ID and not to virtual center, and it appears it's not getting updated - though I do see in the log that CS is trying to find it.

Did a little more digging; it looks like the host_ids don't get updated in mysql for VMs in the instances table. I need to double check on this because I totally messed up 2 of my test cloudstack clusters.

Can someone do the following test, if time allows? If not, I can try on Monday:

1) Pick a hypervisor for a test crash and note 1 VM on it (e.g. i-2-89)
2) Navigate to the "host" table in mysql and note the host_id for the hypervisor that is about to be powered off.
3) In mysql, go to the instances table and note the last_host_id and host_id for a VM on the test crash hypervisor.
4) Power off the hypervisor and let vCenter bring the VM back online
5) Attempt to launch a console on the VM that was on the crashed hypervisor and was powered back on by VC
6) If it fails, as it did in my case, alter the value of host_id to the hypervisor it is now living on (my test is not clean because I've ruined the cluster that hosts my console VM and don't have time to work on it ATM)
7) Launch the console again to see if the issue is resolved

I suspect the host_id does not get updated, as I saw when examining the mysql instance table, but I need to fix my env issues to confirm.
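
For anyone who wants to script the comparison from steps 2, 3 and 6, here is a minimal sketch in Python. It is only an illustration under stated assumptions: it uses an in-memory sqlite3 database as a stand-in for the CloudStack MySQL database, the table and column names (vm_instance, instance_name, host_id, last_host_id) are assumptions that should be checked against your own schema, and the "actual" placement is a hand-made dict standing in for whatever vCenter shows. It does not talk to CloudStack or vCenter.

```python
import sqlite3


def find_stale_vms(conn, actual_placement):
    """Return (instance_name, recorded_host_id, actual_host_id) for every VM
    whose recorded host_id differs from where it is actually running.

    actual_placement maps instance name -> host id as observed in vCenter
    (collected by hand here; this function does not query vCenter).
    """
    rows = conn.execute(
        "SELECT instance_name, host_id FROM vm_instance ORDER BY instance_name"
    ).fetchall()
    stale = []
    for name, recorded_host_id in rows:
        actual_host_id = actual_placement.get(name)
        # Skip VMs we have no observation for; flag the rest on mismatch.
        if actual_host_id is not None and actual_host_id != recorded_host_id:
            stale.append((name, recorded_host_id, actual_host_id))
    return stale


def demo():
    # Hypothetical stand-in schema and data: 3 hypervisors, 1 and 2 powered off.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE host (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute(
        "CREATE TABLE vm_instance ("
        "instance_name TEXT, host_id INTEGER, last_host_id INTEGER)"
    )
    conn.executemany(
        "INSERT INTO host VALUES (?, ?)",
        [(1, "hv1"), (2, "hv2"), (3, "hv3")],
    )
    conn.executemany(
        "INSERT INTO vm_instance VALUES (?, ?, ?)",
        [("i-2-89", 1, 1), ("i-2-90", 3, 3)],
    )
    # After the crash, vCenter restarted i-2-89 on hypervisor 3.
    actual = {"i-2-89": 3, "i-2-90": 3}
    return find_stale_vms(conn, actual)


if __name__ == "__main__":
    for name, recorded, actual in demo():
        print(f"{name}: recorded host_id={recorded}, actually on host {actual}")
```

On a real setup the same SELECT can be run through the mysql client, with the dict filled in from what vCenter shows. This only reads from the database; take a backup before altering host_id by hand as in step 6.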

Regards
ilya


-----Original Message-----
From: Chiradeep Vittal [mailto:Chiradeep.Vittal@citrix.com]
Sent: Friday, February 22, 2013 3:41 PM
To: cloudstack-users@incubator.apache.org
Cc: Kelven Yang; CloudStack DeveloperList
Subject: Re: Issues when vCenter becomes unavailable

CC'ing Kelven to see if he has any ideas.

On 2/22/13 12:22 PM, "Dave Dunaway" <da...@gmail.com> wrote:

>If I may suggest also testing a disconnect of a host (hypervisor) from 
>vcenter, so that vcenter and CS can still talk, but vcenter cannot talk 
>to the hosts (hypervisors). CS marks the host as down or failed or whatever.
>
>When the host comes back up, vcenter can talk to it just fine and all seems good.
>That however is not the case (I had this with CS 3.0.5 and vmware esxi 5.0)
>when CS tries to talk to vcenter and the previously disconnected host
>(that is now recovered).
>
>What we experienced was that we had to migrate all guests off the 
>recovered host, and then destroy that host in CS, and re-create it.
>Then we could migrate back onto it the guests which had been previously 
>migrated.
>
>The curious thing is that while CS did not want to send commands to the
>host (it kept on saying host id=X has timed out when whatever command
>was sent to it), CS WAS polling the host for resources and getting the
>correct numbers.... so CS could in some ways talk to the host (i.e. it
>knew the capabilities, number of VMs on it, etc).
>
>Luckily for me this all happened in a test environment. In production, 
>this would have been a real nightmare!
>
>
>dave
>
>
>On Fri, Feb 22, 2013 at 2:48 PM, Musayev, Ilya <im...@webmd.net> wrote:
>
>> Andi
>>
>> I'm on CS 4.0. I simulated the VMware vCenter 5 failure by adding a
>> bogus IP entry in /etc/hosts for 10 minutes for the virtual center host.
>> That in turn made VC unreachable by CS.
>>
>> I then began executing commands and sure enough commands failed or 
>> backlogged. Once I restored VC connectivity, the backlogged commands 
>> executed and I did not experience any abnormalities.
>>
>> I will redo this test and leave VC off for an hour - maybe I need a
>> longer outage.
>>
>> Regards
>> ilya
>>
>>
>>
>> -----Original Message-----
>> From: Musayev, Ilya
>> Sent: Thursday, February 21, 2013 2:43 PM
>> To: cloudstack-users@incubator.apache.org
>> Subject: RE: Issues when vCenter becomes unavailable
>>
>> This is definitely not the behavior we want with vcenter.
>>
>> I will test this out on my lab setup shortly.
>>
>> Thanks
>> ilya
>>
>> -----Original Message-----
>> From: Chip Childers [mailto:chip.childers@sungard.com]
>> Sent: Thursday, February 21, 2013 9:40 AM
>> To: cloudstack-users@incubator.apache.org
>> Subject: Re: Issues when vCenter becomes unavailable
>>
>> On Thu, Feb 21, 2013 at 08:59:14AM -0500, Mathias Mullins wrote:
>> > Andreas,
>> >
>> > The open source community doesn't support the Citrix version 3.0.6.
>> > You need to report this via your Citrix Support contract. Sounds 
>> > like this could be a bug.
>> >
>> > Community - this could be a possible issue in 4.0.0 / 4.0.1. I 
>> > don't know if this test case has been explored.
>>
>> Thx - I forwarded to cs-dev@i.a.o to get the test engineers in the 
>> community to take a look.
>>
>> >
>> > Thanks,
>> > Matt Mullins
>> > CloudPlatform Implementation Engineer Worldwide Cloud Services 
>> > Citrix System, Inc.
>> > +1 (407) 920-1107  Office/Cell Phone
>> > matt.mullins@citrix.com
>> >
>> >
>> >
>> > On 2/21/13 5:35 AM, "Fuchs, Andreas (SwissTXT)"
>> > <An...@swisstxt.ch> wrote:
>> >
>> > >Hi CS Users
>> > >
>> > >We are running CS 3.0.6 on a vSphere platform and found a strange 
>> > >behavior.
>> > >
>> > >When the vCenter becomes unavailable due to a reboot or some other 
>> > >issue, it seems that CS is shutting down instances when vCenter 
>> > >becomes available again.
>> > >
>> > >What we think happens:
>> > >1. vCenter becomes unreachable
>> > >2. CS marks the ESX servers as "down"
>> > >3. We think this leads to: CS marks the instances as down as well
>> > >4. When vCenter becomes available again, CS stops the "marked as down"
>> > >instances
>> > >
>> > >This is very bad as the instances were running all the time and
>> > >the shutdown issued by CS is forcing a service interruption.
>> > >
>> > >My problem is that I cannot really reproduce it, as a lot of testing
>> > >is ongoing on the platform at the moment, so my question:
>> > >
>> > >Does someone else see this issue as well and can maybe reproduce it?
>> > >Is there a workaround for it? Can I change some flag or something
>> > >which tells CS to never shut down an instance by itself?
>> > >Why are the ESX hosts getting marked as down and not unreachable 
>> > >or something?
>> > >
>> > >Best regards
>> > >Andi
>> >
>> >
>>
>>
>>






RE: Issues when vCenter becomes unavailable

Posted by "Musayev, Ilya" <im...@webmd.net>.
Summary:

I have 3 hypervisors.
Hypervisors 1 and 2 are down, hypervisor 3 is up. All VMs live on hypervisor 3; however, the host_id in the instance table for the VMs is not being updated to reflect the only hypervisor alive.

Details:

I physically powered off 2 hypervisors that had most of my VMs and left 1 online.

The VMs were brought back online by vcenter; however, from then on, I experienced what Dave and Andreas mentioned.

That is, VMware VM instances are bound to a host id (hypervisor) and not to vcenter, and operations executed on the VMs require that hypervisor to stay up. If the hypervisor goes offline, while the VMs still come up in VC, CS cannot comprehend that these VMs now live on another hypervisor.

This is bad for production rollouts, because VMs are bound to a hypervisor ID and not to virtual center, and it appears it's not getting updated - though I do see in the log that CS is

Did a little more digging; it looks like the host_ids don't get updated in mysql for VMs in the instances table. I need to double check on this because I totally messed up 2 of my test cloudstack clusters.

Can someone do the following test, if time allows? If not, I can try on Monday:

1) Pick a hypervisor for a test crash and note 1 VM on it (e.g. i-2-89)
2) Navigate to the "host" table in mysql and note the host_id for the hypervisor that is about to be powered off.
3) In mysql, go to the instances table and note the last_host_id and host_id for a VM on the test crash hypervisor.
4) Power off the hypervisor and let vCenter bring the VM back online
5) Attempt to launch a console on the VM that was on the crashed hypervisor and was powered back on by VC
6) If it fails, as it did in my case, alter the value of host_id to the hypervisor it is now living on (my test is not clean because I've ruined the cluster that hosts my console VM and don't have time to work on it ATM)
7) Launch the console again to see if the issue is resolved

I suspect the host_id does not get updated, as I saw when examining the mysql instance table, but I need to fix my env issues to confirm.

Regards
ilya


-----Original Message-----
From: Chiradeep Vittal [mailto:Chiradeep.Vittal@citrix.com] 
Sent: Friday, February 22, 2013 3:41 PM
To: cloudstack-users@incubator.apache.org
Cc: Kelven Yang; CloudStack DeveloperList
Subject: Re: Issues when vCenter becomes unavailable

CC'ing Kelven to see if he has any ideas.

On 2/22/13 12:22 PM, "Dave Dunaway" <da...@gmail.com> wrote:

>If I may suggest also testing a disconnect of a host (hypervisor) from 
>vcenter, so that vcenter and CS can still talk, but vcenter cannot talk 
>to the hosts (hypervisors). CS marks the host as down or failed or whatever.
>
>When the host comes back up, vcenter can talk to it just fine and all seems good.
>That however is not the case (I had this with CS 3.0.5 and vmware esxi 5.0)
>when CS tries to talk to vcenter and the previously disconnected host
>(that is now recovered).
>
>What we experienced was that we had to migrate all guests off the 
>recovered host, and then destroy that host in CS, and re-create it. 
>Then we could migrate back onto it the guests which had been previously 
>migrated.
>
>The curious thing is that while CS did not want to send commands to the
>host (it kept on saying host id=X has timed out when whatever command
>was sent to it), CS WAS polling the host for resources and getting the
>correct numbers.... so CS could in some ways talk to the host (i.e. it
>knew the capabilities, number of VMs on it, etc).
>
>Luckily for me this all happened in a test environment. In production, 
>this would have been a real nightmare!
>
>
>dave
>
>
>On Fri, Feb 22, 2013 at 2:48 PM, Musayev, Ilya <im...@webmd.net> wrote:
>
>> Andi
>>
>> I'm on CS 4.0. I simulated the VMware vCenter 5 failure by adding a
>> bogus IP entry in /etc/hosts for 10 minutes for the virtual center host.
>> That in turn made VC unreachable by CS.
>>
>> I then began executing commands and sure enough commands failed or 
>> backlogged. Once I restored VC connectivity, the backlogged commands 
>> executed and I did not experience any abnormalities.
>>
>> I will redo this test and leave VC off for an hour - maybe I need a
>> longer outage.
>>
>> Regards
>> ilya
>>
>>
>>
>> -----Original Message-----
>> From: Musayev, Ilya
>> Sent: Thursday, February 21, 2013 2:43 PM
>> To: cloudstack-users@incubator.apache.org
>> Subject: RE: Issues when vCenter becomes unavailable
>>
>> This is definitely not the behavior we want with vcenter.
>>
>> I will test this out on my lab setup shortly.
>>
>> Thanks
>> ilya
>>
>> -----Original Message-----
>> From: Chip Childers [mailto:chip.childers@sungard.com]
>> Sent: Thursday, February 21, 2013 9:40 AM
>> To: cloudstack-users@incubator.apache.org
>> Subject: Re: Issues when vCenter becomes unavailable
>>
>> On Thu, Feb 21, 2013 at 08:59:14AM -0500, Mathias Mullins wrote:
>> > Andreas,
>> >
>> > The open source community doesn't support the Citrix version 3.0.6.
>> > You need to report this via your Citrix Support contract. Sounds 
>> > like this could be a bug.
>> >
>> > Community - this could be a possible issue in 4.0.0 / 4.0.1. I 
>> > don't know if this test case has been explored.
>>
>> Thx - I forwarded to cs-dev@i.a.o to get the test engineers in the 
>> community to take a look.
>>
>> >
>> > Thanks,
>> > Matt Mullins
>> > CloudPlatform Implementation Engineer Worldwide Cloud Services  
>> > Citrix System, Inc.
>> > +1 (407) 920-1107  Office/Cell Phone
>> > matt.mullins@citrix.com
>> >
>> >
>> >
>> > On 2/21/13 5:35 AM, "Fuchs, Andreas (SwissTXT)"
>> > <An...@swisstxt.ch> wrote:
>> >
>> > >Hi CS Users
>> > >
>> > >We are running CS 3.0.6 on a vSphere platform and found a strange 
>> > >behavior.
>> > >
>> > >When the vCenter becomes unavailable due to a reboot or some other 
>> > >issue, it seems that CS is shutting down instances when vCenter 
>> > >becomes available again.
>> > >
>> > >What we think happens:
>> > >1. vCenter becomes unreachable
>> > >2. CS marks the ESX servers as "down"
>> > >3. We think this leads to: CS marks the instances as down as well
>> > >4. When vCenter becomes available again, CS stops the "marked as down"
>> > >instances
>> > >
>> > >This is very bad as the instances were running all the time and
>> > >the shutdown issued by CS is forcing a service interruption.
>> > >
>> > >My problem is that I cannot really reproduce it, as a lot of testing
>> > >is ongoing on the platform at the moment, so my question:
>> > >
>> > >Does someone else see this issue as well and can maybe reproduce it?
>> > >Is there a workaround for it? Can I change some flag or something
>> > >which tells CS to never shut down an instance by itself?
>> > >Why are the ESX hosts getting marked as down and not unreachable 
>> > >or something?
>> > >
>> > >Best regards
>> > >Andi
>> >
>> >
>>
>>
>>



