
Posted to dev@cloudstack.apache.org by "Musayev, Ilya" <im...@webmd.net> on 2013/02/26 19:23:32 UTC

[Blocker][ACS41] Issues when vCenter becomes unavailable

Dear CS Dev Community,

Please confirm whether this issue qualifies as a blocker, and advise what can be done about it.

Thanks
ilya

From: Musayev, Ilya
Sent: Tuesday, February 26, 2013 12:00 PM
To: Musayev, Ilya; kelven.yang@citrix.com; cloudstack-dev@incubator.apache.org; cloudstack-users@incubator.apache.org
Subject: RE: Issues when vCenter becomes unavailable

FYI, please note this JIRA issue. If there is something I left out, please chime in.

Thanks
ilya

https://issues.apache.org/jira/browse/CLOUDSTACK-1411



From: Musayev, Ilya
Sent: Saturday, February 23, 2013 6:22 PM
To: kelven.yang@citrix.com; cloudstack-dev@incubator.apache.org; cloudstack-users@incubator.apache.org
Subject: Re: Issues when vCenter becomes unavailable

Any chance of some sort of fix for 4.0 or 4.1?

I understand that the CS-669 (feature/enhancement) patch missed the commit deadline and will be in 4.2, but there is a real issue here that impacts production now.

Also, this is not a feature but a bug. I don't know if bugs are treated on the same schedule as features.

Technically, for testing - we don't need to fail hypervisors. vMotion would achieve the same effect and host ID will get out of sync. It's only a theory though.

I will open a bug request on JIRA and ask for some visibility.

Alternatively, we can probably have a hack that will query VC for hosts and vms, identify what's changed, and update db - I'm just trying to avoid hacks.
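
A minimal sketch of the "identify what's changed" part of such a hack, assuming the VM-to-host mapping has already been read from the CS database and from VC into two plain dictionaries. The function name, the HostChange type and the host names are hypothetical and exist only for illustration; nothing here is CloudStack or vSphere API.

```python
from typing import Dict, List, NamedTuple, Optional


class HostChange(NamedTuple):
    vm_name: str
    db_host: Optional[str]  # host that CS has on record (assumed input)
    vc_host: str            # host that VC currently reports (assumed input)


def find_host_drift(db_view: Dict[str, Optional[str]],
                    vc_view: Dict[str, str]) -> List[HostChange]:
    """Return the VMs whose recorded host differs from the host VC reports."""
    changes = []
    for vm_name, vc_host in sorted(vc_view.items()):
        if vm_name not in db_view:
            continue  # VM is not known to CS, so leave it alone
        if db_view[vm_name] != vc_host:
            changes.append(HostChange(vm_name, db_view[vm_name], vc_host))
    return changes


if __name__ == "__main__":
    # Hypothetical data: i-2-89 was restarted by VC on another host.
    db_view = {"i-2-89": "esx-1", "i-2-90": "esx-3"}
    vc_view = {"i-2-89": "esx-3", "i-2-90": "esx-3", "unmanaged-vm": "esx-3"}
    for change in find_host_drift(db_view, vc_view):
        print(change)
```

The result is only a list of candidate corrections. Whether and how to write them back to the db is the part that would still need care.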

Kelven Yang <ke...@citrix.com> wrote:
This is an issue that we are targeting: syncing states between
vCenter and CloudStack in a controllable way. Please track the status of
this ticket for further progress:

https://issues.apache.org/jira/browse/CLOUDSTACK-669


Kelven


On 2/22/13 3:51 PM, "Musayev, Ilya" <im...@webmd.net> wrote:

>A bit of an incomplete email, as I was on the train and mistakenly
>pressed send. Correction below, sorry :)
>
>-----Original Message-----
>From: Musayev, Ilya [mailto:imusayev@webmd.net]
>Sent: Friday, February 22, 2013 6:49 PM
>To: cloudstack-dev@incubator.apache.org;
>cloudstack-users@incubator.apache.org
>Cc: Kelven Yang
>Subject: RE: Issues when vCenter becomes unavailable
>
>Summary:
>
>I have 3 hypervisors
>Hypervisors 1 and 2 are down, hypervisor 3 is up. All VMs live on
>hypervisor 3; however, the host_id in the instance table for the VMs is
>not being updated to reflect the only hypervisor alive.
>
>Details:
>
>I physically powered off 2 hypervisors that had most of my VMs and left 1
>online.
>
>The VMs were brought back online by vcenter, however from then on, I
>experience what Dave and Andreas mentioned.
>
>That is, VMware VM instances are bound to a host ID (hypervisor) and not
>to vCenter, and operations executed on the VMs require that hypervisor
>to stay up. If the hypervisor goes offline, the VMs still come up in VC,
>but CS cannot comprehend that these VMs now live on another hypervisor.
>
>This is bad for production rollouts, because VMs are bound to a
>hypervisor ID and not to vCenter, and it appears the ID is not getting
>updated, though I do see in the log that CS is trying to find it.
>
>I did a little more digging, and it looks like the host_ids don't get
>updated in MySQL for VMs in the instances table. I need to double check
>this because I totally messed up 2 of my test CloudStack clusters.
>
>Can someone do the following test if time allows? If not, I can try on
>Monday:
>
>1) Pick a hypervisor for a test crash and note 1 VM on it (e.g. i-2-89)
>2) Navigate to the "host" table in MySQL and note the host_id of the
>hypervisor that is about to be powered off.
>3) In MySQL, go to the instances table and note the last_host_id and
>host_id for a VM on the test crash hypervisor.
>4) Power off the hypervisor and let vCenter bring the VM back online
>5) Attempt to launch a console on the VM that was on the crashed
>hypervisor and was powered back on by VC
>6) If it fails, as it did in my case, alter the value of host_id to the
>hypervisor the VM is now living on (my test is not clean because I've
>ruined the cluster that hosts my console VM and don't have time to work
>on it ATM)
>7) Launch the console again to see if the issue is resolved
>
>I suspect the host_id does not get updated, based on what I saw when
>examining the MySQL instance table, but I need to fix my env issues to
>confirm.
>
>Regards
>ilya
>
>
>-----Original Message-----
>From: Chiradeep Vittal [mailto:Chiradeep.Vittal@citrix.com]
>Sent: Friday, February 22, 2013 3:41 PM
>To: cloudstack-users@incubator.apache.org
>Cc: Kelven Yang; CloudStack DeveloperList
>Subject: Re: Issues when vCenter becomes unavailable
>
>CC'ing Kelven to see if he has any ideas.
>
>On 2/22/13 12:22 PM, "Dave Dunaway" <da...@gmail.com>> wrote:
>
>>If I may suggest also testing a disconnect of a host (hypervisor) from
>>vcenter, so that vcenter and CS can still talk, but vcenter cannot talk
>>to the hosts (hypervisors). CS marks the host as down or failed or
>>whatever.
>>
>>When the host comes back up, vcenter can talk to it just fine and all
>>seems good. That however is not the case when CS tries to talk to
>>vcenter and the previously disconnected host (that is now recovered).
>>I had this with CS 3.0.5 and VMware ESXi 5.0.
>>
>>What we experienced was that we had to migrate all guests off the
>>recovered host, and then destroy that host in CS, and re-create it.
>>Then we could migrate back onto it the guests which had been previously
>>migrated.
>>
>>The curious thing is that while CS did not want to send commands to the
>>host (it kept on saying host id=X has timed out whenever a command
>>was sent to it), CS WAS polling the host for resources and getting the
>>correct numbers, so CS could in some ways talk to the host (i.e. it
>>knew the capabilities, number of VMs on it, etc).
>>
>>Luckily for me this all happened in a test environment. In production,
>>this would have been a real nightmare!
>>
>>
>>dave
>>
>>
>>On Fri, Feb 22, 2013 at 2:48 PM, Musayev, Ilya <im...@webmd.net>>
>>wrote:
>>
>>> Andi
>>>
>>> I'm on CS4.0. I simulated the VMware vCenter 5 failure by adding a
>>> bogus IP entry in /etc/hosts for 10 minutes for the virtual center
>>> host. That in turn made VC unreachable by CS.
>>>
>>> I then began executing commands and sure enough commands failed or
>>> backlogged. Once I restored VC connectivity, the backlogged commands
>>> executed and I did not experience any abnormalities.
>>>
>>> I will redo this test and leave VC off for an hour - maybe I need a
>>> longer outage.
>>>
>>> Regards
>>> ilya
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Musayev, Ilya
>>> Sent: Thursday, February 21, 2013 2:43 PM
>>> To: cloudstack-users@incubator.apache.org
>>> Subject: RE: Issues when vCenter becomes unavailable
>>>
>>> This is definitely not the behavior we want with vcenter.
>>>
>>> I will test this out on my lab setup shortly.
>>>
>>> Thanks
>>> ilya
>>>
>>> -----Original Message-----
>>> From: Chip Childers [mailto:chip.childers@sungard.com]
>>> Sent: Thursday, February 21, 2013 9:40 AM
>>> To: cloudstack-users@incubator.apache.org
>>> Subject: Re: Issues when vCenter becomes unavailable
>>>
>>> On Thu, Feb 21, 2013 at 08:59:14AM -0500, Mathias Mullins wrote:
>>> > Andreas,
>>> >
>>> > The open source community doesn't support the Citrix version 3.0.6.
>>> > You need to report this via your Citrix Support contract. Sounds
>>> > like this could be a bug.
>>> >
>>> > Community - this could be a possible issue in 4.0.0 / 4.0.1. I
>>> > don't know if this test case has been explored.
>>>
>>> Thx - I forwarded to cs-dev@i.a.o to get the test engineers in the
>>> community to take a look.
>>>
>>> >
>>> > Thanks,
>>> > Matt Mullins
>>> > CloudPlatform Implementation Engineer Worldwide Cloud Services
>>> > Citrix System, Inc.
>>> > +1 (407) 920-1107  Office/Cell Phone
>>> > matt.mullins@citrix.com
>>> >
>>> >
>>> >
>>> > On 2/21/13 5:35 AM, "Fuchs, Andreas (SwissTXT)"
>>> > <An...@swisstxt.ch>> wrote:
>>> >
>>> > >Hi CS Users
>>> > >
>>> > >We are running CS 3.0.6 on a vSphere platform and found a strange
>>> > >behavior.
>>> > >
>>> > >When the vCenter becomes unavailable due to a reboot or some other
>>> > >issue, it seems that CS is shutting down instances when vCenter
>>> > >becomes available again.
>>> > >
>>> > >What we think happens:
>>> > >1. vCenter becomes unreachable
>>> > >2. CS marks the ESX servers as "down"
>>> > >3. We think this leads to: CS marks the instances as down as well
>>> > >4. When vCenter becomes available again, CS stops the "marked as
>>> > >down" instances
>>> > >
>>> > >This is very bad, as the instances were running all the time and
>>> > >the shutdown issued by CS is forcing a service interruption.
>>> > >
>>> > >My problem is that I cannot really reproduce it, as a lot of testing
>>> > >is ongoing on the platform at the moment, so my questions:
>>> > >
>>> > >Does someone else see this issue as well and can maybe reproduce it?
>>> > >Is there a workaround, e.g. can I change some flag or something
>>> > >which tells CS to never shut down an instance by itself?
>>> > >Why are the ESX hosts getting marked as down and not unreachable
>>> > >or something?
>>> > >
>>> > >Best regards
>>> > >Andi
>>> >
>>> >
>>>
>>>
>>>
>
>
>
>
>
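
The manual check and workaround in steps 3 and 6 of the test procedure quoted above can be sketched roughly as follows. This is an illustration under loud assumptions: it uses an in-memory SQLite database as a stand-in for the MySQL database, and the table name vm_instance and column instance_name are hypothetical (the thread only speaks of an "instances table" with host_id and last_host_id). Editing the database by hand is the workaround described in the thread, not a recommended procedure.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Build a tiny stand-in table; schema and names are assumptions."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE vm_instance ("
        "instance_name TEXT PRIMARY KEY, host_id INTEGER, last_host_id INTEGER)"
    )
    # VM i-2-89 is recorded on host 1, the hypervisor about to be powered off.
    conn.execute("INSERT INTO vm_instance VALUES ('i-2-89', 1, 1)")
    conn.commit()
    return conn


def host_ids(conn: sqlite3.Connection, vm_name: str):
    """Step 3: read (host_id, last_host_id) for one VM, or None if unknown."""
    return conn.execute(
        "SELECT host_id, last_host_id FROM vm_instance WHERE instance_name = ?",
        (vm_name,),
    ).fetchone()


def repoint_vm(conn: sqlite3.Connection, vm_name: str, new_host_id: int) -> int:
    """Step 6: set host_id to the host the VM now runs on; returns rows changed."""
    cur = conn.execute(
        "UPDATE vm_instance SET host_id = ? WHERE instance_name = ?",
        (new_host_id, vm_name),
    )
    conn.commit()
    return cur.rowcount


if __name__ == "__main__":
    conn = make_demo_db()
    print("before:", host_ids(conn, "i-2-89"))
    repoint_vm(conn, "i-2-89", 3)  # host 3 is the surviving hypervisor here
    print("after:", host_ids(conn, "i-2-89"))
```

If host_id still shows the powered-off hypervisor after vCenter has restarted the VM elsewhere, that would match the behavior suspected in the thread.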

Re: [Blocker][ACS41] Issues when vCenter becomes unavailable

Posted by Chip Childers <ch...@sungard.com>.

CC'ing Kelven on this so that he can perhaps provide an opinion.
