Posted to users@cloudstack.apache.org by Valery Ciareszka <va...@gmail.com> on 2013/08/26 15:45:45 UTC

Re: cs 4.1 host disconnected status

Koushik,

Ok, imagine the server is offline (burned CPU, power supply, etc.), and there
is no way to get the host back online within 1-2 hours.
However, CS management still considers the host online.
What is the proper way to deal with this issue?
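
For readers following the quoted logs below, here is a minimal sketch of why the host can stay "Up": it models the investigation as a chain in which each investigator answers up, down, or "I don't know", and the status only changes on a definite answer. This is a simplified illustration based on the log messages in this thread, not CloudStack's actual code; the names ping_investigator, investigate, and next_status are hypothetical.

```python
from typing import Callable, Optional, Sequence

# Each investigator answers True (host is up), False (host is down),
# or None ("I don't know"), mirroring the log messages in this thread.
Investigator = Callable[[str], Optional[bool]]


def ping_investigator(host: str) -> Optional[bool]:
    # Hypothetical stand-in: a failed ping alone proves nothing,
    # so this investigator declines to answer.
    return None


def investigate(host: str, investigators: Sequence[Investigator]) -> Optional[bool]:
    """Return the first definite answer, or None if nobody can decide."""
    for inv in investigators:
        answer = inv(host)
        if answer is not None:
            return answer
    return None


def next_status(current: str, verdict: Optional[bool]) -> str:
    """Only a definite verdict changes the recorded host status."""
    if verdict is True:
        return "Up"
    if verdict is False:
        return "Down"
    return current  # undetermined: do nothing


verdict = investigate("172.16.20.241", [ping_investigator, ping_investigator])
print(next_status("Up", verdict))
```

In this model, a powered-off host and a host whose management NIC is unplugged both make every investigator return None, so the recorded status never leaves "Up".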



On Fri, Jul 12, 2013 at 2:20 PM, Koushik Das <ko...@citrix.com> wrote:

> I looked at the logs and none of the existing investigators is able to
> determine that the host is down. I am not sure if there is a clean way to
> identify whether a host is down in the case of KVM. Consider the following
> cases:
>
> 1. The host is actually shut down
> 2. The management NIC of the host is unplugged from the network, but the
> host is up and running
>
> There is no clean way to distinguish these cases. CloudStack should only
> mark the host as down in the first case, but I am not sure how one would
> achieve this.
>
> -Koushik
>
> > -----Original Message-----
> > From: Valery Ciareszka [mailto:valery.tereshko@gmail.com]
> > Sent: Friday, July 12, 2013 2:39 PM
> > To: users@cloudstack.apache.org
> > Subject: Re: cs 4.1 host disconnected status
> >
> > I've simulated the crash again and here is the log:
> > http://thesuki.org/temp/cs.log.txt
> > I stripped the GET requests containing API keys from it.
> > The server was switched off at 8:36.
> >
> > On Fri, Jul 12, 2013 at 11:17 AM, Koushik Das <koushik.das@citrix.com> wrote:
> >
> > > Looks like the KVM investigator is not able to determine the state of
> > > the agent. Can you share the full log?
> > >
> > > > -----Original Message-----
> > > > From: Valery Ciareszka [mailto:valery.tereshko@gmail.com]
> > > > Sent: Thursday, July 11, 2013 7:47 PM
> > > > To: users
> > > > Subject: cs 4.1 host disconnected status
> > > >
> > > > Hi all.
> > > >
> > > > I use the following environment: CS 4.1, KVM, CentOS 6.4
> > > > (management+node1+node2), OpenIndiana NFS server as primary and
> > > > secondary storage.
> > > > I have the following problem:
> > > > If I switch one hypervisor node off via IPMI (to simulate a server
> > > > crash), it never goes to Disconnected status in management.
> > > > Accordingly, HA-enabled VMs are not restarted on another hypervisor
> > > > node, because management believes that the dead node is still online.
> > > >
> > > >
> > > > I get the following in the management server logs:
> > > >
> > > > 2013-07-11 10:19:16,153 DEBUG [agent.transport.Request]
> > > > (AgentManager-Handler-13:null) Seq 19-1133189098: Processing:
> > > > { Ans: , MgmtId: 161603152803976, via: 19, Ver: v1, Flags: 10,
> > > > [{"Answer":{"result":false,"details": "Unable to ping computing host,
> > > > exiting","wait":0}}] }
> > > > 2013-07-11 10:19:16,153 DEBUG [agent.transport.Request]
> > > > (AgentTaskPool-1:null) Seq 19-1133189098: Received:  { Ans: , MgmtId:
> > > > 161603152803976, via: 19, Ver: v1, Flags: 10, { Answer } }
> > > > 2013-07-11 10:19:16,153 DEBUG [cloud.ha.AbstractInvestigatorImpl]
> > > > (AgentTaskPool-1:null) host (172.16.20.241) cannot be pinged,
> > > > returning null ('I don't know')
> > > > 2013-07-11 10:19:16,153 DEBUG [cloud.ha.UserVmDomRInvestigator]
> > > > (AgentTaskPool-1:null) could not reach agent, could not reach agent's
> > > > host, returning that we don't have enough information
> > > > 2013-07-11 10:19:16,153 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
> > > > (AgentTaskPool-1:null) null unable to determine the state of the host.
> > > > Moving on.
> > > > 2013-07-11 10:19:16,153 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
> > > > (AgentTaskPool-1:null) null unable to determine the state of the host.
> > > > Moving on.
> > > > 2013-07-11 10:19:16,153 WARN  [agent.manager.AgentManagerImpl]
> > > > (AgentTaskPool-1:null) Agent state cannot be determined, do nothing
> > > >
> > > >
> > > > If I power on the dead node, it goes to state "Connecting" and then
> > > > "Up" in the management interface.
> > > >
> > > > 2013-07-11 13:57:24,311 DEBUG [cloud.host.Status] (Thread-6:null)
> > > > Ping timeout for host 12, do invstigation
> > > > 2013-07-11 13:58:24,315 DEBUG [cloud.host.Status] (Thread-6:null)
> > > > Ping timeout for host 12, do invstigation
> > > > 2013-07-11 13:59:24,320 DEBUG [cloud.host.Status] (Thread-6:null)
> > > > Ping timeout for host 12, do invstigation
> > > > 2013-07-11 13:59:57,239 DEBUG [cloud.host.Status]
> > > > (AgentConnectTaskPool-5:null) Transition:[Resource state = Enabled,
> > > > Agent event = AgentConnected, Host id = 12, name =
> > > > ad112.colobridge.net]
> > > > 2013-07-11 13:59:57,264 DEBUG [cloud.host.Status]
> > > > (AgentConnectTaskPool-5:null) Agent status update: [id = 12; name =
> > > > ad112.colobridge.net; old status = Up; event = AgentConnected; new
> > > > status = Connecting; old update count = 1285; new update count = 1286]
> > > > 2013-07-11 14:00:50,611 DEBUG [cloud.host.Status]
> > > > (AgentConnectTaskPool-5:null) Transition:[Resource state = Enabled,
> > > > Agent event = Ready, Host id = 12, name = ad112.colobridge.net]
> > > > 2013-07-11 14:00:50,633 DEBUG [cloud.host.Status]
> > > > (AgentConnectTaskPool-5:null) Agent status update: [id = 12; name =
> > > > ad112.colobridge.net; old status = Connecting; event = Ready; new
> > > > status = Up; old update count = 1286; new update count = 1287]
> > > >
> > > >
> > > > If I restart the cloud-management service, the dead node goes to state
> > > > "Disconnected" in the management interface.
> > > > (There is nothing special in the logs in this case.)
> > > >
> > > > If I do nothing, the dead node can stay in the "Up" state indefinitely
> > > > (I waited for 12 hours) in the management interface, while the logs
> > > > keep showing "Agent state cannot be determined, do nothing".
> > > >
> > > > I would appreciate it if someone could suggest how to deal with this
> > > > problem.
> > > >
> > > > --
> > > > Regards,
> > > > Valery
> > > >
> > > > http://protocol.by/slayer
> > >
> >
> >
> >
> > --
> > Regards,
> > Valery
> >
> > http://protocol.by/slayer
>



-- 
Regards,
Valery

http://protocol.by/slayer

Re: cs 4.1 host disconnected status

Posted by Koushik Das <ko...@citrix.com>.
Check out https://issues.apache.org/jira/browse/CLOUDSTACK-3535.

-Koushik
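
The log lines quoted earlier in this thread can also be used to spot a host that may be stuck in this state. The sketch below counts "Ping timeout for host N" messages per host id and flags hosts that reach a threshold. The message text is copied from the logs in this thread (including its spelling); the function names and the threshold of 3 are assumptions made for illustration, and each log record is assumed to sit on a single line in the real log file.

```python
import re
from collections import Counter
from typing import Iterable, List

# Message text as it appears in the management server log quoted in this thread.
PING_TIMEOUT = re.compile(r"Ping timeout for host (\d+)")


def count_ping_timeouts(lines: Iterable[str]) -> Counter:
    """Count ping-timeout messages per host id."""
    counts: Counter = Counter()
    for line in lines:
        match = PING_TIMEOUT.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


def suspect_hosts(lines: Iterable[str], threshold: int = 3) -> List[int]:
    """Return host ids with at least `threshold` ping timeouts (hypothetical cutoff)."""
    counts = count_ping_timeouts(lines)
    return sorted(host for host, n in counts.items() if n >= threshold)


sample = [
    "2013-07-11 13:57:24,311 DEBUG [cloud.host.Status] (Thread-6:null) "
    "Ping timeout for host 12, do invstigation",
    "2013-07-11 13:58:24,315 DEBUG [cloud.host.Status] (Thread-6:null) "
    "Ping timeout for host 12, do invstigation",
    "2013-07-11 13:59:24,320 DEBUG [cloud.host.Status] (Thread-6:null) "
    "Ping timeout for host 12, do invstigation",
]
print(suspect_hosts(sample))
```

A flagged host is only a candidate for manual checking; repeated ping timeouts do not by themselves prove that the host is down.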
