Posted to users@cloudstack.apache.org by Valery Ciareszka <va...@gmail.com> on 2013/07/11 16:17:27 UTC

cs 4.1 host disconnected status

Hi all.

My environment: CS 4.1, KVM, CentOS 6.4 (management + node1 + node2), with
an OpenIndiana NFS server as primary and secondary storage.

I have the following problem: if I switch one hypervisor node off via IPMI
(to simulate a server crash), it never goes to Disconnected status in
management. As a result, HA-enabled VMs are not restarted on another
hypervisor node, because the management server believes that the
powered-off node is still online.

I get the following in the management server logs:

2013-07-11 10:19:16,153 DEBUG [agent.transport.Request] (AgentManager-Handler-13:null) Seq 19-1133189098: Processing: { Ans: , MgmtId: 161603152803976, via: 19, Ver: v1, Flags: 10, [{"Answer":{"result":false,"details":"Unable to ping computing host, exiting","wait":0}}] }
2013-07-11 10:19:16,153 DEBUG [agent.transport.Request] (AgentTaskPool-1:null) Seq 19-1133189098: Received: { Ans: , MgmtId: 161603152803976, via: 19, Ver: v1, Flags: 10, { Answer } }
2013-07-11 10:19:16,153 DEBUG [cloud.ha.AbstractInvestigatorImpl] (AgentTaskPool-1:null) host (172.16.20.241) cannot be pinged, returning null ('I don't know')
2013-07-11 10:19:16,153 DEBUG [cloud.ha.UserVmDomRInvestigator] (AgentTaskPool-1:null) could not reach agent, could not reach agent's host, returning that we don't have enough information
2013-07-11 10:19:16,153 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (AgentTaskPool-1:null) null unable to determine the state of the host. Moving on.
2013-07-11 10:19:16,153 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (AgentTaskPool-1:null) null unable to determine the state of the host. Moving on.
2013-07-11 10:19:16,153 WARN  [agent.manager.AgentManagerImpl] (AgentTaskPool-1:null) Agent state cannot be determined, do nothing
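
The log sequence above suggests that each HA investigator is asked in turn,
each answers "I don't know", and the host status is therefore left alone. A
minimal sketch of that behaviour, assuming a simplified model: all function
names below are hypothetical and are not CloudStack's actual Java API.

```python
from typing import Callable, Optional, Sequence

# Hypothetical model: an investigator returns "Up", "Down", or None,
# where None means "I don't know" (as in the log above).
Investigator = Callable[[str], Optional[str]]


def ping_investigator(host: str) -> Optional[str]:
    """The host cannot be pinged, which by itself proves nothing."""
    return None


def vm_router_investigator(host: str) -> Optional[str]:
    """Neither the agent nor the agent's host can be reached."""
    return None


def investigate(host: str, investigators: Sequence[Investigator]) -> Optional[str]:
    """Return the first definite answer, or None if nobody can decide."""
    for investigator in investigators:
        status = investigator(host)
        if status is not None:
            return status
    return None


def next_status(current: str, verdict: Optional[str]) -> str:
    """Keep the current status when the verdict is inconclusive ("do nothing")."""
    return current if verdict is None else verdict


verdict = investigate("172.16.20.241", [ping_investigator, vm_router_investigator])
print(next_status("Up", verdict))  # prints "Up": the dead host is never marked down
```

In this model the host only leaves "Up" when some investigator gives a
definite answer, which matches the symptom described in this post.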


If I power the dead node back on, it goes to state "Connecting" and then
"Up" in the management interface:

2013-07-11 13:57:24,311 DEBUG [cloud.host.Status] (Thread-6:null) Ping timeout for host 12, do invstigation
2013-07-11 13:58:24,315 DEBUG [cloud.host.Status] (Thread-6:null) Ping timeout for host 12, do invstigation
2013-07-11 13:59:24,320 DEBUG [cloud.host.Status] (Thread-6:null) Ping timeout for host 12, do invstigation
2013-07-11 13:59:57,239 DEBUG [cloud.host.Status] (AgentConnectTaskPool-5:null) Transition:[Resource state = Enabled, Agent event = AgentConnected, Host id = 12, name = ad112.colobridge.net]
2013-07-11 13:59:57,264 DEBUG [cloud.host.Status] (AgentConnectTaskPool-5:null) Agent status update: [id = 12; name = ad112.colobridge.net; old status = Up; event = AgentConnected; new status = Connecting; old update count = 1285; new update count = 1286]
2013-07-11 14:00:50,611 DEBUG [cloud.host.Status] (AgentConnectTaskPool-5:null) Transition:[Resource state = Enabled, Agent event = Ready, Host id = 12, name = ad112.colobridge.net]
2013-07-11 14:00:50,633 DEBUG [cloud.host.Status] (AgentConnectTaskPool-5:null) Agent status update: [id = 12; name = ad112.colobridge.net; old status = Connecting; event = Ready; new status = Up; old update count = 1286; new update count = 1287]
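
To follow the host's state changes without reading the whole log, the
"Agent status update" entries can be extracted with a regular expression.
A minimal sketch, assuming each log entry is on a single line; the pattern
is derived only from the two entries shown in this post and may not cover
other formats, and `parse_transitions` is a hypothetical helper name.

```python
import re

# Pattern built from the "Agent status update" entries quoted in this post.
STATUS_UPDATE = re.compile(
    r"Agent status update: \[id = (?P<id>\d+); name = (?P<name>[^;]+); "
    r"old status = (?P<old>\w+); event = (?P<event>\w+); "
    r"new status = (?P<new>\w+);"
)


def parse_transitions(lines):
    """Yield (host id, old status, event, new status) for each status update."""
    for line in lines:
        match = STATUS_UPDATE.search(line)
        if match:
            yield int(match["id"]), match["old"], match["event"], match["new"]


sample = [
    "2013-07-11 13:59:24,320 DEBUG [cloud.host.Status] (Thread-6:null) "
    "Ping timeout for host 12, do invstigation",
    "2013-07-11 13:59:57,264 DEBUG [cloud.host.Status] "
    "(AgentConnectTaskPool-5:null) Agent status update: [id = 12; name = "
    "ad112.colobridge.net; old status = Up; event = AgentConnected; new "
    "status = Connecting; old update count = 1285; new update count = 1286]",
    "2013-07-11 14:00:50,633 DEBUG [cloud.host.Status] "
    "(AgentConnectTaskPool-5:null) Agent status update: [id = 12; name = "
    "ad112.colobridge.net; old status = Connecting; event = Ready; new "
    "status = Up; old update count = 1286; new update count = 1287]",
]

for transition in parse_transitions(sample):
    print(transition)
```

Note that the first transition starts from "Up": the host was never moved
out of that state while it was powered off.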


If I restart the cloud-management service, the dead node goes to state
"Disconnected" in the management interface.
(There is nothing special in the logs in this case.)

If I do nothing, the dead node stays in the "Up" state in the management
interface indefinitely (I waited for 12 hours), while the logs keep showing
"Agent state cannot be determined, do nothing".

I would appreciate it if someone could help or suggest how to deal with
this problem.

-- 
Regards,
Valery

http://protocol.by/slayer

Re: cs 4.1 host disconnected status

Posted by Koushik Das <ko...@citrix.com>.
Check out https://issues.apache.org/jira/browse/CLOUDSTACK-3535.

-Koushik

On 26-Aug-2013, at 7:16 PM, Valery Ciareszka <va...@gmail.com> wrote:

> Koushik,
> 
> Ok, imagine the server is offline (burned CPU, power supply, etc.), and
> there is no way to get the host back online within 1-2 hours.
> However, CS management considers the host to be online.
> What is the proper way to deal with this issue?


Re: cs 4.1 host disconnected status

Posted by Valery Ciareszka <va...@gmail.com>.
Koushik,

Ok, imagine the server is offline (burned CPU, power supply, etc.), and
there is no way to get the host back online within 1-2 hours.
However, CS management considers the host to be online.
What is the proper way to deal with this issue?



On Fri, Jul 12, 2013 at 2:20 PM, Koushik Das <ko...@citrix.com> wrote:

> I looked at the logs, and none of the existing investigators is able to
> determine that the host is down. I am not sure if there is a clean way to
> identify whether a host is down in the case of KVM. Consider the following
> cases:
>
> 1. The host is actually shut down.
> 2. The management NIC of the host is unplugged from the network, but the
> host is up and running.
>
> There is no clean way to distinguish these cases. CloudStack should only
> mark the host as down in the first case, but I am not sure how one would
> achieve this.
>
> -Koushik



-- 
Regards,
Valery

http://protocol.by/slayer

RE: cs 4.1 host disconnected status

Posted by Koushik Das <ko...@citrix.com>.
I looked at the logs, and none of the existing investigators is able to determine that the host is down. I am not sure if there is a clean way to identify whether a host is down in the case of KVM. Consider the following cases:

1. The host is actually shut down.
2. The management NIC of the host is unplugged from the network, but the host is up and running.

There is no clean way to distinguish these cases. CloudStack should only mark the host as down in the first case, but I am not sure how one would achieve this.
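
The ambiguity can be made concrete with a small sketch. It assumes, purely for illustration, that the management server's only evidence is whether the host answers on the management network; the `HostCondition` type and the `observe` function are hypothetical and do not come from CloudStack.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostCondition:
    """Hypothetical ground truth that the management server cannot see directly."""
    powered_on: bool
    mgmt_nic_connected: bool


def observe(condition: HostCondition) -> bool:
    """What the management server sees: does the host answer on the management network?"""
    return condition.powered_on and condition.mgmt_nic_connected


shut_down = HostCondition(powered_on=False, mgmt_nic_connected=True)      # case 1
nic_unplugged = HostCondition(powered_on=True, mgmt_nic_connected=False)  # case 2

# Both cases yield the same observation, so this signal alone cannot
# justify marking the host as down.
print(observe(shut_down), observe(nic_unplugged))  # prints "False False"
```

Telling the two cases apart would need some additional signal that does not depend on the management network.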

-Koushik

> -----Original Message-----
> From: Valery Ciareszka [mailto:valery.tereshko@gmail.com]
> Sent: Friday, July 12, 2013 2:39 PM
> To: users@cloudstack.apache.org
> Subject: Re: cs 4.1 host disconnected status
> 
> I've simulated the crash again, and here is the log:
> http://thesuki.org/temp/cs.log.txt
> I stripped out the GET requests that contained API keys.
> The server was switched off at 8:36.

Re: cs 4.1 host disconnected status

Posted by Valery Ciareszka <va...@gmail.com>.
I've simulated the crash again, and here is the log:
http://thesuki.org/temp/cs.log.txt
I stripped out the GET requests that contained API keys.
The server was switched off at 8:36.

On Fri, Jul 12, 2013 at 11:17 AM, Koushik Das <ko...@citrix.com> wrote:

> Looks like the KVM investigator is not able to determine the state of the
> agent. Can you share the full log?



-- 
Regards,
Valery

http://protocol.by/slayer

RE: cs 4.1 host disconnected status

Posted by Koushik Das <ko...@citrix.com>.
Looks like the KVM investigator is not able to determine the state of the agent. Can you share the full log?
