Posted to dev@cloudstack.apache.org by Paul Angus <pa...@shapeblue.com> on 2013/07/15 09:31:25 UTC

[URGENT] KVM HA - (FW: cs 4.1 host disconnected status)

I bumped this from the user list as we've just come across the same issue.

CloudStack does not react or even change host status when contact is lost with a KVM host.

2013-07-13 17:53:56,695 DEBUG [cloud.ha.AbstractInvestigatorImpl] (AgentTaskPool-1:null) host (10.0.100.51) cannot be pinged, returning null ('I don't know')
2013-07-13 17:53:56,695 DEBUG [cloud.ha.UserVmDomRInvestigator] (AgentTaskPool-1:null) could not reach agent, could not reach agent's host, returning that we don't have enough information
2013-07-13 17:53:56,695 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (AgentTaskPool-1:null) null unable to determine the state of the host.  Moving on.
2013-07-13 17:53:56,695 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (AgentTaskPool-1:null) null unable to determine the state of the host.  Moving on.
2013-07-13 17:53:56,695 WARN  [agent.manager.AgentManagerImpl] (AgentTaskPool-1:null) Agent state cannot be determined, do nothing
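
The log sequence above can be read as a chain of investigators, each of which may answer "up", "down", or "I don't know" (null). What follows is a minimal sketch of that pattern, not CloudStack's actual code: the names `Investigator`, `investigate`, `decide`, and `unsure` are hypothetical, and only the tri-state answer and the "do nothing" outcome are taken from the log.

```python
from typing import Callable, Iterable, Optional

# Hypothetical type: an investigator returns True (host up), False (host down),
# or None ("I don't know"), mirroring the tri-state answer seen in the log.
Investigator = Callable[[str], Optional[bool]]


def investigate(host: str, investigators: Iterable[Investigator]) -> Optional[bool]:
    """Return the first definite answer, or None if every investigator is unsure."""
    for probe in investigators:
        answer = probe(host)
        if answer is not None:
            return answer
    return None


def decide(host: str, investigators: Iterable[Investigator]) -> str:
    """Map the combined answer to an action; None leads to 'do nothing'."""
    answer = investigate(host, investigators)
    if answer is None:
        return "do nothing"
    return "keep host up" if answer else "mark host down and start HA"


def unsure(host: str) -> Optional[bool]:
    # Stand-in for a probe that can reach neither the agent nor its host.
    return None


print(decide("10.0.100.51", [unsure, unsure]))
```

When every probe is unsure, the combined answer stays None and the host status is never touched, which matches the behaviour reported here.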

HA for KVM is almost useless.

I suggest this be treated as a blocker for any release until it is fixed.


Regards,

Paul Angus
S: +44 20 3603 0540 | M: +447711418784 | T: CloudyAngus
paul.angus@shapeblue.com

-----Original Message-----
From: Koushik Das [mailto:koushik.das@citrix.com]
Sent: 12 July 2013 12:21
To: users@cloudstack.apache.org
Subject: RE: cs 4.1 host disconnected status

I looked at the logs, and none of the existing investigators is able to determine that the host is down. I am not sure there is a clean way to identify that a host is down in the case of KVM. Consider the following cases:

1. The host is actually shut down.
2. The management NIC of the host is unplugged from the network, but the host is up and running.

There is no clean way to distinguish these cases. CloudStack should mark the host as down only in the first case, but I am not sure how one would achieve this.

-Koushik
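
A minimal sketch of why the two cases above look identical to the management server, assuming the only available signal is whether the host answers over its management network. The function and parameter names are hypothetical and serve only to illustrate the argument.

```python
def reachable_from_management(host_powered_on: bool, mgmt_nic_connected: bool) -> bool:
    """Hypothetical model: the management server gets an answer only if the host
    is running AND its management NIC is attached to the network."""
    return host_powered_on and mgmt_nic_connected


# Case 1: the host is actually shut down.
case_shutdown = reachable_from_management(host_powered_on=False, mgmt_nic_connected=True)

# Case 2: the host is up and running, but its management NIC is unplugged.
case_isolated = reachable_from_management(host_powered_on=True, mgmt_nic_connected=False)

# Both cases produce the same observation, so this signal alone cannot tell
# the management server whether the VMs on the host are still running.
print(case_shutdown, case_isolated)
```

Telling the cases apart would need a second signal that does not travel over the management network; which signal that should be is left open here.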

> -----Original Message-----
> From: Valery Ciareszka [mailto:valery.tereshko@gmail.com]
> Sent: Friday, July 12, 2013 2:39 PM
> To: users@cloudstack.apache.org
> Subject: Re: cs 4.1 host disconnected status
>
> I've simulated the crash again, and here is the log:
> http://thesuki.org/temp/cs.log.txt
> I stripped out the GET requests containing API keys.
> The server was switched off at 8:36.
>
> On Fri, Jul 12, 2013 at 11:17 AM, Koushik Das <ko...@citrix.com> wrote:
>
> > Looks like the KVM investigator is not able to determine the state
> > of the agent. Can you share the full log?
> >
> > > -----Original Message-----
> > > From: Valery Ciareszka [mailto:valery.tereshko@gmail.com]
> > > Sent: Thursday, July 11, 2013 7:47 PM
> > > To: users
> > > Subject: cs 4.1 host disconnected status
> > >
> > > Hi all.
> > >
> > > > I use the following environment: CS 4.1, KVM, CentOS 6.4
> > > > (management+node1+node2), and an OpenIndiana NFS server as primary
> > > > and secondary storage.
> > > > I have the following problem:
> > > > If I switch one hypervisor node off via IPMI (to simulate a server
> > > > crash), it never goes to Disconnected status in management.
> > > > Accordingly, HA-enabled VMs are not restarted on another hypervisor
> > > > node, because management believes that the disconnected node is
> > > > still online.
> > > >
> > > >
> > > > I get the following in the management server logs:
> > >
> > > > 2013-07-11 10:19:16,153 DEBUG [agent.transport.Request] (AgentManager-Handler-13:null) Seq 19-1133189098: Processing: { Ans: , MgmtId: 161603152803976, via: 19, Ver: v1, Flags: 10, [{"Answer":{"result":false,"details":"Unable to ping computing host, exiting","wait":0}}] }
> > > > 2013-07-11 10:19:16,153 DEBUG [agent.transport.Request] (AgentTaskPool-1:null) Seq 19-1133189098: Received: { Ans: , MgmtId: 161603152803976, via: 19, Ver: v1, Flags: 10, { Answer } }
> > > > 2013-07-11 10:19:16,153 DEBUG [cloud.ha.AbstractInvestigatorImpl] (AgentTaskPool-1:null) host (172.16.20.241) cannot be pinged, returning null ('I don't know')
> > > > 2013-07-11 10:19:16,153 DEBUG [cloud.ha.UserVmDomRInvestigator] (AgentTaskPool-1:null) could not reach agent, could not reach agent's host, returning that we don't have enough information
> > > > 2013-07-11 10:19:16,153 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (AgentTaskPool-1:null) null unable to determine the state of the host.  Moving on.
> > > > 2013-07-11 10:19:16,153 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (AgentTaskPool-1:null) null unable to determine the state of the host.  Moving on.
> > > > 2013-07-11 10:19:16,153 WARN  [agent.manager.AgentManagerImpl] (AgentTaskPool-1:null) Agent state cannot be determined, do nothing
> > >
> > >
> > > > If I power on the dead node, it goes to state "Connecting" and then
> > > > "Up" in the management interface.
> > >
> > > 2013-07-11 13:57:24,311 DEBUG [cloud.host.Status] (Thread-6:null)
> > > Ping timeout for host 12, do invstigation
> > > 2013-07-11 13:58:24,315 DEBUG [cloud.host.Status] (Thread-6:null)
> > > Ping timeout for host 12, do invstigation
> > > 2013-07-11 13:59:24,320 DEBUG [cloud.host.Status] (Thread-6:null)
> > > Ping timeout for host 12, do invstigation
> > > 2013-07-11 13:59:57,239 DEBUG [cloud.host.Status]
> > > (AgentConnectTaskPool-5:null) Transition:[Resource state =
> > > Enabled, Agent event = AgentConnected, Host id = 12, name =
> > > ad112.colobridge.net]
> > > 2013-07-11 13:59:57,264 DEBUG [cloud.host.Status]
> > > (AgentConnectTaskPool-5:null) Agent status update: [id = 12; name
> > > = ad112.colobridge.net; old status = Up; event = AgentConnected;
> > > new
> > status
> > > = Connecting; old update count = 1285; new update count = 1286]
> > > 2013-07-11 14:00:50,611 DEBUG [cloud.host.Status]
> > > (AgentConnectTaskPool-5:null) Transition:[Resource state =
> > > Enabled, Agent event = Ready, Host id = 12, name =
> > > ad112.colobridge.net]
> > > 2013-07-11 14:00:50,633 DEBUG [cloud.host.Status]
> > > (AgentConnectTaskPool-5:null) Agent status update: [id = 12; name
> > > = ad112.colobridge.net; old status = Connecting; event = Ready;
> > > new
> > status =
> > > Up; old update count = 1286; new update count = 1287]
> > >
> > >
> > > > If I restart the cloud-management service, the dead node goes to
> > > > state "Disconnected" in the management interface.
> > > > (There is nothing special in the logs in this case.)
> > > >
> > > > If I do nothing, the dead node can stay in the "Up" state
> > > > indefinitely (I waited for 12 hours) in the management interface,
> > > > while the logs keep showing "Agent state cannot be determined, do
> > > > nothing".
> > > >
> > > > I would appreciate it if someone could help or suggest how to deal
> > > > with this problem.
> > >
> > > --
> > > Regards,
> > > Valery
> > >
> > > http://protocol.by/slayer
> >
>
>
>
> --
> Regards,
> Valery
>
> http://protocol.by/slayer
This email and any attachments to it may be confidential and are intended solely for the use of the individual to whom it is addressed. Any views or opinions expressed are solely those of the author and do not necessarily represent those of Shape Blue Ltd or related companies. If you are not the intended recipient of this email, you must neither take any action based upon its contents, nor copy or show it to anyone. Please contact the sender if you believe you have received this email in error. Shape Blue Ltd is a company incorporated in England & Wales. ShapeBlue Services India LLP is operated under license from Shape Blue Ltd. ShapeBlue is a registered trademark.

RE: [URGENT] KVM HA - (FW: cs 4.1 host disconnected status)

Posted by Paul Angus <pa...@shapeblue.com>.
Bug ID: CLOUDSTACK-3535

https://issues.apache.org/jira/browse/CLOUDSTACK-3535


Regards,

Paul Angus
S: +44 20 3603 0540 | M: +447711418784 | T: CloudyAngus
paul.angus@shapeblue.com

-----Original Message-----
From: Joe Brockmeier [mailto:jzb@zonker.net]
Sent: 15 July 2013 15:32
To: dev@cloudstack.apache.org
Subject: Re: [URGENT] KVM HA - (FW: cs 4.1 host disconnected status)

Hi Paul,

What's the bug ID for this so we can track it properly?

Thanks!

Joe


Best,

jzb
--
Joe Brockmeier
jzb@zonker.net
Twitter: @jzb
http://www.dissociatedpress.net/



Re: [URGENT] KVM HA - (FW: cs 4.1 host disconnected status)

Posted by Joe Brockmeier <jz...@zonker.net>.
Hi Paul,

What's the bug ID for this so we can track it properly?

Thanks!

Joe


Best,

jzb
-- 
Joe Brockmeier
jzb@zonker.net
Twitter: @jzb
http://www.dissociatedpress.net/

RE: [URGENT] KVM HA - (FW: cs 4.1 host disconnected status)

Posted by Paul Angus <pa...@shapeblue.com>.
I thought you guys did the impossible - tricky should be a walk in the park....

Seriously though, HA is surely a fundamental feature of CloudStack and has to work.

The host is shown as UP when it should at least be DISCONNECTED.
Also, if the host that dies has system VMs on it, CloudStack believes all is well and makes no attempt to restart them elsewhere, causing much more widespread issues.

Pings fail, so you know it's not just the agent that has crashed.

Regards,

Paul Angus
S: +44 20 3603 0540 | M: +447711418784 | T: CloudyAngus
paul.angus@shapeblue.com

-----Original Message-----
From: Chiradeep Vittal [mailto:Chiradeep.Vittal@citrix.com]
Sent: 15 July 2013 11:21
To: dev@cloudstack.apache.org
Subject: Re: [URGENT] KVM HA - (FW: cs 4.1 host disconnected status)

Indeed, HA is very tricky, as you note. In the generic case where the MS cannot communicate with the agent, nothing can be concluded and the MS does nothing.
I dug this up and posted it to the wiki:
https://cwiki.apache.org/confluence/x/dwn8AQ
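
As a rough sketch of the rule described above (no conclusion means no action), combined with the fencing concern raised in the quoted message below, an HA restart could be gated as follows. All names here (`HostVerdict`, `may_restart_vms_elsewhere`, `host_fenced`) are hypothetical and do not come from the CloudStack code base.

```python
from enum import Enum


class HostVerdict(Enum):
    UP = "up"
    DOWN = "down"
    UNKNOWN = "unknown"  # the "I don't know" answer


def may_restart_vms_elsewhere(verdict: HostVerdict, host_fenced: bool) -> bool:
    """Allow an HA restart only when the host cannot still be running the VMs."""
    if verdict is HostVerdict.DOWN:
        return True
    # An unreachable host may still be running its VMs. Restarting them
    # elsewhere is safe only if the host has been fenced. Fencing is an
    # assumption of this sketch: the thread notes that the management
    # server has no fencing capability.
    if verdict is HostVerdict.UNKNOWN and host_fenced:
        return True
    return False


print(may_restart_vms_elsewhere(HostVerdict.UNKNOWN, host_fenced=False))
```

Without a fencing mechanism, `host_fenced` is always false, so an unknown verdict never triggers a restart. That is the conservative behaviour the logs show.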


On 7/15/13 1:20 PM, "Marcus Sorensen" <sh...@gmail.com> wrote:

>I don't know much about HA in regards to management server/agent
>connectivity, but it seems to me like this is perilous ground.  If a
>host loses connection with the management server, it seems to me that
>the management server doesn't have the resources to determine whether
>it should start HA-enabled VMs elsewhere. You could very well end up
>with VMs running in two or three places at once, corrupting them, just
>because a host failed to check in. Maybe the agent was stopped (that
>happens all the time). The management server has no fencing capability,
>hence the "I don't know, doing nothing" messages reflect the correct
>behavior. That doesn't seem KVM-specific, however.
>
>I'm very interested in hearing the details on how this HA was intended
>to work, or how it might be working on other platforms.  One solution
>may be to leverage the secondary storage to create locks for VMs, then
>again, when VMs can run without the agent it seems prone to deadlock
>(how does another node take over when another host has the lock, but
>the host seems down, but is actually running the vm?).
>
>On Mon, Jul 15, 2013 at 1:31 AM, Paul Angus <pa...@shapeblue.com>
>wrote:
>> I bumped this from the user list as we've just come across the same
>>issue.
>>
>> CloudStack does not react or even change host status when contact is
>>lost with a KVM host.
>>
>> 2013-07-13 17:53:56,695 DEBUG [cloud.ha.AbstractInvestigatorImpl]
>>(AgentTaskPool-1:null) host (10.0.100.51) cannot be pinged, returning
>>null ('I don't know')
>> 2013-07-13 17:53:56,695 DEBUG [cloud.ha.UserVmDomRInvestigator]
>>(AgentTaskPool-1:null) could not reach agent, could not reach agent's
>>host, returning that we don't have enough information
>> 2013-07-13 17:53:56,695 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
>>(AgentTaskPool-1:null) null unable to determine the state of the host.
>>Moving on.
>> 2013-07-13 17:53:56,695 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
>>(AgentTaskPool-1:null) null unable to determine the state of the host.
>>Moving on.
>> 2013-07-13 17:53:56,695 WARN  [agent.manager.AgentManagerImpl]
>>(AgentTaskPool-1:null) Agent state cannot be determined, do nothing
>>
>> HA for KVM is almost useless.
>>
>> I suggest this a blocker for any release until fixed.
>>
>>
>> Regards,
>>
>> Paul Angus
>> S: +44 20 3603 0540 | M: +447711418784 | T: CloudyAngus
>> paul.angus@shapeblue.com
>>
>> -----Original Message-----
>> From: Koushik Das [mailto:koushik.das@citrix.com]
>> Sent: 12 July 2013 12:21
>> To: users@cloudstack.apache.org
>> Subject: RE: cs 4.1 host disconnected status
>>
>> I looked at the logs and none of the existing investigators are able
>>to determine that the host is down. I am not sure if there is a clean
>>way to identify if a host is down in case of KVM. Consider the
>>following
>>cases:
>>
>> 1. Host is actually shutdown
>> 2. Management nic of the host is plugged out of the network but host
>>is up and running
>>
>> There is no clean way to distinguish these cases. Cloudstack should
>>only mark the host as down in the first case. But not sure how one
>>would achieve this.
>>
>> -Koushik
>>
>>> -----Original Message-----
>>> From: Valery Ciareszka [mailto:valery.tereshko@gmail.com]
>>> Sent: Friday, July 12, 2013 2:39 PM
>>> To: users@cloudstack.apache.org
>>> Subject: Re: cs 4.1 host disconnected status
>>>
>>> I've simulated crash again and here is the log:
>>> http://thesuki.org/temp/cs.log.txt
>>> I stripped out of there GET requests with api keys.
>>> Server was switched off at 8:36
>>>
>>> On Fri, Jul 12, 2013 at 11:17 AM, Koushik Das
>>><ko...@citrix.com>wrote:
>>>
>>> > Looks like the KVM investigator is not able to determine the state
>>> > of the agent. Can you share the full log?
>>> >
>>> > > -----Original Message-----
>>> > > From: Valery Ciareszka [mailto:valery.tereshko@gmail.com]
>>> > > Sent: Thursday, July 11, 2013 7:47 PM
>>> > > To: users
>>> > > Subject: cs 4.1 host disconnected status
>>> > >
>>> > > Hi all.
>>> > >
>>> > > I use the following environment: CS 4.1, KVM, Centos 6.4
>>> > > (management+node1+node2), OpenIndiana NFS server as primary and
>>> > > secondary storage.
>>> > > and I have the following problem:
>>> > > If I switch one hypervisor node off via ipmi (simulate server
>>> > > crash), it
>>> > never
>>> > > goes to Disconnected status in management. Accordingly,
>>> > > ha-enabled VMs are not restarted on another hypervisor node,
>>> > > because it believes that disconnected node is still online.
>>> > >
>>> > >
>>> > > I get following in management server logs:
>>> > >
>>> > > 2013-07-11 10:19:16,153 DEBUG [agent.transport.Request] (AgentManager-Handler-13:null) Seq 19-1133189098: Processing:  { Ans: , MgmtId: 161603152803976, via: 19, Ver: v1, Flags: 10, [{"Answer":{"result":false,"details":"Unable to ping computing host, exiting","wait":0}}] }
>>> > > 2013-07-11 10:19:16,153 DEBUG [agent.transport.Request] (AgentTaskPool-1:null) Seq 19-1133189098: Received:  { Ans: , MgmtId: 161603152803976, via: 19, Ver: v1, Flags: 10, { Answer } }
>>> > > 2013-07-11 10:19:16,153 DEBUG [cloud.ha.AbstractInvestigatorImpl] (AgentTaskPool-1:null) host (172.16.20.241) cannot be pinged, returning null ('I don't know')
>>> > > 2013-07-11 10:19:16,153 DEBUG [cloud.ha.UserVmDomRInvestigator] (AgentTaskPool-1:null) could not reach agent, could not reach agent's host, returning that we don't have enough information
>>> > > 2013-07-11 10:19:16,153 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (AgentTaskPool-1:null) null unable to determine the state of the host.  Moving on.
>>> > > 2013-07-11 10:19:16,153 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (AgentTaskPool-1:null) null unable to determine the state of the host.  Moving on.
>>> > > 2013-07-11 10:19:16,153 WARN  [agent.manager.AgentManagerImpl] (AgentTaskPool-1:null) Agent state cannot be determined, do nothing
>>> > >
>>> > >
>>> > > If I power on dead node, it goes to state "Connecting" and then
>>> > > "Up" in management interface.
>>> > >
>>> > > 2013-07-11 13:57:24,311 DEBUG [cloud.host.Status] (Thread-6:null) Ping timeout for host 12, do invstigation
>>> > > 2013-07-11 13:58:24,315 DEBUG [cloud.host.Status] (Thread-6:null) Ping timeout for host 12, do invstigation
>>> > > 2013-07-11 13:59:24,320 DEBUG [cloud.host.Status] (Thread-6:null) Ping timeout for host 12, do invstigation
>>> > > 2013-07-11 13:59:57,239 DEBUG [cloud.host.Status] (AgentConnectTaskPool-5:null) Transition:[Resource state = Enabled, Agent event = AgentConnected, Host id = 12, name = ad112.colobridge.net]
>>> > > 2013-07-11 13:59:57,264 DEBUG [cloud.host.Status] (AgentConnectTaskPool-5:null) Agent status update: [id = 12; name = ad112.colobridge.net; old status = Up; event = AgentConnected; new status = Connecting; old update count = 1285; new update count = 1286]
>>> > > 2013-07-11 14:00:50,611 DEBUG [cloud.host.Status] (AgentConnectTaskPool-5:null) Transition:[Resource state = Enabled, Agent event = Ready, Host id = 12, name = ad112.colobridge.net]
>>> > > 2013-07-11 14:00:50,633 DEBUG [cloud.host.Status] (AgentConnectTaskPool-5:null) Agent status update: [id = 12; name = ad112.colobridge.net; old status = Connecting; event = Ready; new status = Up; old update count = 1286; new update count = 1287]
>>> > >
>>> > >
>>> > > If I restart cloud-management service, dead node goes to state
>>> > > "Disconnected" in management interface.
>>> > > (there is nothing special in logs in this case)
>>> > >
>>> > > If I do nothing, dead node could stay in "Up" state forever (I
>>> > > waited for 12 hours) in management interface, throwing into logs
>>> > > "Agent state cannot be determined, do nothing"
>>> > >
>>> > > Would appreciate if someone could help/suggest how to deal with
>>> > > this problem.
>>> > >
>>> > > --
>>> > > Regards,
>>> > > Valery
>>> > >
>>> > > http://protocol.by/slayer
>>> >
>>>
>>>
>>>
>>> --
>>> Regards,
>>> Valery
>>>
>>> http://protocol.by/slayer




Re: [URGENT] KVM HA - (FW: cs 4.1 host disconnected status)

Posted by Marcus Sorensen <sh...@gmail.com>.
By the way, I'm aware that KVM has a heartbeat function in the agent, but
that only works for NFS primary storage. Maybe the secondary storage could
have a similar function that keeps track of running guests per host...
It would still rely on the agent not having died while the host is still
up; otherwise stale heartbeats would leave HA VMs running in multiple
locations.
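
To make the heartbeat idea concrete, here is a rough sketch in Python, not CloudStack's actual agent code. The file layout (`hb-<host>` in a shared directory), the function names, and the timeout are all assumptions for illustration. Note that a stale heartbeat maps to "unknown" rather than "down", since a stopped agent on a running host looks exactly the same.

```python
import os
import time
from enum import Enum


class HostState(Enum):
    ALIVE = "alive"
    UNKNOWN = "unknown"   # stale or missing heartbeat: cannot conclude "down"


def write_heartbeat(directory, host, now=None):
    """Agent side (hypothetical): record a timestamp for this host on shared storage."""
    ts = time.time() if now is None else now
    path = os.path.join(directory, f"hb-{host}")
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(str(ts))
    os.replace(tmp, path)  # rename into place so readers never see a partial file
    return path


def check_heartbeat(directory, host, timeout, now=None):
    """Management side (hypothetical): classify a host by heartbeat age."""
    now = time.time() if now is None else now
    path = os.path.join(directory, f"hb-{host}")
    try:
        with open(path) as f:
            last = float(f.read().strip())
    except (FileNotFoundError, ValueError):
        return HostState.UNKNOWN
    # A fresh heartbeat proves the host is up; a stale one proves nothing,
    # because the agent may have died while the guests keep running.
    return HostState.ALIVE if now - last <= timeout else HostState.UNKNOWN
```

The point of returning only ALIVE or UNKNOWN is that the heartbeat alone never justifies restarting guests elsewhere.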
On Jul 15, 2013 1:50 AM, "Marcus Sorensen" <sh...@gmail.com> wrote:

> I don't know much about HA in regards to management server/agent
> connectivity, but it seems to me like this is perilous ground.  If a
> host loses connection with the management server, it seems to me that
> the management server doesn't have the resources to determine whether
> it should start HA-enabled VMs elsewhere. You could very well end up
> with VMs running in two or three places at once, corrupting them, just
> because a host failed to check in. Maybe the agent was stopped (that
> happens all the time). The management server has no fencing
> capability, so the "I don't know, doing nothing" messages are the
> correct behavior. That doesn't seem like it's KVM specific,
> however.
>
> I'm very interested in hearing the details on how this HA was intended
> to work, or how it might be working on other platforms.  One solution
> may be to leverage the secondary storage to create locks for VMs, then
> again, when VMs can run without the agent it seems prone to deadlock
> (how does another node take over when another host has the lock, but
> the host seems down, but is actually running the vm?).
>

Re: [URGENT] KVM HA - (FW: cs 4.1 host disconnected status)

Posted by Marcus Sorensen <sh...@gmail.com>.
My strong preference would be to avoid any cluster locking libraries
or similar on the agent side, if possible. I've just seen too many
clustering products that are brittle and easily deadlock-able, where
you end up having to reboot *everything* if something goes wrong on
one host.

It should be fairly straightforward to build IPMI fencing, with some
sort of configurable timeout, into the management server. Haven't
heard from that host in 60 seconds? Power it off, move on.  We
could also track which storage types are 'HA compatible', and VMs on
those storage types would roll over to other hosts automatically
rather than getting stuck in limbo. For example, our storage plugin,
upon VM startup, communicates with the SAN and revokes access to
everyone but the server currently starting the VM. I gather the
SolidFire plugin does the same. I believe access control is built
into the new pluggable storage architecture and implementable by
anyone.
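
Roughly what I have in mind, as a Python sketch rather than anything that exists in the management server today. The `Host` record, `plan_failover`, and the `fence` callable are hypothetical names; in practice `fence` might wrap something like `ipmitool ... chassis power off` against the host's BMC, which is an assumption here and not shown.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Host:
    name: str
    last_seen: float      # time of last contact with the agent, in seconds
    ha_vms: List[str]     # HA-enabled guests last known to run on this host


def plan_failover(hosts: List[Host], now: float, timeout: float,
                  fence: Callable[[Host], bool]) -> List[str]:
    """Return the guests that are safe to restart on other hosts.

    A host's guests are released only when the host has been silent for
    longer than `timeout` AND `fence(host)` reports that it was powered off.
    If fencing fails, nothing is restarted, which avoids running the same
    guest in two places.
    """
    restart: List[str] = []
    for host in hosts:
        if now - host.last_seen <= timeout:
            continue                  # heard from it recently, leave it alone
        if fence(host):               # hypothetical IPMI power-off succeeded
            restart.extend(host.ha_vms)
    return restart
```

The ordering is the whole point: fence first, restart second, and do nothing if the fence result is unknown.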


Re: [URGENT] KVM HA - (FW: cs 4.1 host disconnected status)

Posted by Marcus Sorensen <sh...@gmail.com>.
For OpenStack, look at the current state of "evacuate".

http://www.mirantis.com/blog/cloud-prizefight-vmware-vs-openstack/

"there is no official support for VM-level HA in OpenStack—it was initially
planned for the Folsom release but was later dropped/postponed. There is
currently an incubation project called Evacuate that is adding support for
VM-level HA to OpenStack."
On Jul 15, 2013 7:25 AM, "Shanker Balan" <sh...@shapeblue.com>
wrote:

>  On 15-Jul-2013, at 12:03 PM, Chiradeep Vittal <
> Chiradeep.Vittal@citrix.com> wrote:
>
> A robust solution would probably involve Apache Zookeeper (using Curator
> perhaps) to perform robust distributed locking and/or leader election.
>
>
>
>  Just curious - Any idea as to how OpenStack deals with a failed KVM host
> in a cluster?
>
>
>
> On 7/15/13 3:51 PM, "Chiradeep Vittal" <Ch...@citrix.com>
> wrote:
>
> Indeed HA is very tricky as you note. In the generic case where the MS
> cannot communicate with the agent, nothing can be concluded and the MS
> does nothing.
> I dug this up and posted it to the wiki
> https://cwiki.apache.org/confluence/x/dwn8AQ
>
>
> --
> Shanker Balan
> Managing Consultant
>
>
>
>  M: +91 98860 60539
>  shanker.balan@shapeblue.com | www.shapeblue.com | Twitter:@shapeblue
>  ShapeBlue India, 22nd floor, Unit 2201A, World Trade Centre, Bangalore -
> 560 055
>
>

Re: [URGENT] KVM HA - (FW: cs 4.1 host disconnected status)

Posted by Shanker Balan <sh...@shapeblue.com>.
On 15-Jul-2013, at 12:03 PM, Chiradeep Vittal <Ch...@citrix.com> wrote:

> A robust solution would probably involve Apache Zookeeper (using Curator
> perhaps) to perform robust distributed locking and/or leader election.


Just curious - Any idea as to how OpenStack deals with a failed KVM host in a cluster?



On 7/15/13 3:51 PM, "Chiradeep Vittal" <Ch...@citrix.com> wrote:

Indeed HA is very tricky as you note. In the generic case where the MS
cannot communicate with the agent, nothing can be concluded and the MS
does nothing.
I dug this up and posted it to the wiki
https://cwiki.apache.org/confluence/x/dwn8AQ


On 7/15/13 1:20 PM, "Marcus Sorensen" <sh...@gmail.com> wrote:

I don't know much about HA in regards to management server/agent
connectivity, but it seems to me like this is perilous ground.  If a
host loses connection with the management server, it seems to me that
the management server doesn't have the resources to determine whether
it should start HA-enabled VMs elsewhere. You could very well end up
with VMs running in two or three places at once, corrupting them, just
because a host failed to check in. Maybe the agent was stopped (that
happens all the time). The management server has no fencing
capability, hence the messages "I don't know, doing nothing", are the
correct thing to do. That doesn't seem like it's KVM specific,
however.

I'm very interested in hearing the details on how this HA was intended
to work, or how it might be working on other platforms.  One solution
may be to leverage the secondary storage to create locks for VMs, then
again, when VMs can run without the agent it seems prone to deadlock
(how does another node take over when another host has the lock, but
the host seems down, but is actually running the vm?).

On Mon, Jul 15, 2013 at 1:31 AM, Paul Angus <pa...@shapeblue.com>
wrote:
I bumped this from the user list as we've just come across the same
issue.

CloudStack does not react or even change host status when contact is
lost with a KVM host.

2013-07-13 17:53:56,695 DEBUG [cloud.ha.AbstractInvestigatorImpl] (AgentTaskPool-1:null) host (10.0.100.51) cannot be pinged, returning null ('I don't know')
2013-07-13 17:53:56,695 DEBUG [cloud.ha.UserVmDomRInvestigator] (AgentTaskPool-1:null) could not reach agent, could not reach agent's host, returning that we don't have enough information
2013-07-13 17:53:56,695 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (AgentTaskPool-1:null) null unable to determine the state of the host.  Moving on.
2013-07-13 17:53:56,695 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (AgentTaskPool-1:null) null unable to determine the state of the host.  Moving on.
2013-07-13 17:53:56,695 WARN  [agent.manager.AgentManagerImpl] (AgentTaskPool-1:null) Agent state cannot be determined, do nothing

HA for KVM is almost useless.

I suggest this a blocker for any release until fixed.


Regards,

Paul Angus
S: +44 20 3603 0540 | M: +447711418784 | T: CloudyAngus
paul.angus@shapeblue.com

-----Original Message-----
From: Koushik Das [mailto:koushik.das@citrix.com]
Sent: 12 July 2013 12:21
To: users@cloudstack.apache.org
Subject: RE: cs 4.1 host disconnected status

I looked at the logs and none of the existing investigators are able to
determine that the host is down. I am not sure if there is a clean way
to identify if a host is down in case of KVM. Consider the following
cases:

1. Host is actually shutdown
2. Management nic of the host is plugged out of the network but host is
up and running

There is no clean way to distinguish these cases. Cloudstack should
only mark the host as down in the first case. But not sure how one would
achieve this.

-Koushik

-----Original Message-----
From: Valery Ciareszka [mailto:valery.tereshko@gmail.com]
Sent: Friday, July 12, 2013 2:39 PM
To: users@cloudstack.apache.org
Subject: Re: cs 4.1 host disconnected status

I've simulated crash again and here is the log:
http://thesuki.org/temp/cs.log.txt
I stripped out of there GET requests with api keys.
Server was switched off at 8:36

On Fri, Jul 12, 2013 at 11:17 AM, Koushik Das
<ko...@citrix.com>wrote:

Looks like the KVM investigator is not able to determine the state
of the agent. Can you share the full log?

-----Original Message-----
From: Valery Ciareszka [mailto:valery.tereshko@gmail.com]
Sent: Thursday, July 11, 2013 7:47 PM
To: users
Subject: cs 4.1 host disconnected status

Hi all.

I use the following environment: CS 4.1, KVM, Centos 6.4
(management+node1+node2), OpenIndiana NFS server as primary and
secondary storage.
and I have the following problem:
If I switch one hypervisor node off via ipmi (simulate server
crash), it never goes to Disconnected status in management.
Accordingly, ha-enabled VMs are not restarted on another hypervisor
node, because it believes that disconnected node is still online.


I get following in management server logs:

2013-07-11 10:19:16,153 DEBUG [agent.transport.Request] (AgentManager-Handler-13:null) Seq 19-1133189098: Processing:  { Ans: , MgmtId: 161603152803976, via: 19, Ver: v1, Flags: 10, [{"Answer":{"result":false,"details":"Unable to ping computing host, exiting","wait":0}}] }
2013-07-11 10:19:16,153 DEBUG [agent.transport.Request] (AgentTaskPool-1:null) Seq 19-1133189098: Received:  { Ans: , MgmtId: 161603152803976, via: 19, Ver: v1, Flags: 10, { Answer } }
2013-07-11 10:19:16,153 DEBUG [cloud.ha.AbstractInvestigatorImpl] (AgentTaskPool-1:null) host (172.16.20.241) cannot be pinged, returning null ('I don't know')
2013-07-11 10:19:16,153 DEBUG [cloud.ha.UserVmDomRInvestigator] (AgentTaskPool-1:null) could not reach agent, could not reach agent's host, returning that we don't have enough information
2013-07-11 10:19:16,153 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (AgentTaskPool-1:null) null unable to determine the state of the host.  Moving on.
2013-07-11 10:19:16,153 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (AgentTaskPool-1:null) null unable to determine the state of the host.  Moving on.
2013-07-11 10:19:16,153 WARN  [agent.manager.AgentManagerImpl] (AgentTaskPool-1:null) Agent state cannot be determined, do nothing


If I power on dead node, it goes to state "Connecting" and then "Up"
in management interface.

2013-07-11 13:57:24,311 DEBUG [cloud.host.Status] (Thread-6:null) Ping timeout for host 12, do invstigation
2013-07-11 13:58:24,315 DEBUG [cloud.host.Status] (Thread-6:null) Ping timeout for host 12, do invstigation
2013-07-11 13:59:24,320 DEBUG [cloud.host.Status] (Thread-6:null) Ping timeout for host 12, do invstigation
2013-07-11 13:59:57,239 DEBUG [cloud.host.Status] (AgentConnectTaskPool-5:null) Transition:[Resource state = Enabled, Agent event = AgentConnected, Host id = 12, name = ad112.colobridge.net]
2013-07-11 13:59:57,264 DEBUG [cloud.host.Status] (AgentConnectTaskPool-5:null) Agent status update: [id = 12; name = ad112.colobridge.net; old status = Up; event = AgentConnected; new status = Connecting; old update count = 1285; new update count = 1286]
2013-07-11 14:00:50,611 DEBUG [cloud.host.Status] (AgentConnectTaskPool-5:null) Transition:[Resource state = Enabled, Agent event = Ready, Host id = 12, name = ad112.colobridge.net]
2013-07-11 14:00:50,633 DEBUG [cloud.host.Status] (AgentConnectTaskPool-5:null) Agent status update: [id = 12; name = ad112.colobridge.net; old status = Connecting; event = Ready; new status = Up; old update count = 1286; new update count = 1287]


If I restart cloud-management service, dead node goes to state
"Disconnected" in management interface.
(there is nothing special in logs in this case)

If I do nothing, dead node could stay in "Up" state forever (I waited
for 12 hours) in management interface, throwing into logs "Agent state
cannot be determined, do nothing"

Would appreciate if someone could help/suggest how to deal with
this problem.

--
Regards,
Valery

http://protocol.by/slayer




--
Regards,
Valery

http://protocol.by/slayer




--
Shanker Balan
Managing Consultant


M: +91 98860 60539
shanker.balan@shapeblue.com | www.shapeblue.com | Twitter:@shapeblue
ShapeBlue India, 22nd floor, Unit 2201A, World Trade Centre, Bangalore - 560 055


Re: [URGENT] KVM HA - (FW: cs 4.1 host disconnected status)

Posted by Chiradeep Vittal <Ch...@citrix.com>.
A robust solution would probably involve Apache Zookeeper (using Curator
perhaps) to perform robust distributed locking and/or leader election.

On 7/15/13 3:51 PM, "Chiradeep Vittal" <Ch...@citrix.com> wrote:

>Indeed HA is very tricky as you note. In the generic case where the MS
>cannot communicate with the agent, nothing can be concluded and the MS
>does nothing.
>I dug this up and posted it to the wiki
>https://cwiki.apache.org/confluence/x/dwn8AQ
>
>
>On 7/15/13 1:20 PM, "Marcus Sorensen" <sh...@gmail.com> wrote:
>
>>I don't know much about HA in regards to management server/agent
>>connectivity, but it seems to me like this is perilous ground.  If a
>>host loses connection with the management server, it seems to me that
>>the management server doesn't have the resources to determine whether
>>it should start HA-enabled VMs elsewhere. You could very well end up
>>with VMs running in two or three places at once, corrupting them, just
>>because a host failed to check in. Maybe the agent was stopped (that
>>happens all the time). The management server has no fencing
>>capability, hence the messages "I don't know, doing nothing", are the
>>correct thing to do. That doesn't seem like it's KVM specific,
>>however.
>>
>>I'm very interested in hearing the details on how this HA was intended
>>to work, or how it might be working on other platforms.  One solution
>>may be to leverage the secondary storage to create locks for VMs, then
>>again, when VMs can run without the agent it seems prone to deadlock
>>(how does another node take over when another host has the lock, but
>>the host seems down, but is actually running the vm?).
>>
>>On Mon, Jul 15, 2013 at 1:31 AM, Paul Angus <pa...@shapeblue.com>
>>wrote:
>>> I bumped this from the user list as we've just come across the same
>>>issue.
>>>
>>> CloudStack does not react or even change host status when contact is
>>>lost with a KVM host.
>>>
>>> 2013-07-13 17:53:56,695 DEBUG [cloud.ha.AbstractInvestigatorImpl] (AgentTaskPool-1:null) host (10.0.100.51) cannot be pinged, returning null ('I don't know')
>>> 2013-07-13 17:53:56,695 DEBUG [cloud.ha.UserVmDomRInvestigator] (AgentTaskPool-1:null) could not reach agent, could not reach agent's host, returning that we don't have enough information
>>> 2013-07-13 17:53:56,695 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (AgentTaskPool-1:null) null unable to determine the state of the host.  Moving on.
>>> 2013-07-13 17:53:56,695 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (AgentTaskPool-1:null) null unable to determine the state of the host.  Moving on.
>>> 2013-07-13 17:53:56,695 WARN  [agent.manager.AgentManagerImpl] (AgentTaskPool-1:null) Agent state cannot be determined, do nothing
>>>
>>> HA for KVM is almost useless.
>>>
>>> I suggest this a blocker for any release until fixed.
>>>
>>>
>>> Regards,
>>>
>>> Paul Angus
>>> S: +44 20 3603 0540 | M: +447711418784 | T: CloudyAngus
>>> paul.angus@shapeblue.com
>>>
>>> -----Original Message-----
>>> From: Koushik Das [mailto:koushik.das@citrix.com]
>>> Sent: 12 July 2013 12:21
>>> To: users@cloudstack.apache.org
>>> Subject: RE: cs 4.1 host disconnected status
>>>
>>> I looked at the logs and none of the existing investigators are able to
>>>determine that the host is down. I am not sure if there is a clean way
>>>to identify if a host is down in case of KVM. Consider the following
>>>cases:
>>>
>>> 1. Host is actually shutdown
>>> 2. Management nic of the host is plugged out of the network but host is
>>>up and running
>>>
>>> There is no clean way to distinguish these cases. Cloudstack should
>>>only mark the host as down in the first case. But not sure how one would
>>>achieve this.
>>>
>>> -Koushik
>>>
>>>> -----Original Message-----
>>>> From: Valery Ciareszka [mailto:valery.tereshko@gmail.com]
>>>> Sent: Friday, July 12, 2013 2:39 PM
>>>> To: users@cloudstack.apache.org
>>>> Subject: Re: cs 4.1 host disconnected status
>>>>
>>>> I've simulated crash again and here is the log:
>>>> http://thesuki.org/temp/cs.log.txt
>>>> I stripped out of there GET requests with api keys.
>>>> Server was switched off at 8:36
>>>>
>>>> On Fri, Jul 12, 2013 at 11:17 AM, Koushik Das
>>>><ko...@citrix.com>wrote:
>>>>
>>>> > Looks like the KVM investigator is not able to determine the state
>>>> > of the agent. Can you share the full log?
>>>> >
>>>> > > -----Original Message-----
>>>> > > From: Valery Ciareszka [mailto:valery.tereshko@gmail.com]
>>>> > > Sent: Thursday, July 11, 2013 7:47 PM
>>>> > > To: users
>>>> > > Subject: cs 4.1 host disconnected status
>>>> > >
>>>> > > Hi all.
>>>> > >
>>>> > > I use the following environment: CS 4.1, KVM, Centos 6.4
>>>> > > (management+node1+node2), OpenIndiana NFS server as primary and
>>>> > > secondary storage.
>>>> > > and I have the following problem:
>>>> > > If I switch one hypervisor node off via ipmi (simulate server
>>>> > > crash), it never goes to Disconnected status in management.
>>>> > > Accordingly, ha-enabled VMs are not restarted on another hypervisor
>>>> > > node, because it believes that disconnected node is still online.
>>>> > >
>>>> > >
>>>> > > I get following in management server logs:
>>>> > >
>>>> > > 2013-07-11 10:19:16,153 DEBUG [agent.transport.Request] (AgentManager-Handler-13:null) Seq 19-1133189098: Processing:  { Ans: , MgmtId: 161603152803976, via: 19, Ver: v1, Flags: 10, [{"Answer":{"result":false,"details":"Unable to ping computing host, exiting","wait":0}}] }
>>>> > > 2013-07-11 10:19:16,153 DEBUG [agent.transport.Request] (AgentTaskPool-1:null) Seq 19-1133189098: Received:  { Ans: , MgmtId: 161603152803976, via: 19, Ver: v1, Flags: 10, { Answer } }
>>>> > > 2013-07-11 10:19:16,153 DEBUG [cloud.ha.AbstractInvestigatorImpl] (AgentTaskPool-1:null) host (172.16.20.241) cannot be pinged, returning null ('I don't know')
>>>> > > 2013-07-11 10:19:16,153 DEBUG [cloud.ha.UserVmDomRInvestigator] (AgentTaskPool-1:null) could not reach agent, could not reach agent's host, returning that we don't have enough information
>>>> > > 2013-07-11 10:19:16,153 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (AgentTaskPool-1:null) null unable to determine the state of the host.  Moving on.
>>>> > > 2013-07-11 10:19:16,153 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (AgentTaskPool-1:null) null unable to determine the state of the host.  Moving on.
>>>> > > 2013-07-11 10:19:16,153 WARN  [agent.manager.AgentManagerImpl] (AgentTaskPool-1:null) Agent state cannot be determined, do nothing
>>>> > >
>>>> > >
>>>> > > If I power on dead node, it goes to state "Connecting" and then "Up"
>>>> > > in management interface.
>>>> > >
>>>> > > 2013-07-11 13:57:24,311 DEBUG [cloud.host.Status] (Thread-6:null) Ping timeout for host 12, do invstigation
>>>> > > 2013-07-11 13:58:24,315 DEBUG [cloud.host.Status] (Thread-6:null) Ping timeout for host 12, do invstigation
>>>> > > 2013-07-11 13:59:24,320 DEBUG [cloud.host.Status] (Thread-6:null) Ping timeout for host 12, do invstigation
>>>> > > 2013-07-11 13:59:57,239 DEBUG [cloud.host.Status] (AgentConnectTaskPool-5:null) Transition:[Resource state = Enabled, Agent event = AgentConnected, Host id = 12, name = ad112.colobridge.net]
>>>> > > 2013-07-11 13:59:57,264 DEBUG [cloud.host.Status] (AgentConnectTaskPool-5:null) Agent status update: [id = 12; name = ad112.colobridge.net; old status = Up; event = AgentConnected; new status = Connecting; old update count = 1285; new update count = 1286]
>>>> > > 2013-07-11 14:00:50,611 DEBUG [cloud.host.Status] (AgentConnectTaskPool-5:null) Transition:[Resource state = Enabled, Agent event = Ready, Host id = 12, name = ad112.colobridge.net]
>>>> > > 2013-07-11 14:00:50,633 DEBUG [cloud.host.Status] (AgentConnectTaskPool-5:null) Agent status update: [id = 12; name = ad112.colobridge.net; old status = Connecting; event = Ready; new status = Up; old update count = 1286; new update count = 1287]
>>>> > >
>>>> > >
>>>> > > If I restart cloud-management service, dead node goes to state
>>>> > > "Disconnected" in management interface.
>>>> > > (there is nothing special in logs in this case)
>>>> > >
>>>> > > If I do nothing, dead node could stay in "Up" state forever (I waited
>>>> > > for 12 hours) in management interface, throwing into logs "Agent state
>>>> > > cannot be determined, do nothing"
>>>> > >
>>>> > > Would appreciate if someone could help/suggest how to deal with
>>>> > > this problem.
>>>> > >
>>>> > > --
>>>> > > Regards,
>>>> > > Valery
>>>> > >
>>>> > > http://protocol.by/slayer
>>>> >
>>>>
>>>>
>>>>
>>>> --
>>>> Regards,
>>>> Valery
>>>>
>>>> http://protocol.by/slayer
>


Re: [URGENT] KVM HA - (FW: cs 4.1 host disconnected status)

Posted by Chiradeep Vittal <Ch...@citrix.com>.
Indeed HA is very tricky as you note. In the generic case where the MS
cannot communicate with the agent, nothing can be concluded and the MS
does nothing.
I dug this up and posted it to the wiki
https://cwiki.apache.org/confluence/x/dwn8AQ


On 7/15/13 1:20 PM, "Marcus Sorensen" <sh...@gmail.com> wrote:

>I don't know much about HA in regards to management server/agent
>connectivity, but it seems to me like this is perilous ground.  If a
>host loses connection with the management server, it seems to me that
>the management server doesn't have the resources to determine whether
>it should start HA-enabled VMs elsewhere. You could very well end up
>with VMs running in two or three places at once, corrupting them, just
>because a host failed to check in. Maybe the agent was stopped (that
>happens all the time). The management server has no fencing
>capability, hence the messages "I don't know, doing nothing", are the
>correct thing to do. That doesn't seem like it's KVM specific,
>however.
>
>I'm very interested in hearing the details on how this HA was intended
>to work, or how it might be working on other platforms.  One solution
>may be to leverage the secondary storage to create locks for VMs, then
>again, when VMs can run without the agent it seems prone to deadlock
>(how does another node take over when another host has the lock, but
>the host seems down, but is actually running the vm?).
>
>On Mon, Jul 15, 2013 at 1:31 AM, Paul Angus <pa...@shapeblue.com>
>wrote:
>> I bumped this from the user list as we've just come across the same
>>issue.
>>
>> CloudStack does not react or even change host status when contact is
>>lost with a KVM host.
>>
>> 2013-07-13 17:53:56,695 DEBUG [cloud.ha.AbstractInvestigatorImpl] (AgentTaskPool-1:null) host (10.0.100.51) cannot be pinged, returning null ('I don't know')
>> 2013-07-13 17:53:56,695 DEBUG [cloud.ha.UserVmDomRInvestigator] (AgentTaskPool-1:null) could not reach agent, could not reach agent's host, returning that we don't have enough information
>> 2013-07-13 17:53:56,695 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (AgentTaskPool-1:null) null unable to determine the state of the host.  Moving on.
>> 2013-07-13 17:53:56,695 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (AgentTaskPool-1:null) null unable to determine the state of the host.  Moving on.
>> 2013-07-13 17:53:56,695 WARN  [agent.manager.AgentManagerImpl] (AgentTaskPool-1:null) Agent state cannot be determined, do nothing
>>
>> HA for KVM is almost useless.
>>
>> I suggest this a blocker for any release until fixed.
>>
>>
>> Regards,
>>
>> Paul Angus
>> S: +44 20 3603 0540 | M: +447711418784 | T: CloudyAngus
>> paul.angus@shapeblue.com
>>
>> -----Original Message-----
>> From: Koushik Das [mailto:koushik.das@citrix.com]
>> Sent: 12 July 2013 12:21
>> To: users@cloudstack.apache.org
>> Subject: RE: cs 4.1 host disconnected status
>>
>> I looked at the logs and none of the existing investigators are able to
>>determine that the host is down. I am not sure if there is a clean way
>>to identify if a host is down in case of KVM. Consider the following
>>cases:
>>
>> 1. Host is actually shutdown
>> 2. Management nic of the host is plugged out of the network but host is
>>up and running
>>
>> There is no clean way to distinguish these cases. Cloudstack should
>>only mark the host as down in the first case. But not sure how one would
>>achieve this.
>>
>> -Koushik
>>
>>> -----Original Message-----
>>> From: Valery Ciareszka [mailto:valery.tereshko@gmail.com]
>>> Sent: Friday, July 12, 2013 2:39 PM
>>> To: users@cloudstack.apache.org
>>> Subject: Re: cs 4.1 host disconnected status
>>>
>>> I've simulated crash again and here is the log:
>>> http://thesuki.org/temp/cs.log.txt
>>> I stripped out of there GET requests with api keys.
>>> Server was switched off at 8:36
>>>
>>> On Fri, Jul 12, 2013 at 11:17 AM, Koushik Das
>>><ko...@citrix.com>wrote:
>>>
>>> > Looks like the KVM investigator is not able to determine the state
>>> > of the agent. Can you share the full log?
>>> >
>>> > > -----Original Message-----
>>> > > From: Valery Ciareszka [mailto:valery.tereshko@gmail.com]
>>> > > Sent: Thursday, July 11, 2013 7:47 PM
>>> > > To: users
>>> > > Subject: cs 4.1 host disconnected status
>>> > >
>>> > > Hi all.
>>> > >
>>> > > I use the following environment: CS 4.1, KVM, Centos 6.4
>>> > > (management+node1+node2), OpenIndiana NFS server as primary and
>>> > > secondary storage.
>>> > > and I have the following problem:
>>> > > If I switch one hypervisor node off via ipmi (simulate server
>>> > > crash), it never goes to Disconnected status in management.
>>> > > Accordingly, ha-enabled VMs are not restarted on another hypervisor
>>> > > node, because it believes that disconnected node is still online.
>>> > >
>>> > >
>>> > > I get following in management server logs:
>>> > >
>>> > > 2013-07-11 10:19:16,153 DEBUG [agent.transport.Request] (AgentManager-Handler-13:null) Seq 19-1133189098: Processing:  { Ans: , MgmtId: 161603152803976, via: 19, Ver: v1, Flags: 10, [{"Answer":{"result":false,"details":"Unable to ping computing host, exiting","wait":0}}] }
>>> > > 2013-07-11 10:19:16,153 DEBUG [agent.transport.Request] (AgentTaskPool-1:null) Seq 19-1133189098: Received:  { Ans: , MgmtId: 161603152803976, via: 19, Ver: v1, Flags: 10, { Answer } }
>>> > > 2013-07-11 10:19:16,153 DEBUG [cloud.ha.AbstractInvestigatorImpl] (AgentTaskPool-1:null) host (172.16.20.241) cannot be pinged, returning null ('I don't know')
>>> > > 2013-07-11 10:19:16,153 DEBUG [cloud.ha.UserVmDomRInvestigator] (AgentTaskPool-1:null) could not reach agent, could not reach agent's host, returning that we don't have enough information
>>> > > 2013-07-11 10:19:16,153 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (AgentTaskPool-1:null) null unable to determine the state of the host.  Moving on.
>>> > > 2013-07-11 10:19:16,153 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (AgentTaskPool-1:null) null unable to determine the state of the host.  Moving on.
>>> > > 2013-07-11 10:19:16,153 WARN  [agent.manager.AgentManagerImpl] (AgentTaskPool-1:null) Agent state cannot be determined, do nothing
>>> > >
>>> > >
>>> > > If I power on dead node, it goes to state "Connecting" and then "Up"
>>> > > in management interface.
>>> > >
>>> > > 2013-07-11 13:57:24,311 DEBUG [cloud.host.Status] (Thread-6:null) Ping timeout for host 12, do invstigation
>>> > > 2013-07-11 13:58:24,315 DEBUG [cloud.host.Status] (Thread-6:null) Ping timeout for host 12, do invstigation
>>> > > 2013-07-11 13:59:24,320 DEBUG [cloud.host.Status] (Thread-6:null) Ping timeout for host 12, do invstigation
>>> > > 2013-07-11 13:59:57,239 DEBUG [cloud.host.Status] (AgentConnectTaskPool-5:null) Transition:[Resource state = Enabled, Agent event = AgentConnected, Host id = 12, name = ad112.colobridge.net]
>>> > > 2013-07-11 13:59:57,264 DEBUG [cloud.host.Status] (AgentConnectTaskPool-5:null) Agent status update: [id = 12; name = ad112.colobridge.net; old status = Up; event = AgentConnected; new status = Connecting; old update count = 1285; new update count = 1286]
>>> > > 2013-07-11 14:00:50,611 DEBUG [cloud.host.Status] (AgentConnectTaskPool-5:null) Transition:[Resource state = Enabled, Agent event = Ready, Host id = 12, name = ad112.colobridge.net]
>>> > > 2013-07-11 14:00:50,633 DEBUG [cloud.host.Status] (AgentConnectTaskPool-5:null) Agent status update: [id = 12; name = ad112.colobridge.net; old status = Connecting; event = Ready; new status = Up; old update count = 1286; new update count = 1287]
>>> > >
>>> > >
>>> > > If I restart cloud-management service, dead node goes to state
>>> > > "Disconnected" in management interface.
>>> > > (there is nothing special in logs in this case)
>>> > >
>>> > > If I do nothing, dead node could stay in "Up" state forever (I waited
>>> > > for 12 hours) in management interface, throwing into logs "Agent state
>>> > > cannot be determined, do nothing"
>>> > >
>>> > > Would appreciate if someone could help/suggest how to deal with
>>> > > this problem.
>>> > >
>>> > > --
>>> > > Regards,
>>> > > Valery
>>> > >
>>> > > http://protocol.by/slayer
>>> >
>>>
>>>
>>>
>>> --
>>> Regards,
>>> Valery
>>>
>>> http://protocol.by/slayer


Re: [URGENT] KVM HA - (FW: cs 4.1 host disconnected status)

Posted by Marcus Sorensen <sh...@gmail.com>.
I don't know much about HA with regard to management server/agent
connectivity, but this seems like perilous ground. If a host loses its
connection to the management server, the management server doesn't
have the means to determine whether it should start HA-enabled VMs
elsewhere. You could very well end up with VMs running in two or three
places at once, corrupting them, just because a host failed to check
in. Maybe the agent was simply stopped (that happens all the time).
The management server has no fencing capability, so "I don't know,
doing nothing" is the correct response. That doesn't seem
KVM-specific, however.
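
To make the "I don't know, doing nothing" behaviour concrete, here is a minimal Python sketch of the decision the quoted logs appear to describe. It is not CloudStack's code: the function names, the investigator signature, and the out-of-band power check are all hypothetical and exist only for illustration.

```python
from typing import Callable, Iterable, Optional

# Hypothetical investigator type: returns True (host is up), False (host is
# confirmed down), or None ("I don't know").
Investigator = Callable[[str], Optional[bool]]


def investigate(host: str, investigators: Iterable[Investigator]) -> Optional[bool]:
    """Return the first definite answer, or None if no investigator can tell."""
    for check in investigators:
        answer = check(host)
        if answer is not None:
            return answer
    return None


def should_restart_vms(host: str, investigators: Iterable[Investigator]) -> bool:
    """Restart HA VMs elsewhere only when the host is confirmed down."""
    return investigate(host, investigators) is False


def ping_failed(host: str) -> Optional[bool]:
    # A failed ping proves nothing: the host may be powered off, or only
    # its management NIC may be unplugged.
    return None


def power_is_off(host: str) -> Optional[bool]:
    # Stand-in for an out-of-band check that can positively confirm "down".
    return False
```

With only `ping_failed` in the list, `should_restart_vms` returns False and nothing is restarted. Adding an investigator that can positively confirm the host is down, such as the stand-in `power_is_off`, is what changes the outcome.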

I'm very interested in hearing the details of how this HA was intended
to work, or how it works on other platforms. One solution may be to
use secondary storage to hold locks for VMs. Then again, since VMs can
keep running without the agent, that seems prone to deadlock: how does
another node take over when a host holds the lock and appears to be
down, but is actually still running the VM?
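
One common way around that kind of deadlock is to make the lock a lease that expires unless its owner keeps renewing it. The Python sketch below only illustrates the idea and is not a proposal for how CloudStack does or should implement it. The `Lease` record, the function names, and the 60-second timeout are hypothetical. It also assumes something nothing in this thread confirms: a host that fails to renew must stop its own VMs before the lease expires.

```python
from dataclasses import dataclass


@dataclass
class Lease:
    owner: str         # host currently allowed to run the VM
    renewed_at: float  # time of the owner's last successful renewal (seconds)


LEASE_TIMEOUT = 60.0  # hypothetical value, in seconds


def renew(lease: Lease, host: str, now: float) -> bool:
    """Owner refreshes its lease; fails if ownership has moved elsewhere."""
    if lease.owner != host:
        return False  # caller must stop the VM: another host owns it now
    lease.renewed_at = now
    return True


def try_take_over(lease: Lease, host: str, now: float) -> bool:
    """Another host may take the lease only once it has expired."""
    if lease.owner == host:
        return True
    if now - lease.renewed_at <= LEASE_TIMEOUT:
        return False  # owner renewed recently; assume the VM still runs there
    lease.owner = host
    lease.renewed_at = now
    return True
```

The safety of this scheme rests entirely on the self-fencing assumption. If the old owner can keep running the VM after losing the lease, the split-brain risk is unchanged.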
