Posted to users@cloudstack.apache.org by Marty Sweet <ms...@gmail.com> on 2013/08/17 19:58:54 UTC

Production Agent Disconnect

Hi Guys,

I have just had a VM host randomly disconnect in production and subsequently
take down some VMs.
I have attached the logs (I happened to be running agent trace on this node).
It would seem that the agent (or management?) waited 25 seconds before
erroring, and then the CloudStack agent froze until 1800.
I assume the agent syslog stack traces were caused by force closes of VMs;
no other nodes were affected during this time period.

While the host was in the disconnected state, I could connect to a VM which
was running on that host, although CloudStack was already reporting that it
was down.
Would it be a good idea to ping VMs (on their allocated IPs) before
attempting to start them on other nodes, especially in an HA setup?
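
To make the suggestion concrete, here is a rough sketch of the kind of check I mean. It is not CloudStack code or a CloudStack API: the function names, the default port (22) and the timeout are my own assumptions. It uses a TCP connect rather than an ICMP ping, since ICMP usually needs raw-socket privileges.

```python
import socket


def vm_reachable(ip, port=22, timeout=2.0):
    """Return True if a TCP connection to ip:port succeeds within timeout.

    Hypothetical helper: port 22 is an assumed default, use whatever
    service the guest is known to expose.
    """
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out: treat as not reachable.
        return False


def safe_to_restart_elsewhere(vm_ips, probe=vm_reachable):
    """Return True only if none of the VM's allocated IPs still answer.

    `probe` can be swapped out, which keeps the decision logic testable
    without touching the network.
    """
    return not any(probe(ip) for ip in vm_ips)
```

A VM that still answers on any of its IPs would be left alone, for example `safe_to_restart_elsewhere(["10.0.0.5"])` with a made-up address. A VM with a firewalled port would look dead to this check, so it could only ever be one signal among several.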

If someone could look at the logs and let me know if there is something
obvious, it would be most appreciated. I have included the management bond
for reference, to show that the link didn't go down.

Thanks in advance,
Marty

Re: Production Agent Disconnect

Posted by Marty Sweet <ms...@gmail.com>.
Following this up, I just found the following errors on my management
server. Very odd, as they are resolved within the same second. Current
settings: ping.interval = 5, ping.timeout (multiplier) = 2.

Thanks again,
Marty

Aug 17 19:04:24 discovery jsvc.exec[31061]: INFO  [agent.manager.AgentMonitor] (Thread-6:) Found the following agents behind on ping: [40, 27, 37, 38, 29, 39]
Aug 17 19:04:24 discovery jsvc.exec[31061]: INFO  [agent.manager.AgentManagerImpl] (AgentTaskPool-15:) Investigating why host 40 has disconnected with event PingTimeout
Aug 17 19:04:24 discovery jsvc.exec[31061]: INFO  [agent.manager.AgentManagerImpl] (AgentTaskPool-8:) Investigating why host 27 has disconnected with event PingTimeout
Aug 17 19:04:24 discovery jsvc.exec[31061]: INFO  [agent.manager.AgentManagerImpl] (AgentTaskPool-4:) Investigating why host 37 has disconnected with event PingTimeout
Aug 17 19:04:24 discovery jsvc.exec[31061]: INFO  [agent.manager.AgentManagerImpl] (AgentTaskPool-5:) Investigating why host 38 has disconnected with event PingTimeout
Aug 17 19:04:24 discovery jsvc.exec[31061]: INFO  [agent.manager.AgentManagerImpl] (AgentTaskPool-16:) Investigating why host 29 has disconnected with event PingTimeout
Aug 17 19:04:24 discovery jsvc.exec[31061]: INFO  [agent.manager.AgentManagerImpl] (AgentTaskPool-9:) Investigating why host 39 has disconnected with event PingTimeout
Aug 17 19:04:24 discovery jsvc.exec[31061]: INFO  [agent.manager.AgentManagerImpl] (AgentTaskPool-5:) The state determined is Up
Aug 17 19:04:24 discovery jsvc.exec[31061]: INFO  [agent.manager.AgentManagerImpl] (AgentTaskPool-5:) Agent is determined to be up and running
Aug 17 19:04:24 discovery jsvc.exec[31061]: INFO  [agent.manager.AgentManagerImpl] (AgentTaskPool-4:) The state determined is Up
Aug 17 19:04:24 discovery jsvc.exec[31061]: INFO  [agent.manager.AgentManagerImpl] (AgentTaskPool-4:) Agent is determined to be up and running
Aug 17 19:04:24 discovery jsvc.exec[31061]: INFO  [agent.manager.AgentManagerImpl] (AgentTaskPool-8:) The state determined is Up
Aug 17 19:04:24 discovery jsvc.exec[31061]: INFO  [agent.manager.AgentManagerImpl] (AgentTaskPool-8:) Agent is determined to be up and running
Aug 17 19:04:24 discovery jsvc.exec[31061]: INFO  [agent.manager.AgentManagerImpl] (AgentTaskPool-15:) The state determined is Up
Aug 17 19:04:24 discovery jsvc.exec[31061]: INFO  [agent.manager.AgentManagerImpl] (AgentTaskPool-15:) Agent is determined to be up and running
Aug 17 19:04:24 discovery jsvc.exec[31061]: INFO  [agent.manager.AgentManagerImpl] (AgentTaskPool-16:) The state determined is Up
Aug 17 19:04:24 discovery jsvc.exec[31061]: INFO  [agent.manager.AgentManagerImpl] (AgentTaskPool-16:) Agent is determined to be up and running
Aug 17 19:04:24 discovery jsvc.exec[31061]: INFO  [agent.manager.AgentManagerImpl] (AgentTaskPool-9:) The state determined is Up
Aug 17 19:04:24 discovery jsvc.exec[31061]: INFO  [agent.manager.AgentManagerImpl] (AgentTaskPool-9:) Agent is determined to be up and running
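
For reference, here is the arithmetic I believe is behind the "behind on ping" message, as a small sketch. It assumes the effective timeout is simply ping.interval multiplied by the ping.timeout multiplier, which I have not verified against the management server source; the function names and the sample timestamps are made up for illustration.

```python
def ping_timeout_seconds(ping_interval, ping_timeout_multiplier):
    """Effective timeout, assuming it is interval * multiplier."""
    return ping_interval * ping_timeout_multiplier


def agents_behind(last_ping, now, ping_interval, ping_timeout_multiplier):
    """Return the host IDs whose last ping is older than the cutoff.

    last_ping maps host ID -> time of the last ping received, in seconds.
    """
    cutoff = now - ping_timeout_seconds(ping_interval, ping_timeout_multiplier)
    return sorted(host for host, seen in last_ping.items() if seen < cutoff)


# With ping.interval = 5 and a multiplier of 2, the window is 10 seconds.
window = ping_timeout_seconds(5, 2)

# Hypothetical timestamps: only the host last seen 11 seconds ago is flagged.
late = agents_behind({40: 100, 27: 89, 37: 95}, now=100,
                     ping_interval=5, ping_timeout_multiplier=2)
```

If that assumption holds, a 10 second window leaves very little slack: a brief stall on the management server side could put several agents behind at once, which would fit six hosts being flagged and cleared within the same second.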



