Posted to users@cloudstack.apache.org by Indra Pramana <in...@sg.or.id> on 2016/02/05 17:24:00 UTC

KVM host agent disconnection

Dear all,

We are using CloudStack 4.2.0 with the KVM hypervisor and Ceph RBD for primary
storage. Over the past week, many of our KVM host agents have frequently
disconnected from the management server, causing VMs to go down because of HA
work. We have had host disconnections in the past, but normally they affected
only a single host; this time, when the problem happens, it hits multiple
hosts, up to 4-5 hosts at the same time.

I cannot find much in either management-server.log or agent.log; there are no
significant warnings, errors or exceptions logged before the disconnection.
Here are sample logs from the agent:

===
2016-02-05 03:20:28,820 ERROR [cloud.agent.Agent] (UgentTask-7:null) Ping
Interval has gone past 300000.  Attempting to reconnect.
2016-02-05 03:20:28,825 DEBUG [cloud.agent.Agent] (UgentTask-7:null)
Clearing watch list: 2
2016-02-05 03:20:28,825 DEBUG [utils.nio.NioConnection]
(Agent-Selector:null) Closing socket
Socket[addr=/*.*.3.3,port=8250,localport=50489]
2016-02-05 03:20:33,825 INFO  [cloud.agent.Agent] (UgentTask-7:null) Lost
connection to the server. Dealing with the remaining commands...
2016-02-05 03:20:38,826 INFO  [cloud.agent.Agent] (UgentTask-7:null)
Reconnecting...
2016-02-05 03:20:38,829 INFO  [utils.nio.NioClient] (Agent-Selector:null)
Connecting to *.*.3.3:8250
2016-02-05 03:20:38,925 INFO  [utils.nio.NioClient] (Agent-Selector:null)
SSL: Handshake done
2016-02-05 03:20:38,926 INFO  [utils.nio.NioClient] (Agent-Selector:null)
Connected to *.*.3.3:8250
2016-02-05 03:20:43,926 INFO  [cloud.agent.Agent] (UgentTask-7:null)
Connected to the server
===

Sometimes the CloudStack agent is not able to reconnect unless we manually
stop and start the agent (the restart commands are shown after this log
excerpt):

===
2016-02-05 03:22:20,330 ERROR [cloud.agent.Agent] (UgentTask-6:null) Ping
Interval has gone past 300000.  Attempting to reconnect.
2016-02-05 03:22:20,331 DEBUG [cloud.agent.Agent] (UgentTask-6:null)
Clearing watch list: 2
2016-02-05 03:22:20,353 DEBUG [utils.nio.NioConnection]
(Agent-Selector:null) Closing socket
Socket[addr=/*.*.3.3,port=8250,localport=46231]
2016-02-05 03:22:25,332 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
connection to the server. Dealing with the remaining commands...
2016-02-05 03:22:25,332 INFO  [cloud.agent.Agent] (UgentTask-6:null) Cannot
connect because we still have 3 commands in progress.
2016-02-05 03:22:30,333 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
connection to the server. Dealing with the remaining commands...
2016-02-05 03:22:30,333 INFO  [cloud.agent.Agent] (UgentTask-6:null) Cannot
connect because we still have 3 commands in progress.
2016-02-05 03:22:35,333 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
connection to the server. Dealing with the remaining commands...
2016-02-05 03:22:35,334 INFO  [cloud.agent.Agent] (UgentTask-6:null) Cannot
connect because we still have 3 commands in progress.
2016-02-05 03:22:40,334 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
connection to the server. Dealing with the remaining commands...
2016-02-05 03:22:40,335 INFO  [cloud.agent.Agent] (UgentTask-6:null) Cannot
connect because we still have 3 commands in progress.
2016-02-05 03:22:45,335 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
connection to the server. Dealing with the remaining commands...
2016-02-05 03:22:45,335 INFO  [cloud.agent.Agent] (UgentTask-6:null) Cannot
connect because we still have 3 commands in progress.
2016-02-05 03:22:50,336 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
connection to the server. Dealing with the remaining commands...
2016-02-05 03:22:50,336 INFO  [cloud.agent.Agent] (UgentTask-6:null) Cannot
connect because we still have 3 commands in progress.
2016-02-05 03:22:55,337 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
connection to the server. Dealing with the remaining commands...
2016-02-05 03:22:55,337 INFO  [cloud.agent.Agent] (UgentTask-6:null) Cannot
connect because we still have 3 commands in progress.
2016-02-05 03:23:00,338 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
connection to the server. Dealing with the remaining commands...
2016-02-05 03:23:00,339 INFO  [cloud.agent.Agent] (UgentTask-6:null) Cannot
connect because we still have 3 commands in progress.
2016-02-05 03:23:05,339 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
connection to the server. Dealing with the remaining commands...
2016-02-05 03:23:05,340 INFO  [cloud.agent.Agent] (UgentTask-6:null) Cannot
connect because we still have 3 commands in progress.
2016-02-05 03:23:10,340 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
connection to the server. Dealing with the remaining commands...
2016-02-05 03:23:10,340 INFO  [cloud.agent.Agent] (UgentTask-6:null) Cannot
connect because we still have 3 commands in progress.
2016-02-05 03:23:15,341 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
connection to the server. Dealing with the remaining commands...
2016-02-05 03:23:15,341 INFO  [cloud.agent.Agent] (UgentTask-6:null) Cannot
connect because we still have 3 commands in progress.
2016-02-05 03:23:16,945 INFO  [cloud.agent.Agent]
(AgentShutdownThread:null) Stopping the agent: Reason = sig.kill
2016-02-05 03:23:16,946 DEBUG [cloud.agent.Agent]
(AgentShutdownThread:null) Sending shutdown to management server
2016-02-05 03:23:17,949 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
connection to the server. Dealing with the remaining commands...
2016-02-05 03:23:17,949 INFO  [cloud.agent.Agent] (UgentTask-6:null) Cannot
connect because we still have 3 commands in progress.
===
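
For reference, the manual stop/start is just the standard service commands; a
minimal example on Ubuntu 12.04, assuming the default cloudstack-agent package
(service name and log path may differ on your installation):

===
# stop the agent and confirm the process is really gone
service cloudstack-agent stop
ps aux | grep -i [c]loudstack-agent

# start it again and watch the log for the reconnect
service cloudstack-agent start
tail -f /var/log/cloudstack/agent/agent.log
===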

When the problem happens, I can confirm that the CPU and memory load on the
hosts is healthy and they are not overloaded. I understand that this kind of
problem can happen when a hypervisor host is overloaded, but that is not the
case here, especially since the problem now strikes up to 4-5 hypervisor hosts
at the same time.
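
To give an idea, the load check is nothing more than the usual basic commands
on each host (nothing CloudStack-specific):

===
uptime          # load average
free -m         # memory usage in MB
vmstat 1 5      # CPU, memory and I/O sampled every second, 5 samples
===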

On the management server, here are the typical logs:

===
2016-02-05 03:20:28,828 INFO  [agent.manager.AgentManagerImpl]
(AgentTaskPool-10:null) Investigating why host 72 has disconnected with
event AgentDisconnected
2016-02-05 03:20:28,828 DEBUG [agent.manager.AgentManagerImpl]
(AgentTaskPool-10:null) checking if agent (72) is alive
2016-02-05 03:20:28,833 DEBUG [agent.transport.Request]
(AgentTaskPool-10:null) Seq 72-215557373: Sending  { Cmd , MgmtId:
161342671900, via: 72, Ver: v1, Flags: 100011
, [{"com.cloud.agent.api.CheckHealthCommand":{"wait":50}}] }
2016-02-05 03:20:28,833 INFO  [agent.manager.AgentAttache]
(AgentTaskPool-10:null) Seq 72-215557373: Unable to send due to Resource
[Host:72] is unreachable: Host 72: Channel is closed
2016-02-05 03:20:28,833 DEBUG [agent.manager.AgentAttache]
(AgentTaskPool-10:null) Seq 72-215557373: Cancelling.
2016-02-05 03:20:28,833 WARN  [agent.manager.AgentManagerImpl]
(AgentTaskPool-10:null) Resource [Host:72] is unreachable: Host 72: Channel
is closed
2016-02-05 03:20:28,836 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
(AgentTaskPool-10:null) SimpleInvestigator unable to determine the state of
the host.  Moving on.
2016-02-05 03:20:28,836 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
(AgentTaskPool-10:null) XenServerInvestigator unable to determine the state
of the host.  Moving on.
2016-02-05 03:20:28,836 DEBUG [cloud.ha.UserVmDomRInvestigator]
(AgentTaskPool-10:null) checking if agent (72) is alive
2016-02-05 03:20:28,840 DEBUG [cloud.ha.UserVmDomRInvestigator]
(AgentTaskPool-10:null) sending ping from (34) to agent's host ip address
(*.*.3.25)
2016-02-05 03:20:28,843 DEBUG [agent.transport.Request]
(AgentTaskPool-10:null) Seq 34-815540048: Sending  { Cmd , MgmtId:
161342671900, via: 34, Ver: v1, Flags: 100011
,
[{"com.cloud.agent.api.PingTestCommand":{"_computingHostIp":"*.*.3.25","wait":20}}]
}
2016-02-05 03:20:29,024 DEBUG [agent.transport.Request]
(AgentManager-Handler-14:null) Seq 34-815540048: Processing:  { Ans: ,
MgmtId: 161342671900, via: 34, Ver: v1, Flags: 10,
[{"com.cloud.agent.api.Answer":{"result":true,"wait":0}}] }
2016-02-05 03:20:29,024 DEBUG [agent.transport.Request]
(AgentTaskPool-10:null) Seq 34-815540048: Received:  { Ans: , MgmtId:
161342671900, via: 34, Ver: v1, Flags: 10,
 { Answer } }
2016-02-05 03:20:29,024 DEBUG [cloud.ha.AbstractInvestigatorImpl]
(AgentTaskPool-10:null) host (*.*.3.25) has been successfully pinged,
returning that host is up
2016-02-05 03:20:29,024 DEBUG [cloud.ha.UserVmDomRInvestigator]
(AgentTaskPool-10:null) ping from (34) to agent's host ip address
(*.*.3.25) successful, returning that agent is disconnected
2016-02-05 03:20:29,024 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
(AgentTaskPool-10:null) null was able to determine host 72 is in
Disconnected
2016-02-05 03:20:29,024 INFO  [agent.manager.AgentManagerImpl]
(AgentTaskPool-10:null) The state determined is Disconnected
2016-02-05 03:20:29,024 WARN  [agent.manager.AgentManagerImpl]
(AgentTaskPool-10:null) Agent is disconnected but the host is still up:
72-**-kvm-05
2016-02-05 03:20:29,025 WARN  [apache.cloudstack.alerts]
(AgentTaskPool-10:null)  alertType:: 7 // dataCenterId:: 6 // podId:: 6 //
clusterId:: null // message:: Host disconnected, name: **-kvm-05 (id:72),
availability zone: *****-******-01, pod: *****-******-Pod-01
2016-02-05 03:20:29,086 INFO  [agent.manager.AgentManagerImpl]
(AgentTaskPool-10:null) Host 72 is disconnecting with event
AgentDisconnected
2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
(AgentTaskPool-10:null) The next status of agent 72is Alert, current status
is Up
2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
(AgentTaskPool-10:null) Deregistering link for 72 with state Alert
2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
(AgentTaskPool-10:null) Remove Agent : 72
2016-02-05 03:20:29,089 DEBUG [agent.manager.ConnectedAgentAttache]
(AgentTaskPool-10:null) Processing Disconnect.
2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentAttache]
(AgentTaskPool-10:null) Seq 72-215547906: Sending disconnect to class
com.cloud.network.security.SecurityGroupListener
2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
(AgentTaskPool-10:null) Sending Disconnect to listener:
com.cloud.hypervisor.xen.discoverer.XcpServerDiscoverer_EnhancerByCloudStack_8f0e0536
2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
(AgentTaskPool-10:null) Sending Disconnect to listener:
com.cloud.deploy.DeploymentPlanningManagerImpl_EnhancerByCloudStack_55d28fac
2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
(AgentTaskPool-10:null) Sending Disconnect to listener:
com.cloud.network.NetworkManagerImpl_EnhancerByCloudStack_6763a13f
2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
(AgentTaskPool-10:null) Sending Disconnect to listener:
com.cloud.storage.secondary.SecondaryStorageListener
===

Can anyone advise how I can start troubleshooting to find the root cause of
the problem, so that I can take measures to prevent a similar problem from
happening again?
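
So far the only additional checks I can think of are verifying that the hosts
can still reach the management server on port 8250 and grepping the agent log
around the time of the disconnection (the IP and log path below are
placeholders for our setup):

===
# test TCP connectivity from a KVM host to the management server port
nc -zv <management-server-ip> 8250

# look for warnings/errors around the disconnection time
grep -iE "warn|error|exception" /var/log/cloudstack/agent/agent.log | tail -50
===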

Any advice is greatly appreciated.

Looking forward to your reply, thank you.

Cheers.

-ip-

Re: KVM host agent disconnection

Posted by Indra Pramana <in...@sg.or.id>.
Hi Wido and all,

Good day to you.

In addition to my previous email, I noted that the latest released version
of ACS is 4.7. May I know whether the problem is resolved in 4.7? I don't
think 4.8 is available from the ACS repository yet, unless we get the source
and compile it ourselves.

https://cloudstack.apache.org/downloads.html

I also noted that the latest version of ACS, 4.7, only supports Ubuntu 14.04,
while we are using Ubuntu 12.04 for our management server and all our KVM
hosts. Will ACS 4.7 work on Ubuntu 12.04?

I found the documentation below on how to upgrade from 4.2 to 4.7:

http://docs.cloudstack.apache.org/projects/cloudstack-release-notes/en/4.7.0/upgrade/upgrade-4.2.html

It seems quite straightforward, but I noticed that the upgrade involves
installing new system VM templates and restarting all the system VMs, which
will cause downtime.
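
From what I can tell from the release notes, the new system VM template is
registered before the upgrade, either via the UI or the API; a rough sketch
with cloudmonkey would be something like the following, where the URL, OS type
and zone IDs are placeholders and the exact values and extra flags come from
the release notes:

===
cloudmonkey register template name=systemvm-kvm-4.6 displaytext=systemvm-kvm-4.6 \
  format=QCOW2 hypervisor=KVM \
  url=http://<template-mirror>/systemvm64template-<version>-kvm.qcow2.bz2 \
  zoneid=<zone-uuid> ostypeid=<ostype-uuid> ispublic=true
===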

Has anyone performed an upgrade from ACS 4.2 to 4.7 before and can you share
your experience and give some advice or tips?

Thank you.


On Sat, Feb 6, 2016 at 8:01 PM, Indra Pramana <in...@sg.or.id> wrote:

> Hi Wido,
>
> Good day to you, and thanks for your reply. Nice to hear from you again. :)
>
> So is this caused by a bug on 4.2 which is resolved on newer version of
> ACS? Any specific information on the bug, e.g. bug ID and description on
> how to fix it? Is there a way I can resolve the problem without having to
> upgrade?
>
> Is there any documentation I can follow on how to upgrade from 4.2 to 4.8?
> Will this be quite straight-forward or will this involve many steps? We are
> running a production environment and we don't have staging / test
> environment to play with.
>
> Looking forward to your reply, thank you.
>
> Cheers.
>
> On Sat, Feb 6, 2016 at 3:48 PM, Wido den Hollander <wi...@widodh.nl> wrote:
>
>> Hi,
>>
>> > On 5 February 2016 at 17:24, Indra Pramana <in...@sg.or.id> wrote:
>> >
>> >
>> > Dear all,
>> >
>> > We are using CloudStack 4.2.0, KVM hypervisor and Ceph RBD for primary
>> > storage. In the past one week, many of our KVM host agents would often
>> be
>> > disconnected from the management server, causing the VMs to go down
>> because
>> > of HA work. While we used to have host disconnection in the past,
>> normally
>> > it would only affect just one host, but this time round, when the
>> problem
>> > happens, it would happen on multiple hosts, up to 4-5 hosts at the same
>> > time.
>> >
>>
>> Any reason to still run 4.2? I've seen this happen as well and I haven't
>> seen
>> this with recent versions of ACS.
>>
>> Could you maybe upgrade to 4.8?
>>
>> Wido
>>
>> > Nothing much I can find on both the management-server.log and agent.log,
>> > with no significant warn, error or exceptions logged before the
>> > disconnection. Here are the sample logs from the agent:
>> >
>> > ===
>> > 2016-02-05 03:20:28,820 ERROR [cloud.agent.Agent] (UgentTask-7:null)
>> Ping
>> > Interval has gone past 300000.  Attempting to reconnect.
>> > 2016-02-05 03:20:28,825 DEBUG [cloud.agent.Agent] (UgentTask-7:null)
>> > Clearing watch list: 2
>> > 2016-02-05 03:20:28,825 DEBUG [utils.nio.NioConnection]
>> > (Agent-Selector:null) Closing socket
>> > Socket[addr=/*.*.3.3,port=8250,localport=50489]
>> > 2016-02-05 03:20:33,825 INFO  [cloud.agent.Agent] (UgentTask-7:null)
>> Lost
>> > connection to the server. Dealing with the remaining commands...
>> > 2016-02-05 03:20:38,826 INFO  [cloud.agent.Agent] (UgentTask-7:null)
>> > Reconnecting...
>> > 2016-02-05 03:20:38,829 INFO  [utils.nio.NioClient]
>> (Agent-Selector:null)
>> > Connecting to *.*.3.3:8250
>> > 2016-02-05 03:20:38,925 INFO  [utils.nio.NioClient]
>> (Agent-Selector:null)
>> > SSL: Handshake done
>> > 2016-02-05 03:20:38,926 INFO  [utils.nio.NioClient]
>> (Agent-Selector:null)
>> > Connected to *.*.3.3:8250
>> > 2016-02-05 03:20:43,926 INFO  [cloud.agent.Agent] (UgentTask-7:null)
>> > Connected to the server
>> > ===
>> >
>> > Sometimes, the Cloudstack agent will not be able to re-connect unless
>> if we
>> > stop and start the agent again manually:
>> >
>> > ===
>> > 2016-02-05 03:22:20,330 ERROR [cloud.agent.Agent] (UgentTask-6:null)
>> Ping
>> > Interval has gone past 300000.  Attempting to reconnect.
>> > 2016-02-05 03:22:20,331 DEBUG [cloud.agent.Agent] (UgentTask-6:null)
>> > Clearing watch list: 2
>> > 2016-02-05 03:22:20,353 DEBUG [utils.nio.NioConnection]
>> > (Agent-Selector:null) Closing socket
>> > Socket[addr=/*.*.3.3,port=8250,localport=46231]
>> > 2016-02-05 03:22:25,332 INFO  [cloud.agent.Agent] (UgentTask-6:null)
>> Lost
>> > connection to the server. Dealing with the remaining commands...
>> > 2016-02-05 03:22:25,332 INFO  [cloud.agent.Agent] (UgentTask-6:null)
>> Cannot
>> > connect because we still have 3 commands in progress.
>> > 2016-02-05 03:22:30,333 INFO  [cloud.agent.Agent] (UgentTask-6:null)
>> Lost
>> > connection to the server. Dealing with the remaining commands...
>> > 2016-02-05 03:22:30,333 INFO  [cloud.agent.Agent] (UgentTask-6:null)
>> Cannot
>> > connect because we still have 3 commands in progress.
>> > 2016-02-05 03:22:35,333 INFO  [cloud.agent.Agent] (UgentTask-6:null)
>> Lost
>> > connection to the server. Dealing with the remaining commands...
>> > 2016-02-05 03:22:35,334 INFO  [cloud.agent.Agent] (UgentTask-6:null)
>> Cannot
>> > connect because we still have 3 commands in progress.
>> > 2016-02-05 03:22:40,334 INFO  [cloud.agent.Agent] (UgentTask-6:null)
>> Lost
>> > connection to the server. Dealing with the remaining commands...
>> > 2016-02-05 03:22:40,335 INFO  [cloud.agent.Agent] (UgentTask-6:null)
>> Cannot
>> > connect because we still have 3 commands in progress.
>> > 2016-02-05 03:22:45,335 INFO  [cloud.agent.Agent] (UgentTask-6:null)
>> Lost
>> > connection to the server. Dealing with the remaining commands...
>> > 2016-02-05 03:22:45,335 INFO  [cloud.agent.Agent] (UgentTask-6:null)
>> Cannot
>> > connect because we still have 3 commands in progress.
>> > 2016-02-05 03:22:50,336 INFO  [cloud.agent.Agent] (UgentTask-6:null)
>> Lost
>> > connection to the server. Dealing with the remaining commands...
>> > 2016-02-05 03:22:50,336 INFO  [cloud.agent.Agent] (UgentTask-6:null)
>> Cannot
>> > connect because we still have 3 commands in progress.
>> > 2016-02-05 03:22:55,337 INFO  [cloud.agent.Agent] (UgentTask-6:null)
>> Lost
>> > connection to the server. Dealing with the remaining commands...
>> > 2016-02-05 03:22:55,337 INFO  [cloud.agent.Agent] (UgentTask-6:null)
>> Cannot
>> > connect because we still have 3 commands in progress.
>> > 2016-02-05 03:23:00,338 INFO  [cloud.agent.Agent] (UgentTask-6:null)
>> Lost
>> > connection to the server. Dealing with the remaining commands...
>> > 2016-02-05 03:23:00,339 INFO  [cloud.agent.Agent] (UgentTask-6:null)
>> Cannot
>> > connect because we still have 3 commands in progress.
>> > 2016-02-05 03:23:05,339 INFO  [cloud.agent.Agent] (UgentTask-6:null)
>> Lost
>> > connection to the server. Dealing with the remaining commands...
>> > 2016-02-05 03:23:05,340 INFO  [cloud.agent.Agent] (UgentTask-6:null)
>> Cannot
>> > connect because we still have 3 commands in progress.
>> > 2016-02-05 03:23:10,340 INFO  [cloud.agent.Agent] (UgentTask-6:null)
>> Lost
>> > connection to the server. Dealing with the remaining commands...
>> > 2016-02-05 03:23:10,340 INFO  [cloud.agent.Agent] (UgentTask-6:null)
>> Cannot
>> > connect because we still have 3 commands in progress.
>> > 2016-02-05 03:23:15,341 INFO  [cloud.agent.Agent] (UgentTask-6:null)
>> Lost
>> > connection to the server. Dealing with the remaining commands...
>> > 2016-02-05 03:23:15,341 INFO  [cloud.agent.Agent] (UgentTask-6:null)
>> Cannot
>> > connect because we still have 3 commands in progress.
>> > 2016-02-05 03:23:16,945 INFO  [cloud.agent.Agent]
>> > (AgentShutdownThread:null) Stopping the agent: Reason = sig.kill
>> > 2016-02-05 03:23:16,946 DEBUG [cloud.agent.Agent]
>> > (AgentShutdownThread:null) Sending shutdown to management server
>> > 2016-02-05 03:23:17,949 INFO  [cloud.agent.Agent] (UgentTask-6:null)
>> Lost
>> > connection to the server. Dealing with the remaining commands...
>> > 2016-02-05 03:23:17,949 INFO  [cloud.agent.Agent] (UgentTask-6:null)
>> Cannot
>> > connect because we still have 3 commands in progress.
>> > ===
>> >
>> > During the time when the problem happens, I can confirm that the CPU and
>> > memory load on the server is healthy and not overloaded. I understand
>> that
>> > such problem can happen when the hypervisor host is overloaded, however
>> > this is not the case -- especially since when such problem happens now,
>> it
>> > will strike up to 4-5 hypervisor hosts at the same time.
>> >
>> > On the management server, here are the typical logs:
>> >
>> > ===
>> > 2016-02-05 03:20:28,828 INFO  [agent.manager.AgentManagerImpl]
>> > (AgentTaskPool-10:null) Investigating why host 72 has disconnected with
>> > event AgentDisconnected
>> > 2016-02-05 03:20:28,828 DEBUG [agent.manager.AgentManagerImpl]
>> > (AgentTaskPool-10:null) checking if agent (72) is alive
>> > 2016-02-05 03:20:28,833 DEBUG [agent.transport.Request]
>> > (AgentTaskPool-10:null) Seq 72-215557373: Sending  { Cmd , MgmtId:
>> > 161342671900, via: 72, Ver: v1, Flags: 100011
>> > , [{"com.cloud.agent.api.CheckHealthCommand":{"wait":50}}] }
>> > 2016-02-05 03:20:28,833 INFO  [agent.manager.AgentAttache]
>> > (AgentTaskPool-10:null) Seq 72-215557373: Unable to send due to Resource
>> > [Host:72] is unreachable: Host 72: C
>> > hannel is closed
>> > 2016-02-05 03:20:28,833 DEBUG [agent.manager.AgentAttache]
>> > (AgentTaskPool-10:null) Seq 72-215557373: Cancelling.
>> > 2016-02-05 03:20:28,833 WARN  [agent.manager.AgentManagerImpl]
>> > (AgentTaskPool-10:null) Resource [Host:72] is unreachable: Host 72:
>> Channel
>> > is closed
>> > 2016-02-05 03:20:28,836 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
>> > (AgentTaskPool-10:null) SimpleInvestigator unable to determine the
>> state of
>> > the host.  Moving on.
>> > 2016-02-05 03:20:28,836 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
>> > (AgentTaskPool-10:null) XenServerInvestigator unable to determine the
>> state
>> > of the host.  Moving on
>> > .
>> > 2016-02-05 03:20:28,836 DEBUG [cloud.ha.UserVmDomRInvestigator]
>> > (AgentTaskPool-10:null) checking if agent (72) is alive
>> > 2016-02-05 03:20:28,840 DEBUG [cloud.ha.UserVmDomRInvestigator]
>> > (AgentTaskPool-10:null) sending ping from (34) to agent's host ip
>> address
>> > (*.*.3.25)
>> > 2016-02-05 03:20:28,843 DEBUG [agent.transport.Request]
>> > (AgentTaskPool-10:null) Seq 34-815540048: Sending  { Cmd , MgmtId:
>> > 161342671900, via: 34, Ver: v1, Flags: 100011
>> > ,
>> >
>> [{"com.cloud.agent.api.PingTestCommand":{"_computingHostIp":"*.*.3.25","wait":20}}]
>> > }
>> > 2016-02-05 03:20:29,024 DEBUG [agent.transport.Request]
>> > (AgentManager-Handler-14:null) Seq 34-815540048: Processing:  { Ans: ,
>> > MgmtId: 161342671900, via: 34, Ver: v1, F
>> > lags: 10, [{"com.cloud.agent.api.Answer":{"result":true,"wait":0}}] }
>> > 2016-02-05 03:20:29,024 DEBUG [agent.transport.Request]
>> > (AgentTaskPool-10:null) Seq 34-815540048: Received:  { Ans: , MgmtId:
>> > 161342671900, via: 34, Ver: v1, Flags: 10,
>> >  { Answer } }
>> > 2016-02-05 03:20:29,024 DEBUG [cloud.ha.AbstractInvestigatorImpl]
>> > (AgentTaskPool-10:null) host (*.*.3.25) has been successfully pinged,
>> > returning that host is up
>> > 2016-02-05 03:20:29,024 DEBUG [cloud.ha.UserVmDomRInvestigator]
>> > (AgentTaskPool-10:null) ping from (34) to agent's host ip address
>> > (*.*.3.25) successful, returning th
>> > at agent is disconnected
>> > 2016-02-05 03:20:29,024 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
>> > (AgentTaskPool-10:null) null was able to determine host 72 is in
>> > Disconnected
>> > 2016-02-05 03:20:29,024 INFO  [agent.manager.AgentManagerImpl]
>> > (AgentTaskPool-10:null) The state determined is Disconnected
>> > 2016-02-05 03:20:29,024 WARN  [agent.manager.AgentManagerImpl]
>> > (AgentTaskPool-10:null) Agent is disconnected but the host is still up:
>> > 72-**-kvm-05
>> > 2016-02-05 03:20:29,025 WARN  [apache.cloudstack.alerts]
>> > (AgentTaskPool-10:null)  alertType:: 7 // dataCenterId:: 6 // podId:: 6
>> //
>> > clusterId:: null // message:: Host disconnected, name: **-kvm-05
>> (id:72),
>> > availability zone: *****-******-01, pod: *****-******-Pod-01
>> > 2016-02-05 03:20:29,086 INFO  [agent.manager.AgentManagerImpl]
>> > (AgentTaskPool-10:null) Host 72 is disconnecting with event
>> > AgentDisconnected
>> > 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
>> > (AgentTaskPool-10:null) The next status of agent 72is Alert, current
>> status
>> > is Up
>> > 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
>> > (AgentTaskPool-10:null) Deregistering link for 72 with state Alert
>> > 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
>> > (AgentTaskPool-10:null) Remove Agent : 72
>> > 2016-02-05 03:20:29,089 DEBUG [agent.manager.ConnectedAgentAttache]
>> > (AgentTaskPool-10:null) Processing Disconnect.
>> > 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentAttache]
>> > (AgentTaskPool-10:null) Seq 72-215547906: Sending disconnect to class
>> > com.cloud.network.security.SecurityGroupListener
>> > 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
>> > (AgentTaskPool-10:null) Sending Disconnect to listener:
>> >
>> com.cloud.hypervisor.xen.discoverer.XcpServerDiscoverer_EnhancerByCloudStack_8f0e0536
>> > 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
>> > (AgentTaskPool-10:null) Sending Disconnect to listener:
>> >
>> com.cloud.deploy.DeploymentPlanningManagerImpl_EnhancerByCloudStack_55d28fac
>> > 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
>> > (AgentTaskPool-10:null) Sending Disconnect to listener:
>> > com.cloud.network.NetworkManagerImpl_EnhancerByCloudStack_6763a13f
>> > 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
>> > (AgentTaskPool-10:null) Sending Disconnect to listener:
>> > com.cloud.storage.secondary.SecondaryStorageListener
>> > ===
>> >
>> > Can anyone advise how can I start troubleshooting to find the root
>> cause of
>> > the problem? So that I can take whatever measures to prevent similar
>> > problem from happening.
>> >
>> > Any advice is greatly appreciated.
>> >
>> > Looking forward to your reply, thank you.
>> >
>> > Cheers.
>> >
>> > -ip-
>>
>
>

Re: KVM host agent disconnection

Posted by Wido den Hollander <wi...@widodh.nl>.
> On 6 February 2016 at 13:01, Indra Pramana <in...@sg.or.id> wrote:
> 
> 
> Hi Wido,
> 
> Good day to you, and thanks for your reply. Nice to hear from you again. :)
> 
> So is this caused by a bug on 4.2 which is resolved on newer version of
> ACS? Any specific information on the bug, e.g. bug ID and description on
> how to fix it? Is there a way I can resolve the problem without having to
> upgrade?
> 

I don't know which bug it might be. I just haven't seen it after 4.2 myself.

> Is there any documentation I can follow on how to upgrade from 4.2 to 4.8?
> Will this be quite straight-forward or will this involve many steps? We are
> running a production environment and we don't have staging / test
> environment to play with.
> 

Follow the docs as you should. Make sure the new SSVM is registered and you have
a backup of the SQL database before you start.

The docs should tell you.
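
A plain dump of the databases is enough for that backup, for example
(credentials as configured on your management server):

===
mysqldump -u root -p --databases cloud cloud_usage > cloudstack-db-backup-$(date +%F).sql
===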

Wido

> Looking forward to your reply, thank you.
> 
> Cheers.
> 
> On Sat, Feb 6, 2016 at 3:48 PM, Wido den Hollander <wi...@widodh.nl> wrote:
> 
> > Hi,
> >
> > > On 5 February 2016 at 17:24, Indra Pramana <in...@sg.or.id> wrote:
> > >
> > >
> > > Dear all,
> > >
> > > We are using CloudStack 4.2.0, KVM hypervisor and Ceph RBD for primary
> > > storage. In the past one week, many of our KVM host agents would often be
> > > disconnected from the management server, causing the VMs to go down
> > because
> > > of HA work. While we used to have host disconnection in the past,
> > normally
> > > it would only affect just one host, but this time round, when the problem
> > > happens, it would happen on multiple hosts, up to 4-5 hosts at the same
> > > time.
> > >
> >
> > Any reason to still run 4.2? I've seen this happen as well and I haven't
> > seen
> > this with recent versions of ACS.
> >
> > Could you maybe upgrade to 4.8?
> >
> > Wido
> >
> > > Nothing much I can find on both the management-server.log and agent.log,
> > > with no significant warn, error or exceptions logged before the
> > > disconnection. Here are the sample logs from the agent:
> > >
> > > ===
> > > 2016-02-05 03:20:28,820 ERROR [cloud.agent.Agent] (UgentTask-7:null) Ping
> > > Interval has gone past 300000.  Attempting to reconnect.
> > > 2016-02-05 03:20:28,825 DEBUG [cloud.agent.Agent] (UgentTask-7:null)
> > > Clearing watch list: 2
> > > 2016-02-05 03:20:28,825 DEBUG [utils.nio.NioConnection]
> > > (Agent-Selector:null) Closing socket
> > > Socket[addr=/*.*.3.3,port=8250,localport=50489]
> > > 2016-02-05 03:20:33,825 INFO  [cloud.agent.Agent] (UgentTask-7:null) Lost
> > > connection to the server. Dealing with the remaining commands...
> > > 2016-02-05 03:20:38,826 INFO  [cloud.agent.Agent] (UgentTask-7:null)
> > > Reconnecting...
> > > 2016-02-05 03:20:38,829 INFO  [utils.nio.NioClient] (Agent-Selector:null)
> > > Connecting to *.*.3.3:8250
> > > 2016-02-05 03:20:38,925 INFO  [utils.nio.NioClient] (Agent-Selector:null)
> > > SSL: Handshake done
> > > 2016-02-05 03:20:38,926 INFO  [utils.nio.NioClient] (Agent-Selector:null)
> > > Connected to *.*.3.3:8250
> > > 2016-02-05 03:20:43,926 INFO  [cloud.agent.Agent] (UgentTask-7:null)
> > > Connected to the server
> > > ===
> > >
> > > Sometimes, the Cloudstack agent will not be able to re-connect unless if
> > we
> > > stop and start the agent again manually:
> > >
> > > ===
> > > 2016-02-05 03:22:20,330 ERROR [cloud.agent.Agent] (UgentTask-6:null) Ping
> > > Interval has gone past 300000.  Attempting to reconnect.
> > > 2016-02-05 03:22:20,331 DEBUG [cloud.agent.Agent] (UgentTask-6:null)
> > > Clearing watch list: 2
> > > 2016-02-05 03:22:20,353 DEBUG [utils.nio.NioConnection]
> > > (Agent-Selector:null) Closing socket
> > > Socket[addr=/*.*.3.3,port=8250,localport=46231]
> > > 2016-02-05 03:22:25,332 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> > > connection to the server. Dealing with the remaining commands...
> > > 2016-02-05 03:22:25,332 INFO  [cloud.agent.Agent] (UgentTask-6:null)
> > Cannot
> > > connect because we still have 3 commands in progress.
> > > 2016-02-05 03:22:30,333 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> > > connection to the server. Dealing with the remaining commands...
> > > 2016-02-05 03:22:30,333 INFO  [cloud.agent.Agent] (UgentTask-6:null)
> > Cannot
> > > connect because we still have 3 commands in progress.
> > > 2016-02-05 03:22:35,333 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> > > connection to the server. Dealing with the remaining commands...
> > > 2016-02-05 03:22:35,334 INFO  [cloud.agent.Agent] (UgentTask-6:null)
> > Cannot
> > > connect because we still have 3 commands in progress.
> > > 2016-02-05 03:22:40,334 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> > > connection to the server. Dealing with the remaining commands...
> > > 2016-02-05 03:22:40,335 INFO  [cloud.agent.Agent] (UgentTask-6:null)
> > Cannot
> > > connect because we still have 3 commands in progress.
> > > 2016-02-05 03:22:45,335 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> > > connection to the server. Dealing with the remaining commands...
> > > 2016-02-05 03:22:45,335 INFO  [cloud.agent.Agent] (UgentTask-6:null)
> > Cannot
> > > connect because we still have 3 commands in progress.
> > > 2016-02-05 03:22:50,336 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> > > connection to the server. Dealing with the remaining commands...
> > > 2016-02-05 03:22:50,336 INFO  [cloud.agent.Agent] (UgentTask-6:null)
> > Cannot
> > > connect because we still have 3 commands in progress.
> > > 2016-02-05 03:22:55,337 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> > > connection to the server. Dealing with the remaining commands...
> > > 2016-02-05 03:22:55,337 INFO  [cloud.agent.Agent] (UgentTask-6:null)
> > Cannot
> > > connect because we still have 3 commands in progress.
> > > 2016-02-05 03:23:00,338 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> > > connection to the server. Dealing with the remaining commands...
> > > 2016-02-05 03:23:00,339 INFO  [cloud.agent.Agent] (UgentTask-6:null)
> > Cannot
> > > connect because we still have 3 commands in progress.
> > > 2016-02-05 03:23:05,339 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> > > connection to the server. Dealing with the remaining commands...
> > > 2016-02-05 03:23:05,340 INFO  [cloud.agent.Agent] (UgentTask-6:null)
> > Cannot
> > > connect because we still have 3 commands in progress.
> > > 2016-02-05 03:23:10,340 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> > > connection to the server. Dealing with the remaining commands...
> > > 2016-02-05 03:23:10,340 INFO  [cloud.agent.Agent] (UgentTask-6:null)
> > Cannot
> > > connect because we still have 3 commands in progress.
> > > 2016-02-05 03:23:15,341 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> > > connection to the server. Dealing with the remaining commands...
> > > 2016-02-05 03:23:15,341 INFO  [cloud.agent.Agent] (UgentTask-6:null)
> > Cannot
> > > connect because we still have 3 commands in progress.
> > > 2016-02-05 03:23:16,945 INFO  [cloud.agent.Agent]
> > > (AgentShutdownThread:null) Stopping the agent: Reason = sig.kill
> > > 2016-02-05 03:23:16,946 DEBUG [cloud.agent.Agent]
> > > (AgentShutdownThread:null) Sending shutdown to management server
> > > 2016-02-05 03:23:17,949 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> > > connection to the server. Dealing with the remaining commands...
> > > 2016-02-05 03:23:17,949 INFO  [cloud.agent.Agent] (UgentTask-6:null)
> > Cannot
> > > connect because we still have 3 commands in progress.
> > > ===
> > >
> > > During the time when the problem happens, I can confirm that the CPU and
> > > memory load on the server is healthy and not overloaded. I understand
> > that
> > > such problem can happen when the hypervisor host is overloaded, however
> > > this is not the case -- especially since when such problem happens now,
> > it
> > > will strike up to 4-5 hypervisor hosts at the same time.
> > >
> > > On the management server, here are the typical logs:
> > >
> > > ===
> > > 2016-02-05 03:20:28,828 INFO  [agent.manager.AgentManagerImpl]
> > > (AgentTaskPool-10:null) Investigating why host 72 has disconnected with
> > > event AgentDisconnected
> > > 2016-02-05 03:20:28,828 DEBUG [agent.manager.AgentManagerImpl]
> > > (AgentTaskPool-10:null) checking if agent (72) is alive
> > > 2016-02-05 03:20:28,833 DEBUG [agent.transport.Request]
> > > (AgentTaskPool-10:null) Seq 72-215557373: Sending  { Cmd , MgmtId:
> > > 161342671900, via: 72, Ver: v1, Flags: 100011
> > > , [{"com.cloud.agent.api.CheckHealthCommand":{"wait":50}}] }
> > > 2016-02-05 03:20:28,833 INFO  [agent.manager.AgentAttache]
> > > (AgentTaskPool-10:null) Seq 72-215557373: Unable to send due to Resource
> > > [Host:72] is unreachable: Host 72: C
> > > hannel is closed
> > > 2016-02-05 03:20:28,833 DEBUG [agent.manager.AgentAttache]
> > > (AgentTaskPool-10:null) Seq 72-215557373: Cancelling.
> > > 2016-02-05 03:20:28,833 WARN  [agent.manager.AgentManagerImpl]
> > > (AgentTaskPool-10:null) Resource [Host:72] is unreachable: Host 72:
> > Channel
> > > is closed
> > > 2016-02-05 03:20:28,836 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
> > > (AgentTaskPool-10:null) SimpleInvestigator unable to determine the state
> > of
> > > the host.  Moving on.
> > > 2016-02-05 03:20:28,836 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
> > > (AgentTaskPool-10:null) XenServerInvestigator unable to determine the
> > state
> > > of the host.  Moving on
> > > .
> > > 2016-02-05 03:20:28,836 DEBUG [cloud.ha.UserVmDomRInvestigator]
> > > (AgentTaskPool-10:null) checking if agent (72) is alive
> > > 2016-02-05 03:20:28,840 DEBUG [cloud.ha.UserVmDomRInvestigator]
> > > (AgentTaskPool-10:null) sending ping from (34) to agent's host ip address
> > > (*.*.3.25)
> > > 2016-02-05 03:20:28,843 DEBUG [agent.transport.Request]
> > > (AgentTaskPool-10:null) Seq 34-815540048: Sending  { Cmd , MgmtId:
> > > 161342671900, via: 34, Ver: v1, Flags: 100011
> > > ,
> > >
> > [{"com.cloud.agent.api.PingTestCommand":{"_computingHostIp":"*.*.3.25","wait":20}}]
> > > }
> > > 2016-02-05 03:20:29,024 DEBUG [agent.transport.Request]
> > > (AgentManager-Handler-14:null) Seq 34-815540048: Processing:  { Ans: ,
> > > MgmtId: 161342671900, via: 34, Ver: v1, F
> > > lags: 10, [{"com.cloud.agent.api.Answer":{"result":true,"wait":0}}] }
> > > 2016-02-05 03:20:29,024 DEBUG [agent.transport.Request]
> > > (AgentTaskPool-10:null) Seq 34-815540048: Received:  { Ans: , MgmtId:
> > > 161342671900, via: 34, Ver: v1, Flags: 10,
> > >  { Answer } }
> > > 2016-02-05 03:20:29,024 DEBUG [cloud.ha.AbstractInvestigatorImpl]
> > > (AgentTaskPool-10:null) host (*.*.3.25) has been successfully pinged,
> > > returning that host is up
> > > 2016-02-05 03:20:29,024 DEBUG [cloud.ha.UserVmDomRInvestigator]
> > > (AgentTaskPool-10:null) ping from (34) to agent's host ip address
> > > (*.*.3.25) successful, returning th
> > > at agent is disconnected
> > > 2016-02-05 03:20:29,024 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
> > > (AgentTaskPool-10:null) null was able to determine host 72 is in
> > > Disconnected
> > > 2016-02-05 03:20:29,024 INFO  [agent.manager.AgentManagerImpl]
> > > (AgentTaskPool-10:null) The state determined is Disconnected
> > > 2016-02-05 03:20:29,024 WARN  [agent.manager.AgentManagerImpl]
> > > (AgentTaskPool-10:null) Agent is disconnected but the host is still up:
> > > 72-**-kvm-05
> > > 2016-02-05 03:20:29,025 WARN  [apache.cloudstack.alerts]
> > > (AgentTaskPool-10:null)  alertType:: 7 // dataCenterId:: 6 // podId:: 6
> > //
> > > clusterId:: null // message:: Host disconnected, name: **-kvm-05 (id:72),
> > > availability zone: *****-******-01, pod: *****-******-Pod-01
> > > 2016-02-05 03:20:29,086 INFO  [agent.manager.AgentManagerImpl]
> > > (AgentTaskPool-10:null) Host 72 is disconnecting with event
> > > AgentDisconnected
> > > 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
> > > (AgentTaskPool-10:null) The next status of agent 72is Alert, current
> > status
> > > is Up
> > > 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
> > > (AgentTaskPool-10:null) Deregistering link for 72 with state Alert
> > > 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
> > > (AgentTaskPool-10:null) Remove Agent : 72
> > > 2016-02-05 03:20:29,089 DEBUG [agent.manager.ConnectedAgentAttache]
> > > (AgentTaskPool-10:null) Processing Disconnect.
> > > 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentAttache]
> > > (AgentTaskPool-10:null) Seq 72-215547906: Sending disconnect to class
> > > com.cloud.network.security.SecurityGroupListener
> > > 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
> > > (AgentTaskPool-10:null) Sending Disconnect to listener:
> > >
> > com.cloud.hypervisor.xen.discoverer.XcpServerDiscoverer_EnhancerByCloudStack_8f0e0536
> > > 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
> > > (AgentTaskPool-10:null) Sending Disconnect to listener:
> > >
> > com.cloud.deploy.DeploymentPlanningManagerImpl_EnhancerByCloudStack_55d28fac
> > > 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
> > > (AgentTaskPool-10:null) Sending Disconnect to listener:
> > > com.cloud.network.NetworkManagerImpl_EnhancerByCloudStack_6763a13f
> > > 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
> > > (AgentTaskPool-10:null) Sending Disconnect to listener:
> > > com.cloud.storage.secondary.SecondaryStorageListener
> > > ===
> > >
> > > Can anyone advise how can I start troubleshooting to find the root cause
> > of
> > > the problem? So that I can take whatever measures to prevent similar
> > > problem from happening.
> > >
> > > Any advice is greatly appreciated.
> > >
> > > Looking forward to your reply, thank you.
> > >
> > > Cheers.
> > >
> > > -ip-
> >

> > > connection to the server. Dealing with the remaining commands...
> > > 2016-02-05 03:23:05,340 INFO  [cloud.agent.Agent] (UgentTask-6:null)
> > Cannot
> > > connect because we still have 3 commands in progress.
> > > 2016-02-05 03:23:10,340 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> > > connection to the server. Dealing with the remaining commands...
> > > 2016-02-05 03:23:10,340 INFO  [cloud.agent.Agent] (UgentTask-6:null)
> > Cannot
> > > connect because we still have 3 commands in progress.
> > > 2016-02-05 03:23:15,341 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> > > connection to the server. Dealing with the remaining commands...
> > > 2016-02-05 03:23:15,341 INFO  [cloud.agent.Agent] (UgentTask-6:null)
> > Cannot
> > > connect because we still have 3 commands in progress.
> > > 2016-02-05 03:23:16,945 INFO  [cloud.agent.Agent]
> > > (AgentShutdownThread:null) Stopping the agent: Reason = sig.kill
> > > 2016-02-05 03:23:16,946 DEBUG [cloud.agent.Agent]
> > > (AgentShutdownThread:null) Sending shutdown to management server
> > > 2016-02-05 03:23:17,949 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> > > connection to the server. Dealing with the remaining commands...
> > > 2016-02-05 03:23:17,949 INFO  [cloud.agent.Agent] (UgentTask-6:null)
> > Cannot
> > > connect because we still have 3 commands in progress.
> > > ===
> > >
> > > During the time when the problem happens, I can confirm that the CPU and
> > > memory load on the server is healthy and not overloaded. I understand
> > that
> > > such problem can happen when the hypervisor host is overloaded, however
> > > this is not the case -- especially since when such problem happens now,
> > it
> > > will strike up to 4-5 hypervisor hosts at the same time.
> > >
> > > On the management server, here are the typical logs:
> > >
> > > ===
> > > 2016-02-05 03:20:28,828 INFO  [agent.manager.AgentManagerImpl]
> > > (AgentTaskPool-10:null) Investigating why host 72 has disconnected with
> > > event AgentDisconnected
> > > 2016-02-05 03:20:28,828 DEBUG [agent.manager.AgentManagerImpl]
> > > (AgentTaskPool-10:null) checking if agent (72) is alive
> > > 2016-02-05 03:20:28,833 DEBUG [agent.transport.Request]
> > > (AgentTaskPool-10:null) Seq 72-215557373: Sending  { Cmd , MgmtId:
> > > 161342671900, via: 72, Ver: v1, Flags: 100011
> > > , [{"com.cloud.agent.api.CheckHealthCommand":{"wait":50}}] }
> > > 2016-02-05 03:20:28,833 INFO  [agent.manager.AgentAttache]
> > > (AgentTaskPool-10:null) Seq 72-215557373: Unable to send due to Resource
> > > [Host:72] is unreachable: Host 72: C
> > > hannel is closed
> > > 2016-02-05 03:20:28,833 DEBUG [agent.manager.AgentAttache]
> > > (AgentTaskPool-10:null) Seq 72-215557373: Cancelling.
> > > 2016-02-05 03:20:28,833 WARN  [agent.manager.AgentManagerImpl]
> > > (AgentTaskPool-10:null) Resource [Host:72] is unreachable: Host 72:
> > Channel
> > > is closed
> > > 2016-02-05 03:20:28,836 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
> > > (AgentTaskPool-10:null) SimpleInvestigator unable to determine the state
> > of
> > > the host.  Moving on.
> > > 2016-02-05 03:20:28,836 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
> > > (AgentTaskPool-10:null) XenServerInvestigator unable to determine the
> > state
> > > of the host.  Moving on
> > > .
> > > 2016-02-05 03:20:28,836 DEBUG [cloud.ha.UserVmDomRInvestigator]
> > > (AgentTaskPool-10:null) checking if agent (72) is alive
> > > 2016-02-05 03:20:28,840 DEBUG [cloud.ha.UserVmDomRInvestigator]
> > > (AgentTaskPool-10:null) sending ping from (34) to agent's host ip address
> > > (*.*.3.25)
> > > 2016-02-05 03:20:28,843 DEBUG [agent.transport.Request]
> > > (AgentTaskPool-10:null) Seq 34-815540048: Sending  { Cmd , MgmtId:
> > > 161342671900, via: 34, Ver: v1, Flags: 100011
> > > ,
> > >
> > [{"com.cloud.agent.api.PingTestCommand":{"_computingHostIp":"*.*.3.25","wait":20}}]
> > > }
> > > 2016-02-05 03:20:29,024 DEBUG [agent.transport.Request]
> > > (AgentManager-Handler-14:null) Seq 34-815540048: Processing:  { Ans: ,
> > > MgmtId: 161342671900, via: 34, Ver: v1, F
> > > lags: 10, [{"com.cloud.agent.api.Answer":{"result":true,"wait":0}}] }
> > > 2016-02-05 03:20:29,024 DEBUG [agent.transport.Request]
> > > (AgentTaskPool-10:null) Seq 34-815540048: Received:  { Ans: , MgmtId:
> > > 161342671900, via: 34, Ver: v1, Flags: 10,
> > >  { Answer } }
> > > 2016-02-05 03:20:29,024 DEBUG [cloud.ha.AbstractInvestigatorImpl]
> > > (AgentTaskPool-10:null) host (*.*.3.25) has been successfully pinged,
> > > returning that host is up
> > > 2016-02-05 03:20:29,024 DEBUG [cloud.ha.UserVmDomRInvestigator]
> > > (AgentTaskPool-10:null) ping from (34) to agent's host ip address
> > > (*.*.3.25) successful, returning th
> > > at agent is disconnected
> > > 2016-02-05 03:20:29,024 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
> > > (AgentTaskPool-10:null) null was able to determine host 72 is in
> > > Disconnected
> > > 2016-02-05 03:20:29,024 INFO  [agent.manager.AgentManagerImpl]
> > > (AgentTaskPool-10:null) The state determined is Disconnected
> > > 2016-02-05 03:20:29,024 WARN  [agent.manager.AgentManagerImpl]
> > > (AgentTaskPool-10:null) Agent is disconnected but the host is still up:
> > > 72-**-kvm-05
> > > 2016-02-05 03:20:29,025 WARN  [apache.cloudstack.alerts]
> > > (AgentTaskPool-10:null)  alertType:: 7 // dataCenterId:: 6 // podId:: 6
> > //
> > > clusterId:: null // message:: Host disconnected, name: **-kvm-05 (id:72),
> > > availability zone: *****-******-01, pod: *****-******-Pod-01
> > > 2016-02-05 03:20:29,086 INFO  [agent.manager.AgentManagerImpl]
> > > (AgentTaskPool-10:null) Host 72 is disconnecting with event
> > > AgentDisconnected
> > > 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
> > > (AgentTaskPool-10:null) The next status of agent 72is Alert, current
> > status
> > > is Up
> > > 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
> > > (AgentTaskPool-10:null) Deregistering link for 72 with state Alert
> > > 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
> > > (AgentTaskPool-10:null) Remove Agent : 72
> > > 2016-02-05 03:20:29,089 DEBUG [agent.manager.ConnectedAgentAttache]
> > > (AgentTaskPool-10:null) Processing Disconnect.
> > > 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentAttache]
> > > (AgentTaskPool-10:null) Seq 72-215547906: Sending disconnect to class
> > > com.cloud.network.security.SecurityGroupListener
> > > 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
> > > (AgentTaskPool-10:null) Sending Disconnect to listener:
> > >
> > com.cloud.hypervisor.xen.discoverer.XcpServerDiscoverer_EnhancerByCloudStack_8f0e0536
> > > 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
> > > (AgentTaskPool-10:null) Sending Disconnect to listener:
> > >
> > com.cloud.deploy.DeploymentPlanningManagerImpl_EnhancerByCloudStack_55d28fac
> > > 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
> > > (AgentTaskPool-10:null) Sending Disconnect to listener:
> > > com.cloud.network.NetworkManagerImpl_EnhancerByCloudStack_6763a13f
> > > 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
> > > (AgentTaskPool-10:null) Sending Disconnect to listener:
> > > com.cloud.storage.secondary.SecondaryStorageListener
> > > ===
> > >
> > > Can anyone advise how can I start troubleshooting to find the root cause
> > of
> > > the problem? So that I can take whatever measures to prevent similar
> > > problem from happening.
> > >
> > > Any advice is greatly appreciated.
> > >
> > > Looking forward to your reply, thank you.
> > >
> > > Cheers.
> > >
> > > -ip-
> >

Re: KVM host agent disconnection

Posted by Indra Pramana <in...@sg.or.id>.
Hi Wido,

Good day to you, and thanks for your reply. Nice to hear from you again. :)

So is this caused by a bug in 4.2 which has been resolved in a newer version of
ACS? Is there any specific information on the bug, e.g. a bug ID and a
description of the fix? Is there a way I can resolve the problem without having
to upgrade?

Is there any documentation I can follow on how to upgrade from 4.2 to 4.8?
Will this be quite straightforward, or will it involve many steps? We are
running a production environment and we don't have a staging / test
environment to play with.

Looking forward to your reply, thank you.

Cheers.

On Sat, Feb 6, 2016 at 3:48 PM, Wido den Hollander <wi...@widodh.nl> wrote:

> Hi,
>
> > On 5 February 2016 at 17:24, Indra Pramana <in...@sg.or.id> wrote:
> >
> >
> > Dear all,
> >
> > We are using CloudStack 4.2.0, KVM hypervisor and Ceph RBD for primary
> > storage. In the past one week, many of our KVM host agents would often be
> > disconnected from the management server, causing the VMs to go down
> because
> > of HA work. While we used to have host disconnection in the past,
> normally
> > it would only affect just one host, but this time round, when the problem
> > happens, it would happen on multiple hosts, up to 4-5 hosts at the same
> > time.
> >
>
> Any reason to still run 4.2? I've seen this happen as well and I haven't
> seen
> this with recent versions of ACS.
>
> Could you maybe upgrade to 4.8?
>
> Wido
>
> > Nothing much I can find on both the management-server.log and agent.log,
> > with no significant warn, error or exceptions logged before the
> > disconnection. Here are the sample logs from the agent:
> >
> > ===
> > 2016-02-05 03:20:28,820 ERROR [cloud.agent.Agent] (UgentTask-7:null) Ping
> > Interval has gone past 300000.  Attempting to reconnect.
> > 2016-02-05 03:20:28,825 DEBUG [cloud.agent.Agent] (UgentTask-7:null)
> > Clearing watch list: 2
> > 2016-02-05 03:20:28,825 DEBUG [utils.nio.NioConnection]
> > (Agent-Selector:null) Closing socket
> > Socket[addr=/*.*.3.3,port=8250,localport=50489]
> > 2016-02-05 03:20:33,825 INFO  [cloud.agent.Agent] (UgentTask-7:null) Lost
> > connection to the server. Dealing with the remaining commands...
> > 2016-02-05 03:20:38,826 INFO  [cloud.agent.Agent] (UgentTask-7:null)
> > Reconnecting...
> > 2016-02-05 03:20:38,829 INFO  [utils.nio.NioClient] (Agent-Selector:null)
> > Connecting to *.*.3.3:8250
> > 2016-02-05 03:20:38,925 INFO  [utils.nio.NioClient] (Agent-Selector:null)
> > SSL: Handshake done
> > 2016-02-05 03:20:38,926 INFO  [utils.nio.NioClient] (Agent-Selector:null)
> > Connected to *.*.3.3:8250
> > 2016-02-05 03:20:43,926 INFO  [cloud.agent.Agent] (UgentTask-7:null)
> > Connected to the server
> > ===
> >
> > Sometimes, the Cloudstack agent will not be able to re-connect unless if
> we
> > stop and start the agent again manually:
> >
> > ===
> > 2016-02-05 03:22:20,330 ERROR [cloud.agent.Agent] (UgentTask-6:null) Ping
> > Interval has gone past 300000.  Attempting to reconnect.
> > 2016-02-05 03:22:20,331 DEBUG [cloud.agent.Agent] (UgentTask-6:null)
> > Clearing watch list: 2
> > 2016-02-05 03:22:20,353 DEBUG [utils.nio.NioConnection]
> > (Agent-Selector:null) Closing socket
> > Socket[addr=/*.*.3.3,port=8250,localport=46231]
> > 2016-02-05 03:22:25,332 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> > connection to the server. Dealing with the remaining commands...
> > 2016-02-05 03:22:25,332 INFO  [cloud.agent.Agent] (UgentTask-6:null)
> Cannot
> > connect because we still have 3 commands in progress.
> > 2016-02-05 03:22:30,333 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> > connection to the server. Dealing with the remaining commands...
> > 2016-02-05 03:22:30,333 INFO  [cloud.agent.Agent] (UgentTask-6:null)
> Cannot
> > connect because we still have 3 commands in progress.
> > 2016-02-05 03:22:35,333 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> > connection to the server. Dealing with the remaining commands...
> > 2016-02-05 03:22:35,334 INFO  [cloud.agent.Agent] (UgentTask-6:null)
> Cannot
> > connect because we still have 3 commands in progress.
> > 2016-02-05 03:22:40,334 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> > connection to the server. Dealing with the remaining commands...
> > 2016-02-05 03:22:40,335 INFO  [cloud.agent.Agent] (UgentTask-6:null)
> Cannot
> > connect because we still have 3 commands in progress.
> > 2016-02-05 03:22:45,335 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> > connection to the server. Dealing with the remaining commands...
> > 2016-02-05 03:22:45,335 INFO  [cloud.agent.Agent] (UgentTask-6:null)
> Cannot
> > connect because we still have 3 commands in progress.
> > 2016-02-05 03:22:50,336 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> > connection to the server. Dealing with the remaining commands...
> > 2016-02-05 03:22:50,336 INFO  [cloud.agent.Agent] (UgentTask-6:null)
> Cannot
> > connect because we still have 3 commands in progress.
> > 2016-02-05 03:22:55,337 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> > connection to the server. Dealing with the remaining commands...
> > 2016-02-05 03:22:55,337 INFO  [cloud.agent.Agent] (UgentTask-6:null)
> Cannot
> > connect because we still have 3 commands in progress.
> > 2016-02-05 03:23:00,338 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> > connection to the server. Dealing with the remaining commands...
> > 2016-02-05 03:23:00,339 INFO  [cloud.agent.Agent] (UgentTask-6:null)
> Cannot
> > connect because we still have 3 commands in progress.
> > 2016-02-05 03:23:05,339 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> > connection to the server. Dealing with the remaining commands...
> > 2016-02-05 03:23:05,340 INFO  [cloud.agent.Agent] (UgentTask-6:null)
> Cannot
> > connect because we still have 3 commands in progress.
> > 2016-02-05 03:23:10,340 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> > connection to the server. Dealing with the remaining commands...
> > 2016-02-05 03:23:10,340 INFO  [cloud.agent.Agent] (UgentTask-6:null)
> Cannot
> > connect because we still have 3 commands in progress.
> > 2016-02-05 03:23:15,341 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> > connection to the server. Dealing with the remaining commands...
> > 2016-02-05 03:23:15,341 INFO  [cloud.agent.Agent] (UgentTask-6:null)
> Cannot
> > connect because we still have 3 commands in progress.
> > 2016-02-05 03:23:16,945 INFO  [cloud.agent.Agent]
> > (AgentShutdownThread:null) Stopping the agent: Reason = sig.kill
> > 2016-02-05 03:23:16,946 DEBUG [cloud.agent.Agent]
> > (AgentShutdownThread:null) Sending shutdown to management server
> > 2016-02-05 03:23:17,949 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> > connection to the server. Dealing with the remaining commands...
> > 2016-02-05 03:23:17,949 INFO  [cloud.agent.Agent] (UgentTask-6:null)
> Cannot
> > connect because we still have 3 commands in progress.
> > ===
> >
> > During the time when the problem happens, I can confirm that the CPU and
> > memory load on the server is healthy and not overloaded. I understand
> that
> > such problem can happen when the hypervisor host is overloaded, however
> > this is not the case -- especially since when such problem happens now,
> it
> > will strike up to 4-5 hypervisor hosts at the same time.
> >
> > On the management server, here are the typical logs:
> >
> > ===
> > 2016-02-05 03:20:28,828 INFO  [agent.manager.AgentManagerImpl]
> > (AgentTaskPool-10:null) Investigating why host 72 has disconnected with
> > event AgentDisconnected
> > 2016-02-05 03:20:28,828 DEBUG [agent.manager.AgentManagerImpl]
> > (AgentTaskPool-10:null) checking if agent (72) is alive
> > 2016-02-05 03:20:28,833 DEBUG [agent.transport.Request]
> > (AgentTaskPool-10:null) Seq 72-215557373: Sending  { Cmd , MgmtId:
> > 161342671900, via: 72, Ver: v1, Flags: 100011
> > , [{"com.cloud.agent.api.CheckHealthCommand":{"wait":50}}] }
> > 2016-02-05 03:20:28,833 INFO  [agent.manager.AgentAttache]
> > (AgentTaskPool-10:null) Seq 72-215557373: Unable to send due to Resource
> > [Host:72] is unreachable: Host 72: C
> > hannel is closed
> > 2016-02-05 03:20:28,833 DEBUG [agent.manager.AgentAttache]
> > (AgentTaskPool-10:null) Seq 72-215557373: Cancelling.
> > 2016-02-05 03:20:28,833 WARN  [agent.manager.AgentManagerImpl]
> > (AgentTaskPool-10:null) Resource [Host:72] is unreachable: Host 72:
> Channel
> > is closed
> > 2016-02-05 03:20:28,836 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
> > (AgentTaskPool-10:null) SimpleInvestigator unable to determine the state
> of
> > the host.  Moving on.
> > 2016-02-05 03:20:28,836 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
> > (AgentTaskPool-10:null) XenServerInvestigator unable to determine the
> state
> > of the host.  Moving on
> > .
> > 2016-02-05 03:20:28,836 DEBUG [cloud.ha.UserVmDomRInvestigator]
> > (AgentTaskPool-10:null) checking if agent (72) is alive
> > 2016-02-05 03:20:28,840 DEBUG [cloud.ha.UserVmDomRInvestigator]
> > (AgentTaskPool-10:null) sending ping from (34) to agent's host ip address
> > (*.*.3.25)
> > 2016-02-05 03:20:28,843 DEBUG [agent.transport.Request]
> > (AgentTaskPool-10:null) Seq 34-815540048: Sending  { Cmd , MgmtId:
> > 161342671900, via: 34, Ver: v1, Flags: 100011
> > ,
> >
> [{"com.cloud.agent.api.PingTestCommand":{"_computingHostIp":"*.*.3.25","wait":20}}]
> > }
> > 2016-02-05 03:20:29,024 DEBUG [agent.transport.Request]
> > (AgentManager-Handler-14:null) Seq 34-815540048: Processing:  { Ans: ,
> > MgmtId: 161342671900, via: 34, Ver: v1, F
> > lags: 10, [{"com.cloud.agent.api.Answer":{"result":true,"wait":0}}] }
> > 2016-02-05 03:20:29,024 DEBUG [agent.transport.Request]
> > (AgentTaskPool-10:null) Seq 34-815540048: Received:  { Ans: , MgmtId:
> > 161342671900, via: 34, Ver: v1, Flags: 10,
> >  { Answer } }
> > 2016-02-05 03:20:29,024 DEBUG [cloud.ha.AbstractInvestigatorImpl]
> > (AgentTaskPool-10:null) host (*.*.3.25) has been successfully pinged,
> > returning that host is up
> > 2016-02-05 03:20:29,024 DEBUG [cloud.ha.UserVmDomRInvestigator]
> > (AgentTaskPool-10:null) ping from (34) to agent's host ip address
> > (*.*.3.25) successful, returning th
> > at agent is disconnected
> > 2016-02-05 03:20:29,024 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
> > (AgentTaskPool-10:null) null was able to determine host 72 is in
> > Disconnected
> > 2016-02-05 03:20:29,024 INFO  [agent.manager.AgentManagerImpl]
> > (AgentTaskPool-10:null) The state determined is Disconnected
> > 2016-02-05 03:20:29,024 WARN  [agent.manager.AgentManagerImpl]
> > (AgentTaskPool-10:null) Agent is disconnected but the host is still up:
> > 72-**-kvm-05
> > 2016-02-05 03:20:29,025 WARN  [apache.cloudstack.alerts]
> > (AgentTaskPool-10:null)  alertType:: 7 // dataCenterId:: 6 // podId:: 6
> //
> > clusterId:: null // message:: Host disconnected, name: **-kvm-05 (id:72),
> > availability zone: *****-******-01, pod: *****-******-Pod-01
> > 2016-02-05 03:20:29,086 INFO  [agent.manager.AgentManagerImpl]
> > (AgentTaskPool-10:null) Host 72 is disconnecting with event
> > AgentDisconnected
> > 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
> > (AgentTaskPool-10:null) The next status of agent 72is Alert, current
> status
> > is Up
> > 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
> > (AgentTaskPool-10:null) Deregistering link for 72 with state Alert
> > 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
> > (AgentTaskPool-10:null) Remove Agent : 72
> > 2016-02-05 03:20:29,089 DEBUG [agent.manager.ConnectedAgentAttache]
> > (AgentTaskPool-10:null) Processing Disconnect.
> > 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentAttache]
> > (AgentTaskPool-10:null) Seq 72-215547906: Sending disconnect to class
> > com.cloud.network.security.SecurityGroupListener
> > 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
> > (AgentTaskPool-10:null) Sending Disconnect to listener:
> >
> com.cloud.hypervisor.xen.discoverer.XcpServerDiscoverer_EnhancerByCloudStack_8f0e0536
> > 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
> > (AgentTaskPool-10:null) Sending Disconnect to listener:
> >
> com.cloud.deploy.DeploymentPlanningManagerImpl_EnhancerByCloudStack_55d28fac
> > 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
> > (AgentTaskPool-10:null) Sending Disconnect to listener:
> > com.cloud.network.NetworkManagerImpl_EnhancerByCloudStack_6763a13f
> > 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
> > (AgentTaskPool-10:null) Sending Disconnect to listener:
> > com.cloud.storage.secondary.SecondaryStorageListener
> > ===
> >
> > Can anyone advise how can I start troubleshooting to find the root cause
> of
> > the problem? So that I can take whatever measures to prevent similar
> > problem from happening.
> >
> > Any advice is greatly appreciated.
> >
> > Looking forward to your reply, thank you.
> >
> > Cheers.
> >
> > -ip-
>

Re: KVM host agent disconnection

Posted by Wido den Hollander <wi...@widodh.nl>.
Hi,

> On 5 February 2016 at 17:24, Indra Pramana <in...@sg.or.id> wrote:
> 
> 
> Dear all,
> 
> We are using CloudStack 4.2.0, KVM hypervisor and Ceph RBD for primary
> storage. In the past one week, many of our KVM host agents would often be
> disconnected from the management server, causing the VMs to go down because
> of HA work. While we used to have host disconnection in the past, normally
> it would only affect just one host, but this time round, when the problem
> happens, it would happen on multiple hosts, up to 4-5 hosts at the same
> time.
> 

Any reason to still run 4.2? I've seen this happen as well and I haven't seen
this with recent versions of ACS.

Could you maybe upgrade to 4.8?

Wido

> Nothing much I can find on both the management-server.log and agent.log,
> with no significant warn, error or exceptions logged before the
> disconnection. Here are the sample logs from the agent:
> 
> ===
> 2016-02-05 03:20:28,820 ERROR [cloud.agent.Agent] (UgentTask-7:null) Ping
> Interval has gone past 300000.  Attempting to reconnect.
> 2016-02-05 03:20:28,825 DEBUG [cloud.agent.Agent] (UgentTask-7:null)
> Clearing watch list: 2
> 2016-02-05 03:20:28,825 DEBUG [utils.nio.NioConnection]
> (Agent-Selector:null) Closing socket
> Socket[addr=/*.*.3.3,port=8250,localport=50489]
> 2016-02-05 03:20:33,825 INFO  [cloud.agent.Agent] (UgentTask-7:null) Lost
> connection to the server. Dealing with the remaining commands...
> 2016-02-05 03:20:38,826 INFO  [cloud.agent.Agent] (UgentTask-7:null)
> Reconnecting...
> 2016-02-05 03:20:38,829 INFO  [utils.nio.NioClient] (Agent-Selector:null)
> Connecting to *.*.3.3:8250
> 2016-02-05 03:20:38,925 INFO  [utils.nio.NioClient] (Agent-Selector:null)
> SSL: Handshake done
> 2016-02-05 03:20:38,926 INFO  [utils.nio.NioClient] (Agent-Selector:null)
> Connected to *.*.3.3:8250
> 2016-02-05 03:20:43,926 INFO  [cloud.agent.Agent] (UgentTask-7:null)
> Connected to the server
> ===
> 
> Sometimes, the Cloudstack agent will not be able to re-connect unless if we
> stop and start the agent again manually:
> 
> ===
> 2016-02-05 03:22:20,330 ERROR [cloud.agent.Agent] (UgentTask-6:null) Ping
> Interval has gone past 300000.  Attempting to reconnect.
> 2016-02-05 03:22:20,331 DEBUG [cloud.agent.Agent] (UgentTask-6:null)
> Clearing watch list: 2
> 2016-02-05 03:22:20,353 DEBUG [utils.nio.NioConnection]
> (Agent-Selector:null) Closing socket
> Socket[addr=/*.*.3.3,port=8250,localport=46231]
> 2016-02-05 03:22:25,332 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> connection to the server. Dealing with the remaining commands...
> 2016-02-05 03:22:25,332 INFO  [cloud.agent.Agent] (UgentTask-6:null) Cannot
> connect because we still have 3 commands in progress.
> 2016-02-05 03:22:30,333 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> connection to the server. Dealing with the remaining commands...
> 2016-02-05 03:22:30,333 INFO  [cloud.agent.Agent] (UgentTask-6:null) Cannot
> connect because we still have 3 commands in progress.
> 2016-02-05 03:22:35,333 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> connection to the server. Dealing with the remaining commands...
> 2016-02-05 03:22:35,334 INFO  [cloud.agent.Agent] (UgentTask-6:null) Cannot
> connect because we still have 3 commands in progress.
> 2016-02-05 03:22:40,334 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> connection to the server. Dealing with the remaining commands...
> 2016-02-05 03:22:40,335 INFO  [cloud.agent.Agent] (UgentTask-6:null) Cannot
> connect because we still have 3 commands in progress.
> 2016-02-05 03:22:45,335 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> connection to the server. Dealing with the remaining commands...
> 2016-02-05 03:22:45,335 INFO  [cloud.agent.Agent] (UgentTask-6:null) Cannot
> connect because we still have 3 commands in progress.
> 2016-02-05 03:22:50,336 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> connection to the server. Dealing with the remaining commands...
> 2016-02-05 03:22:50,336 INFO  [cloud.agent.Agent] (UgentTask-6:null) Cannot
> connect because we still have 3 commands in progress.
> 2016-02-05 03:22:55,337 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> connection to the server. Dealing with the remaining commands...
> 2016-02-05 03:22:55,337 INFO  [cloud.agent.Agent] (UgentTask-6:null) Cannot
> connect because we still have 3 commands in progress.
> 2016-02-05 03:23:00,338 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> connection to the server. Dealing with the remaining commands...
> 2016-02-05 03:23:00,339 INFO  [cloud.agent.Agent] (UgentTask-6:null) Cannot
> connect because we still have 3 commands in progress.
> 2016-02-05 03:23:05,339 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> connection to the server. Dealing with the remaining commands...
> 2016-02-05 03:23:05,340 INFO  [cloud.agent.Agent] (UgentTask-6:null) Cannot
> connect because we still have 3 commands in progress.
> 2016-02-05 03:23:10,340 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> connection to the server. Dealing with the remaining commands...
> 2016-02-05 03:23:10,340 INFO  [cloud.agent.Agent] (UgentTask-6:null) Cannot
> connect because we still have 3 commands in progress.
> 2016-02-05 03:23:15,341 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> connection to the server. Dealing with the remaining commands...
> 2016-02-05 03:23:15,341 INFO  [cloud.agent.Agent] (UgentTask-6:null) Cannot
> connect because we still have 3 commands in progress.
> 2016-02-05 03:23:16,945 INFO  [cloud.agent.Agent]
> (AgentShutdownThread:null) Stopping the agent: Reason = sig.kill
> 2016-02-05 03:23:16,946 DEBUG [cloud.agent.Agent]
> (AgentShutdownThread:null) Sending shutdown to management server
> 2016-02-05 03:23:17,949 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> connection to the server. Dealing with the remaining commands...
> 2016-02-05 03:23:17,949 INFO  [cloud.agent.Agent] (UgentTask-6:null) Cannot
> connect because we still have 3 commands in progress.
> ===
> 
> During the time when the problem happens, I can confirm that the CPU and
> memory load on the server is healthy and not overloaded. I understand that
> such problem can happen when the hypervisor host is overloaded, however
> this is not the case -- especially since when such problem happens now, it
> will strike up to 4-5 hypervisor hosts at the same time.
> 
> On the management server, here are the typical logs:
> 
> ===
> 2016-02-05 03:20:28,828 INFO  [agent.manager.AgentManagerImpl]
> (AgentTaskPool-10:null) Investigating why host 72 has disconnected with
> event AgentDisconnected
> 2016-02-05 03:20:28,828 DEBUG [agent.manager.AgentManagerImpl]
> (AgentTaskPool-10:null) checking if agent (72) is alive
> 2016-02-05 03:20:28,833 DEBUG [agent.transport.Request]
> (AgentTaskPool-10:null) Seq 72-215557373: Sending  { Cmd , MgmtId:
> 161342671900, via: 72, Ver: v1, Flags: 100011
> , [{"com.cloud.agent.api.CheckHealthCommand":{"wait":50}}] }
> 2016-02-05 03:20:28,833 INFO  [agent.manager.AgentAttache]
> (AgentTaskPool-10:null) Seq 72-215557373: Unable to send due to Resource
> [Host:72] is unreachable: Host 72: C
> hannel is closed
> 2016-02-05 03:20:28,833 DEBUG [agent.manager.AgentAttache]
> (AgentTaskPool-10:null) Seq 72-215557373: Cancelling.
> 2016-02-05 03:20:28,833 WARN  [agent.manager.AgentManagerImpl]
> (AgentTaskPool-10:null) Resource [Host:72] is unreachable: Host 72: Channel
> is closed
> 2016-02-05 03:20:28,836 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
> (AgentTaskPool-10:null) SimpleInvestigator unable to determine the state of
> the host.  Moving on.
> 2016-02-05 03:20:28,836 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
> (AgentTaskPool-10:null) XenServerInvestigator unable to determine the state
> of the host.  Moving on
> .
> 2016-02-05 03:20:28,836 DEBUG [cloud.ha.UserVmDomRInvestigator]
> (AgentTaskPool-10:null) checking if agent (72) is alive
> 2016-02-05 03:20:28,840 DEBUG [cloud.ha.UserVmDomRInvestigator]
> (AgentTaskPool-10:null) sending ping from (34) to agent's host ip address
> (*.*.3.25)
> 2016-02-05 03:20:28,843 DEBUG [agent.transport.Request]
> (AgentTaskPool-10:null) Seq 34-815540048: Sending  { Cmd , MgmtId:
> 161342671900, via: 34, Ver: v1, Flags: 100011
> ,
> [{"com.cloud.agent.api.PingTestCommand":{"_computingHostIp":"*.*.3.25","wait":20}}]
> }
> 2016-02-05 03:20:29,024 DEBUG [agent.transport.Request]
> (AgentManager-Handler-14:null) Seq 34-815540048: Processing:  { Ans: ,
> MgmtId: 161342671900, via: 34, Ver: v1, F
> lags: 10, [{"com.cloud.agent.api.Answer":{"result":true,"wait":0}}] }
> 2016-02-05 03:20:29,024 DEBUG [agent.transport.Request]
> (AgentTaskPool-10:null) Seq 34-815540048: Received:  { Ans: , MgmtId:
> 161342671900, via: 34, Ver: v1, Flags: 10,
>  { Answer } }
> 2016-02-05 03:20:29,024 DEBUG [cloud.ha.AbstractInvestigatorImpl]
> (AgentTaskPool-10:null) host (*.*.3.25) has been successfully pinged,
> returning that host is up
> 2016-02-05 03:20:29,024 DEBUG [cloud.ha.UserVmDomRInvestigator]
> (AgentTaskPool-10:null) ping from (34) to agent's host ip address
> (*.*.3.25) successful, returning th
> at agent is disconnected
> 2016-02-05 03:20:29,024 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
> (AgentTaskPool-10:null) null was able to determine host 72 is in
> Disconnected
> 2016-02-05 03:20:29,024 INFO  [agent.manager.AgentManagerImpl]
> (AgentTaskPool-10:null) The state determined is Disconnected
> 2016-02-05 03:20:29,024 WARN  [agent.manager.AgentManagerImpl]
> (AgentTaskPool-10:null) Agent is disconnected but the host is still up:
> 72-**-kvm-05
> 2016-02-05 03:20:29,025 WARN  [apache.cloudstack.alerts]
> (AgentTaskPool-10:null)  alertType:: 7 // dataCenterId:: 6 // podId:: 6 //
> clusterId:: null // message:: Host disconnected, name: **-kvm-05 (id:72),
> availability zone: *****-******-01, pod: *****-******-Pod-01
> 2016-02-05 03:20:29,086 INFO  [agent.manager.AgentManagerImpl]
> (AgentTaskPool-10:null) Host 72 is disconnecting with event
> AgentDisconnected
> 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
> (AgentTaskPool-10:null) The next status of agent 72is Alert, current status
> is Up
> 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
> (AgentTaskPool-10:null) Deregistering link for 72 with state Alert
> 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
> (AgentTaskPool-10:null) Remove Agent : 72
> 2016-02-05 03:20:29,089 DEBUG [agent.manager.ConnectedAgentAttache]
> (AgentTaskPool-10:null) Processing Disconnect.
> 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentAttache]
> (AgentTaskPool-10:null) Seq 72-215547906: Sending disconnect to class
> com.cloud.network.security.SecurityGroupListener
> 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
> (AgentTaskPool-10:null) Sending Disconnect to listener:
> com.cloud.hypervisor.xen.discoverer.XcpServerDiscoverer_EnhancerByCloudStack_8f0e0536
> 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
> (AgentTaskPool-10:null) Sending Disconnect to listener:
> com.cloud.deploy.DeploymentPlanningManagerImpl_EnhancerByCloudStack_55d28fac
> 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
> (AgentTaskPool-10:null) Sending Disconnect to listener:
> com.cloud.network.NetworkManagerImpl_EnhancerByCloudStack_6763a13f
> 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
> (AgentTaskPool-10:null) Sending Disconnect to listener:
> com.cloud.storage.secondary.SecondaryStorageListener
> ===
> 
> Can anyone advise how can I start troubleshooting to find the root cause of
> the problem? So that I can take whatever measures to prevent similar
> problem from happening.
> 
> Any advice is greatly appreciated.
> 
> Looking forward to your reply, thank you.
> 
> Cheers.
> 
> -ip-

Re: KVM host agent disconnection

Posted by Wido den Hollander <wi...@widodh.nl>.
Hi,

> Op 5 februari 2016 om 17:24 schreef Indra Pramana <in...@sg.or.id>:
> 
> 
> Dear all,
> 
> We are using CloudStack 4.2.0, KVM hypervisor and Ceph RBD for primary
> storage. In the past one week, many of our KVM host agents would often be
> disconnected from the management server, causing the VMs to go down because
> of HA work. While we used to have host disconnection in the past, normally
> it would only affect just one host, but this time round, when the problem
> happens, it would happen on multiple hosts, up to 4-5 hosts at the same
> time.
> 

Any reason to still run 4.2? I've seen this happen as well and I haven't seen
this with recent versions of ACS.

Could you maybe upgrade to 4.8?

Wido

> Nothing much I can find on both the management-server.log and agent.log,
> with no significant warn, error or exceptions logged before the
> disconnection. Here are the sample logs from the agent:
> 
> ===
> 2016-02-05 03:20:28,820 ERROR [cloud.agent.Agent] (UgentTask-7:null) Ping
> Interval has gone past 300000.  Attempting to reconnect.
> 2016-02-05 03:20:28,825 DEBUG [cloud.agent.Agent] (UgentTask-7:null)
> Clearing watch list: 2
> 2016-02-05 03:20:28,825 DEBUG [utils.nio.NioConnection]
> (Agent-Selector:null) Closing socket
> Socket[addr=/*.*.3.3,port=8250,localport=50489]
> 2016-02-05 03:20:33,825 INFO  [cloud.agent.Agent] (UgentTask-7:null) Lost
> connection to the server. Dealing with the remaining commands...
> 2016-02-05 03:20:38,826 INFO  [cloud.agent.Agent] (UgentTask-7:null)
> Reconnecting...
> 2016-02-05 03:20:38,829 INFO  [utils.nio.NioClient] (Agent-Selector:null)
> Connecting to *.*.3.3:8250
> 2016-02-05 03:20:38,925 INFO  [utils.nio.NioClient] (Agent-Selector:null)
> SSL: Handshake done
> 2016-02-05 03:20:38,926 INFO  [utils.nio.NioClient] (Agent-Selector:null)
> Connected to *.*.3.3:8250
> 2016-02-05 03:20:43,926 INFO  [cloud.agent.Agent] (UgentTask-7:null)
> Connected to the server
> ===
> 
> Sometimes, the Cloudstack agent will not be able to re-connect unless if we
> stop and start the agent again manually:
> 
> ===
> 2016-02-05 03:22:20,330 ERROR [cloud.agent.Agent] (UgentTask-6:null) Ping
> Interval has gone past 300000.  Attempting to reconnect.
> 2016-02-05 03:22:20,331 DEBUG [cloud.agent.Agent] (UgentTask-6:null)
> Clearing watch list: 2
> 2016-02-05 03:22:20,353 DEBUG [utils.nio.NioConnection]
> (Agent-Selector:null) Closing socket
> Socket[addr=/*.*.3.3,port=8250,localport=46231]
> 2016-02-05 03:22:25,332 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> connection to the server. Dealing with the remaining commands...
> 2016-02-05 03:22:25,332 INFO  [cloud.agent.Agent] (UgentTask-6:null) Cannot
> connect because we still have 3 commands in progress.
> 2016-02-05 03:22:30,333 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> connection to the server. Dealing with the remaining commands...
> 2016-02-05 03:22:30,333 INFO  [cloud.agent.Agent] (UgentTask-6:null) Cannot
> connect because we still have 3 commands in progress.
> 2016-02-05 03:22:35,333 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> connection to the server. Dealing with the remaining commands...
> 2016-02-05 03:22:35,334 INFO  [cloud.agent.Agent] (UgentTask-6:null) Cannot
> connect because we still have 3 commands in progress.
> 2016-02-05 03:22:40,334 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> connection to the server. Dealing with the remaining commands...
> 2016-02-05 03:22:40,335 INFO  [cloud.agent.Agent] (UgentTask-6:null) Cannot
> connect because we still have 3 commands in progress.
> 2016-02-05 03:22:45,335 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> connection to the server. Dealing with the remaining commands...
> 2016-02-05 03:22:45,335 INFO  [cloud.agent.Agent] (UgentTask-6:null) Cannot
> connect because we still have 3 commands in progress.
> 2016-02-05 03:22:50,336 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> connection to the server. Dealing with the remaining commands...
> 2016-02-05 03:22:50,336 INFO  [cloud.agent.Agent] (UgentTask-6:null) Cannot
> connect because we still have 3 commands in progress.
> 2016-02-05 03:22:55,337 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> connection to the server. Dealing with the remaining commands...
> 2016-02-05 03:22:55,337 INFO  [cloud.agent.Agent] (UgentTask-6:null) Cannot
> connect because we still have 3 commands in progress.
> 2016-02-05 03:23:00,338 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> connection to the server. Dealing with the remaining commands...
> 2016-02-05 03:23:00,339 INFO  [cloud.agent.Agent] (UgentTask-6:null) Cannot
> connect because we still have 3 commands in progress.
> 2016-02-05 03:23:05,339 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> connection to the server. Dealing with the remaining commands...
> 2016-02-05 03:23:05,340 INFO  [cloud.agent.Agent] (UgentTask-6:null) Cannot
> connect because we still have 3 commands in progress.
> 2016-02-05 03:23:10,340 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> connection to the server. Dealing with the remaining commands...
> 2016-02-05 03:23:10,340 INFO  [cloud.agent.Agent] (UgentTask-6:null) Cannot
> connect because we still have 3 commands in progress.
> 2016-02-05 03:23:15,341 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> connection to the server. Dealing with the remaining commands...
> 2016-02-05 03:23:15,341 INFO  [cloud.agent.Agent] (UgentTask-6:null) Cannot
> connect because we still have 3 commands in progress.
> 2016-02-05 03:23:16,945 INFO  [cloud.agent.Agent]
> (AgentShutdownThread:null) Stopping the agent: Reason = sig.kill
> 2016-02-05 03:23:16,946 DEBUG [cloud.agent.Agent]
> (AgentShutdownThread:null) Sending shutdown to management server
> 2016-02-05 03:23:17,949 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> connection to the server. Dealing with the remaining commands...
> 2016-02-05 03:23:17,949 INFO  [cloud.agent.Agent] (UgentTask-6:null) Cannot
> connect because we still have 3 commands in progress.
> ===
> 
> During the time when the problem happens, I can confirm that the CPU and
> memory load on the server is healthy and not overloaded. I understand that
> such problem can happen when the hypervisor host is overloaded, however
> this is not the case -- especially since when such problem happens now, it
> will strike up to 4-5 hypervisor hosts at the same time.
> 
> On the management server, here are the typical logs:
> 
> ===
> 2016-02-05 03:20:28,828 INFO  [agent.manager.AgentManagerImpl]
> (AgentTaskPool-10:null) Investigating why host 72 has disconnected with
> event AgentDisconnected
> 2016-02-05 03:20:28,828 DEBUG [agent.manager.AgentManagerImpl]
> (AgentTaskPool-10:null) checking if agent (72) is alive
> 2016-02-05 03:20:28,833 DEBUG [agent.transport.Request]
> (AgentTaskPool-10:null) Seq 72-215557373: Sending  { Cmd , MgmtId:
> 161342671900, via: 72, Ver: v1, Flags: 100011
> , [{"com.cloud.agent.api.CheckHealthCommand":{"wait":50}}] }
> 2016-02-05 03:20:28,833 INFO  [agent.manager.AgentAttache]
> (AgentTaskPool-10:null) Seq 72-215557373: Unable to send due to Resource
> [Host:72] is unreachable: Host 72: C
> hannel is closed
> 2016-02-05 03:20:28,833 DEBUG [agent.manager.AgentAttache]
> (AgentTaskPool-10:null) Seq 72-215557373: Cancelling.
> 2016-02-05 03:20:28,833 WARN  [agent.manager.AgentManagerImpl]
> (AgentTaskPool-10:null) Resource [Host:72] is unreachable: Host 72: Channel
> is closed
> 2016-02-05 03:20:28,836 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
> (AgentTaskPool-10:null) SimpleInvestigator unable to determine the state of
> the host.  Moving on.
> 2016-02-05 03:20:28,836 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
> (AgentTaskPool-10:null) XenServerInvestigator unable to determine the state
> of the host.  Moving on.
> 2016-02-05 03:20:28,836 DEBUG [cloud.ha.UserVmDomRInvestigator]
> (AgentTaskPool-10:null) checking if agent (72) is alive
> 2016-02-05 03:20:28,840 DEBUG [cloud.ha.UserVmDomRInvestigator]
> (AgentTaskPool-10:null) sending ping from (34) to agent's host ip address
> (*.*.3.25)
> 2016-02-05 03:20:28,843 DEBUG [agent.transport.Request]
> (AgentTaskPool-10:null) Seq 34-815540048: Sending  { Cmd , MgmtId:
> 161342671900, via: 34, Ver: v1, Flags: 100011,
> [{"com.cloud.agent.api.PingTestCommand":{"_computingHostIp":"*.*.3.25","wait":20}}] }
> 2016-02-05 03:20:29,024 DEBUG [agent.transport.Request]
> (AgentManager-Handler-14:null) Seq 34-815540048: Processing:  { Ans: ,
> MgmtId: 161342671900, via: 34, Ver: v1, Flags: 10,
> [{"com.cloud.agent.api.Answer":{"result":true,"wait":0}}] }
> 2016-02-05 03:20:29,024 DEBUG [agent.transport.Request]
> (AgentTaskPool-10:null) Seq 34-815540048: Received:  { Ans: , MgmtId:
> 161342671900, via: 34, Ver: v1, Flags: 10,
>  { Answer } }
> 2016-02-05 03:20:29,024 DEBUG [cloud.ha.AbstractInvestigatorImpl]
> (AgentTaskPool-10:null) host (*.*.3.25) has been successfully pinged,
> returning that host is up
> 2016-02-05 03:20:29,024 DEBUG [cloud.ha.UserVmDomRInvestigator]
> (AgentTaskPool-10:null) ping from (34) to agent's host ip address
> (*.*.3.25) successful, returning that agent is disconnected
> 2016-02-05 03:20:29,024 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
> (AgentTaskPool-10:null) null was able to determine host 72 is in
> Disconnected
> 2016-02-05 03:20:29,024 INFO  [agent.manager.AgentManagerImpl]
> (AgentTaskPool-10:null) The state determined is Disconnected
> 2016-02-05 03:20:29,024 WARN  [agent.manager.AgentManagerImpl]
> (AgentTaskPool-10:null) Agent is disconnected but the host is still up:
> 72-**-kvm-05
> 2016-02-05 03:20:29,025 WARN  [apache.cloudstack.alerts]
> (AgentTaskPool-10:null)  alertType:: 7 // dataCenterId:: 6 // podId:: 6 //
> clusterId:: null // message:: Host disconnected, name: **-kvm-05 (id:72),
> availability zone: *****-******-01, pod: *****-******-Pod-01
> 2016-02-05 03:20:29,086 INFO  [agent.manager.AgentManagerImpl]
> (AgentTaskPool-10:null) Host 72 is disconnecting with event
> AgentDisconnected
> 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
> (AgentTaskPool-10:null) The next status of agent 72 is Alert, current status
> is Up
> 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
> (AgentTaskPool-10:null) Deregistering link for 72 with state Alert
> 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
> (AgentTaskPool-10:null) Remove Agent : 72
> 2016-02-05 03:20:29,089 DEBUG [agent.manager.ConnectedAgentAttache]
> (AgentTaskPool-10:null) Processing Disconnect.
> 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentAttache]
> (AgentTaskPool-10:null) Seq 72-215547906: Sending disconnect to class
> com.cloud.network.security.SecurityGroupListener
> 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
> (AgentTaskPool-10:null) Sending Disconnect to listener:
> com.cloud.hypervisor.xen.discoverer.XcpServerDiscoverer_EnhancerByCloudStack_8f0e0536
> 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
> (AgentTaskPool-10:null) Sending Disconnect to listener:
> com.cloud.deploy.DeploymentPlanningManagerImpl_EnhancerByCloudStack_55d28fac
> 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
> (AgentTaskPool-10:null) Sending Disconnect to listener:
> com.cloud.network.NetworkManagerImpl_EnhancerByCloudStack_6763a13f
> 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
> (AgentTaskPool-10:null) Sending Disconnect to listener:
> com.cloud.storage.secondary.SecondaryStorageListener
> ===
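
Reading the trace above, the HA layer appears to walk its investigators in
order: the direct CheckHealthCommand to host 72 fails because the channel is
closed, so UserVmDomRInvestigator asks a peer (host 34) to ping the host's IP
address; the ping succeeds, so the result is "agent disconnected but the host
is still up" rather than "host down". A rough sketch of that decision flow,
purely illustrative and not CloudStack's actual code, might be:

===
# Illustrative sketch of the investigation order seen in the log above.
# Function names and return values are assumptions, not CloudStack's real API.

def investigate_host(direct_health_check, ping_from_peer):
    """Return 'Up', 'Disconnected' or 'Unknown' for a host whose agent dropped."""
    # Step 1: try to reach the agent directly (CheckHealthCommand in the log).
    if direct_health_check():
        return "Up"
    # Step 2: ask a peer host to ping the disconnected host's IP
    # (PingTestCommand sent via host 34 in the log).
    if ping_from_peer():
        # The host answers ping but its agent is unreachable:
        # "Agent is disconnected but the host is still up".
        return "Disconnected"
    # Neither check succeeded; the state cannot be determined.
    return "Unknown"

# Example mirroring the trace: agent channel closed, but the host is pingable.
state = investigate_host(direct_health_check=lambda: False,
                         ping_from_peer=lambda: True)
print("Determined state: %s" % state)   # -> Disconnected
===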
> 
> Can anyone advise how I can start troubleshooting to find the root cause
> of the problem, so that I can take whatever measures are needed to prevent
> a similar problem from happening?
> 
> Any advice is greatly appreciated.
> 
> Looking forward to your reply, thank you.
> 
> Cheers.
> 
> -ip-