Posted to users@cloudstack.apache.org by Indra Pramana <in...@sg.or.id> on 2016/04/03 13:43:35 UTC

Re: URGENT - CloudStack agent not able to connect to management server

Hi Lucian,

Good day to you, and thank you for your reply. Apologies for the delay in
my reply.

Yes, I can confirm that we can access the host and port specified. Based on
the logs, the host can connect to the management server, but there are no
follow-up log entries of the kind that usually appear after it connects.
Eventually, we could only reconnect the host after rebooting it, which meant
sacrificing all the VMs that were still up and running during the
disconnection.
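
A reachability check of this kind can be sketched in Python. This is only
an illustration: the default path /etc/cloudstack/agent/agent.properties,
the "host" and "port" key names, and all function names are assumptions.
Note that a successful plain TCP connect says nothing about whether the SSL
handshake or the agent protocol exchange that follows will complete.

```python
import socket


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def can_connect(host, port, timeout=5.0):
    """Return True if a plain TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_agent_target(path="/etc/cloudstack/agent/agent.properties"):
    """Read host/port from the properties file and try a TCP connect.

    The path and the key names are assumptions made for this sketch.
    """
    with open(path) as fh:
        props = parse_properties(fh.read())
    host = props["host"]
    port = int(props.get("port", "8250"))
    return host, port, can_connect(host, port)
```

Calling check_agent_target() on a hypervisor returns the configured host,
the port, and whether the TCP connection succeeded.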

At the time the first hypervisor was disconnected, the CloudStack
management servers were, based on the logs, very busy handling the
disconnections, trying to fence the hosts and initiate HA for all the
affected VMs. Could this have put a strain on the management server, causing
it to disconnect all the remaining hosts? Would adding a new management
server resolve the problem?

Any advice is appreciated.

Looking forward to your reply, thank you.

Cheers.

On Thu, Mar 31, 2016 at 5:28 PM, Nux! <nu...@li.nux.ro> wrote:

> Hello,
>
> Are you sure you can connect from the hypervisors to the
> cloudstack-management on the host and port specified in the
> agent.properties?
>
> --
> Sent from the Delta quadrant using Borg technology!
>
> Nux!
> www.nux.ro
>
> ----- Original Message -----
> > From: "Indra Pramana" <in...@sg.or.id>
> > To: users@cloudstack.apache.org
> > Sent: Thursday, 31 March, 2016 03:14:59
> > Subject: URGENT - CloudStack agent not able to connect to management server
>
> > Dear all,
> >
> > We are using CloudStack 4.2.0 with the KVM hypervisor and Ceph RBD storage.
> > All our agents got disconnected from the management server and are unable
> > to connect again, despite rebooting the management server and stopping and
> > restarting the cloudstack-agent many times.
> >
> > We even tried physically rebooting a hypervisor host (sacrificing all the
> > running VMs on it) to see if it could reconnect after boot-up, but it is
> > not able to reconnect (it stays in the "Connecting" state). Here are the
> > excerpts from the logs:
> >
> > ====
> > 2016-03-31 10:07:49,346 DEBUG [cloud.agent.Agent] (UgentTask-5:null) Sending ping: Seq 0-11:  { Cmd , MgmtId: -1, via: 0, Ver: v1, Flags: 11, [{"com.cloud.agent.api.PingRoutingWithNwGroupsCommand":{"newGroupStates":{},"newStates":{},"_gatewayAccessible":true,"_vnetAccessible":true,"hostType":"Routing","hostId":0,"wait":0}}] }
> > 2016-03-31 10:07:49,395 DEBUG [cloud.agent.Agent] (Agent-Handler-2:null) Received response: Seq 0-11:  { Ans: , MgmtId: 161342671900, via: 75, Ver: v1, Flags: 100010, [{"com.cloud.agent.api.PingAnswer":{"_command":{"hostType":"Routing","hostId":0,"wait":0},"result":true,"wait":0}}] }
> > 2016-03-31 10:08:49,271 DEBUG [kvm.resource.LibvirtComputingResource] (UgentTask-5:null) Executing: /usr/share/cloudstack-common/scripts/vm/network/security_group.py get_rule_logs_for_vms
> > 2016-03-31 10:08:49,350 DEBUG [kvm.resource.LibvirtComputingResource] (UgentTask-5:null) Execution is successful.
> > 2016-03-31 10:08:49,353 DEBUG [cloud.agent.Agent] (UgentTask-5:null) Sending ping: Seq 0-12:  { Cmd , MgmtId: -1, via: 0, Ver: v1, Flags: 11, [{"com.cloud.agent.api.PingRoutingWithNwGroupsCommand":{"newGroupStates":{},"newStates":{},"_gatewayAccessible":true,"_vnetAccessible":true,"hostType":"Routing","hostId":0,"wait":0}}] }
> > 2016-03-31 10:08:49,406 DEBUG [cloud.agent.Agent] (Agent-Handler-3:null) Received response: Seq 0-12:  { Ans: , MgmtId: 161342671900, via: 75, Ver: v1, Flags: 100010, [{"com.cloud.agent.api.PingAnswer":{"_command":{"hostType":"Routing","hostId":0,"wait":0},"result":true,"wait":0}}] }
> > 2016-03-31 10:09:49,272 DEBUG [kvm.resource.LibvirtComputingResource] (UgentTask-5:null) Executing: /usr/share/cloudstack-common/scripts/vm/network/security_group.py get_rule_logs_for_vms
> > 2016-03-31 10:09:49,345 DEBUG [kvm.resource.LibvirtComputingResource] (UgentTask-5:null) Execution is successful.
> > 2016-03-31 10:09:49,347 DEBUG [cloud.agent.Agent] (UgentTask-5:null) Sending ping: Seq 0-13:  { Cmd , MgmtId: -1, via: 0, Ver: v1, Flags: 11, [{"com.cloud.agent.api.PingRoutingWithNwGroupsCommand":{"newGroupStates":{},"newStates":{},"_gatewayAccessible":true,"_vnetAccessible":true,"hostType":"Routing","hostId":0,"wait":0}}] }
> > 2016-03-31 10:09:49,398 DEBUG [cloud.agent.Agent] (Agent-Handler-4:null) Received response: Seq 0-13:  { Ans: , MgmtId: 161342671900, via: 75, Ver: v1, Flags: 100010, [{"com.cloud.agent.api.PingAnswer":{"_command":{"hostType":"Routing","hostId":0,"wait":0},"result":true,"wait":0}}] }
> > ====
> >
> > On the existing hypervisor hosts, the agent normally gets stuck at this
> > stage, and in the CloudStack GUI we don't see the agent in the "Connecting"
> > state; it is in either the "Disconnected" or the "Alert" state.
> >
> > ====
> > 2016-03-31 07:37:09,819 DEBUG [utils.script.Script] (main:null) Executing: /bin/bash -c uname -r
> > 2016-03-31 07:37:09,829 DEBUG [utils.script.Script] (main:null) Execution is successful.
> > 2016-03-31 07:37:09,832 DEBUG [cloud.agent.Agent] (main:null) Adding shutdown hook
> > 2016-03-31 07:37:09,833 INFO  [cloud.agent.Agent] (main:null) Agent [id = 73 : type = LibvirtComputingResource : zone = 6 : pod = 6 : workers = 5 : host = 10.x.x.x : port = 8250
> > 2016-03-31 07:37:09,856 INFO  [utils.nio.NioClient] (Agent-Selector:null) Connecting to 10.x.x.x:8250
> > 2016-03-31 07:37:10,178 INFO  [utils.nio.NioClient] (Agent-Selector:null) SSL: Handshake done
> > 2016-03-31 07:37:10,179 INFO  [utils.nio.NioClient] (Agent-Selector:null) Connected to 10.x.x.x:8250
> > ====
> >
> > No other significant or useful entries were found in either the agent or
> > the management server logs.
> >
> > Can anyone give a clue on what the problem could be? We have been trying to
> > reconnect for the past couple of hours without any success. Any help is
> > greatly appreciated.
> >
> > Looking forward to your reply, thank you.
> >
> > Cheers.
> >
> > -ip-
>

Re: URGENT - CloudStack agent not able to connect to management server

Posted by ilya <il...@gmail.com>.
Coincidentally, we observed somewhat similar behavior with ACS 4.5 and
KVM agents (I assume Xen will be no different). Based on a code check,
this issue also exists in master, so I'd think 4.2 is no different.

Marcus can speak more intelligently about this issue than I can, but here
is what I understand from his explanation:

Summary:
CloudStack does not handle agent connections with the SSL handshake
properly: it processes each connection serially, blocking the next agent in
line until the SSL handshake goes through. But what if it does not?


Details and Example:
For example, if you open a telnet session to port 8250 on the MS, the MS
expects an SSL handshake to go through; however, the fake telnet session
does nothing other than take up a socket. The current method for handling
agent connections is "serial", which means the next proper agent in line
cannot process its tasks, is blocked, and eventually gets disconnected. As
a result, you will have many agents disconnect; then, as the "telnet"
session is dropped after 60 seconds, you will have a chance to reconnect.
However, if the improper connection on 8250 is persistent, you will have a
continuous denial of service. The improper SSL handshake can also be
sporadic, causing sporadic disconnection issues.
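
The blocking behavior described above can be reproduced with a small,
self-contained Python simulation. This is not CloudStack code: the one-byte
"handshake", the 1-second timeout (a stand-in for the 60 seconds mentioned
above), and all names are assumptions made for illustration.

```python
import socket
import threading
import time

HANDSHAKE_TIMEOUT = 1.0  # stand-in for the ~60 s mentioned above


def handshake(conn):
    """Toy 'handshake': wait for one byte from the client, then acknowledge."""
    conn.settimeout(HANDSHAKE_TIMEOUT)
    try:
        if conn.recv(1):
            conn.sendall(b"K")
    except socket.timeout:
        pass  # the client never spoke; give up on it
    finally:
        conn.close()


def serve_serially(listener, n_clients):
    """Accept and handshake one client at a time, as a serial loop would."""
    for _ in range(n_clients):
        conn, _addr = listener.accept()
        handshake(conn)  # blocks every client queued behind this one


def demo():
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))
    listener.listen(5)
    port = listener.getsockname()[1]
    server = threading.Thread(target=serve_serially, args=(listener, 2))
    server.start()

    # The "telnet" session: connects, then sends nothing.
    idle = socket.create_connection(("127.0.0.1", port))
    time.sleep(0.1)  # make sure the idle client is accepted first

    start = time.monotonic()
    good = socket.create_connection(("127.0.0.1", port))
    good.sendall(b"H")
    reply = good.recv(1)  # arrives only after the idle client times out
    waited = time.monotonic() - start

    good.close()
    idle.close()
    server.join()
    listener.close()
    return reply, waited
```

Running demo() shows the well-behaved client waiting for most of the
handshake timeout before it is acknowledged, even though it did nothing
wrong.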

With that said, we are testing an internal fix that will allow each
connection and its subsequent tasks to be treated as a separate thread by
implementing a Callable. If an improper connection comes through, it
will live in its own thread and be dropped once it reaches the timeout,
without affecting other agents' connections.
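
The idea of giving each connection its own worker can be sketched in
Python. This is not the patch itself (which is Java and uses a Callable);
ThreadPoolExecutor is swapped in here as the closest Python equivalent, and
the one-byte "handshake", the 1-second timeout, and all names are
assumptions made for illustration.

```python
import socket
import threading
import time
from concurrent.futures import ThreadPoolExecutor

HANDSHAKE_TIMEOUT = 1.0  # stand-in for the real timeout


def handshake(conn):
    """Toy 'handshake': expect one byte, acknowledge it, then close."""
    conn.settimeout(HANDSHAKE_TIMEOUT)
    try:
        if conn.recv(1):
            conn.sendall(b"K")
            return True
        return False
    except socket.timeout:
        return False  # stalled client is dropped; nobody else waited for it
    finally:
        conn.close()


def serve_concurrently(listener, n_clients, pool):
    """Accept in one loop, but hand every handshake to its own worker."""
    futures = []
    for _ in range(n_clients):
        conn, _addr = listener.accept()
        futures.append(pool.submit(handshake, conn))
    return futures


def demo():
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))
    listener.listen(5)
    port = listener.getsockname()[1]
    futures = []
    with ThreadPoolExecutor(max_workers=4) as pool:
        server = threading.Thread(
            target=lambda: futures.extend(serve_concurrently(listener, 2, pool))
        )
        server.start()

        # The stalled session: connects, then sends nothing.
        idle = socket.create_connection(("127.0.0.1", port))
        time.sleep(0.1)  # make sure the idle client is accepted first

        start = time.monotonic()
        good = socket.create_connection(("127.0.0.1", port))
        good.sendall(b"H")
        reply = good.recv(1)  # answered without waiting for the idle client
        waited = time.monotonic() - start

        server.join()
        outcomes = [f.result() for f in futures]  # idle client, then good one
    good.close()
    idle.close()
    listener.close()
    return reply, waited, outcomes
```

In this sketch the stalled client still occupies a worker until its timeout
expires, but the well-behaved client is acknowledged right away.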


Once we confirm that it works as expected, we will release a patch.

In the meantime, if you need to bring stability back to your
environment, try to find the offending connection. It could be one of
the agents going rogue or some other process trying to establish a
connection on 8250 and never completing the SSL handshake: for example, a
security scan run on the network that probes any port it finds.

Try restarting all CloudStack agents in your environment and make sure
incoming connections to the CloudStack MS on 8250 are valid agent connections.
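
One way to review who is connected on 8250 is to tally the peers in the
output of `ss -tn` on the MS and compare them against the hypervisor IPs
you expect. The Python sketch below assumes the usual `ss -tn` column
layout; the function names and the list of known agent IPs are
hypothetical. It only narrows down candidates: a connection count does not
show whether a peer completed its SSL handshake.

```python
from collections import Counter


def peers_on_port(ss_output, port=8250):
    """Count remote addresses with TCP connections to the given local port.

    Expects text in the column layout printed by `ss -tn`:
    State  Recv-Q  Send-Q  Local Address:Port  Peer Address:Port
    """
    counts = Counter()
    for line in ss_output.splitlines():
        fields = line.split()
        if len(fields) < 5 or fields[0] == "State":
            continue  # skip the header and malformed lines
        local, peer = fields[3], fields[4]
        if local.rsplit(":", 1)[-1] != str(port):
            continue  # not a connection to the port of interest
        counts[peer.rsplit(":", 1)[0]] += 1
    return counts


def unexpected_peers(counts, known_agents):
    """Return peers that are not in the (hypothetical) set of agent IPs."""
    return sorted(ip for ip in counts if ip not in known_agents)
```

The input text could come from
subprocess.run(["ss", "-tn"], capture_output=True, text=True).stdout
run on the management server.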

Putting an LB in front of the CloudStack MS will make diagnosing this issue
much harder if you want to find a rogue connection. But long-term, you
definitely want to put an LB in front of the MS.

Another interesting observation: after we implemented the change mentioned
above and restarted the MS servers, the CloudStack agents reconnected much
quicker, within a matter of seconds versus several minutes.

The fix needs more testing and baking before it's released to the public.

Regards
ilya





On 4/5/16 9:30 PM, Indra Pramana wrote:
> Hi Sanjeev and Rafael,
> 
> Good day to you, and thank you for your replies and advice.
> 
> We are getting a new management server and HA proxy load balancers. Will
> see if this can resolve the problem.
> 
> Thank you.
> 
> 
> 
> On Tue, Apr 5, 2016 at 8:24 PM, Rafael Weingärtner <
> rafaelweingartner@gmail.com> wrote:
> 
>> How many hosts (hypervisors) are you managing with a single MS?
>>
>> If you add new MSs, you need to balance their (HTTP 8080 and TCP 8250)
>> access with something like the HA proxy load balancer.
>>
>>
>>
>> On Tue, Apr 5, 2016 at 2:09 AM, Sanjeev Neelarapu <
>> sanjeev.neelarapu@accelerite.com> wrote:
>>
>>> Adding additional management server would definitely help.
>>>
>>> Best Regards,
>>> Sanjeev N
>>> Chief Product Engineer, Accelerite
>>> Off: +91 40 6722 9368 | EMail: sanjeev.neelarapu@accelerite.com
>>>
>>>
>>
>>
>>
>> --
>> Rafael Weingärtner
>>
> 

Re: URGENT - CloudStack agent not able to connect to management server

Posted by Indra Pramana <in...@sg.or.id>.
Hi Sanjeev and Rafael,

Good day to you, and thank you for your replies and advice.

We are getting a new management server and HAProxy load balancers. We will
see if this resolves the problem.

Thank you.



On Tue, Apr 5, 2016 at 8:24 PM, Rafael Weingärtner <
rafaelweingartner@gmail.com> wrote:

> How many hosts (hypervisors) are you managing with a single MS?
>
> If you add new MSs, you need to balance their (HTTP 8080 and TCP 8250)
> access with something like the HA proxy load balancer.
>
>
>
> On Tue, Apr 5, 2016 at 2:09 AM, Sanjeev Neelarapu <
> sanjeev.neelarapu@accelerite.com> wrote:
>
> > Adding additional management server would definitely help.
> >
> > Best Regards,
> > Sanjeev N
> > Chief Product Engineer, Accelerite
> > Off: +91 40 6722 9368 | EMail: sanjeev.neelarapu@accelerite.com
> >
> >
>
>
>
> --
> Rafael Weingärtner
>

Re: URGENT - CloudStack agent not able to connect to management server

Posted by Rafael Weingärtner <ra...@gmail.com>.
How many hosts (hypervisors) are you managing with a single MS?

If you add new MSs, you need to balance access to them (HTTP 8080 and
TCP 8250) with something like the HAProxy load balancer.



On Tue, Apr 5, 2016 at 2:09 AM, Sanjeev Neelarapu <
sanjeev.neelarapu@accelerite.com> wrote:

> Adding additional management server would definitely help.
>
> Best Regards,
> Sanjeev N
> Chief Product Engineer, Accelerite
> Off: +91 40 6722 9368 | EMail: sanjeev.neelarapu@accelerite.com
>
>
> -----Original Message-----
> From: Indra Pramana [mailto:indra@sg.or.id]
> Sent: Sunday, April 03, 2016 5:14 PM
> To: users@cloudstack.apache.org
> Subject: Re: URGENT - CloudStack agent not able to connect to management
> server
>
> Hi Lucian,
>
> Good day to you, and thank you for your reply. Apologise for the delay in
> my reply.
>
> Yes, I can confirm that we can access the host and port specified. Based
> on the logs, the host can connect to the management server but there's no
> follow-up logs which usually come after it's connected. Eventually, we
> could only connect back the host after we rebooted it, which means
> sacrificing all the VMs which were still up and running during the
> disconnection.
>
> At the time when the first hypervisor was disconnected, the CloudStack
> management servers were very busy handling the disconnections, trying to
> fence the hosts and initiate HA for all the affected VMs, based on the
> logs. Could this have put a strain on the management server, causing it to
> disconnect all the remaining hosts? Will adding new management server be
> able to resolve the problem?
>
> Any advice is appreciated.
>
> Looking forward to your reply, thank you.
>
> Cheers.
>
> On Thu, Mar 31, 2016 at 5:28 PM, Nux! <nu...@li.nux.ro> wrote:
>
> > Hello,
> >
> > Are you sure you can connect from the hypervisors to the
> > cloudstack-management on the host and port specified in the
> > agent.properties?
> >
> > --
> > Sent from the Delta quadrant using Borg technology!
> >
> > Nux!
> > www.nux.ro
> >
> > ----- Original Message -----
> > > From: "Indra Pramana" <in...@sg.or.id>
> > > To: users@cloudstack.apache.org
> > > Sent: Thursday, 31 March, 2016 03:14:59
> > > Subject: URGENT - CloudStack agent not able to connect to management
> > server
> >
> > > Dear all,
> > >
> > > We are using CloudStack 4.2.0, KVM hypervisor and Ceph RBD storage.
> > > All
> > our
> > > agents got disconnected from the management server and unable to
> > > connect again, despite rebooting the management server and stopping
> > > and
> > restarting
> > > the cloudstack-agent many times.
> > >
> > > We even tried to physically reboot a hypervisor host (sacrificing
> > > all the running VMs inside) to see if it can reconnect after
> > > boot-up, and it's
> > not
> > > able to reconnect (keep on "Connecting" state). Here's the excerpts
> > > from the logs:
> > >
> > > ====
> > > 2016-03-31 10:07:49,346 DEBUG [cloud.agent.Agent] (UgentTask-5:null)
> > > Sending ping: Seq 0-11:  { Cmd , MgmtId: -1, via: 0, Ver: v1, Flags:
> > > 11,
> > >
> > [{"com.cloud.agent.api.PingRoutingWithNwGroupsCommand":{"newGroupState
> > s":{},"newStates":{},"_gatewayAccessible":true,"_vnetAccessible":true,
> > "hostType":"Routing","hostId":0,"wait":0}}]
> > > }
> > > 2016-03-31 10:07:49,395 DEBUG [cloud.agent.Agent]
> > > (Agent-Handler-2:null) Received response: Seq 0-11:  { Ans: ,
> > > MgmtId: 161342671900, via: 75,
> > Ver:
> > > v1, Flags: 100010,
> > >
> > [{"com.cloud.agent.api.PingAnswer":{"_command":{"hostType":"Routing","
> > hostId":0,"wait":0},"result":true,"wait":0}}]
> > > }
> > > 2016-03-31 10:08:49,271 DEBUG
> > > [kvm.resource.LibvirtComputingResource]
> > > (UgentTask-5:null) Executing:
> > > /usr/share/cloudstack-common/scripts/vm/network/security_group.py
> > > get_rule_logs_for_vms
> > > 2016-03-31 10:08:49,350 DEBUG [kvm.resource.LibvirtComputingResource]
> > > (UgentTask-5:null) Execution is successful.
> > > 2016-03-31 10:08:49,353 DEBUG [cloud.agent.Agent] (UgentTask-5:null)
> > > Sending ping: Seq 0-12:  { Cmd , MgmtId: -1, via: 0, Ver: v1, Flags: 11,
> > > [{"com.cloud.agent.api.PingRoutingWithNwGroupsCommand":{"newGroupStates":{},"newStates":{},"_gatewayAccessible":true,"_vnetAccessible":true,"hostType":"Routing","hostId":0,"wait":0}}]
> > > }
> > > 2016-03-31 10:08:49,406 DEBUG [cloud.agent.Agent] (Agent-Handler-3:null)
> > > Received response: Seq 0-12:  { Ans: , MgmtId: 161342671900, via: 75, Ver:
> > > v1, Flags: 100010,
> > > [{"com.cloud.agent.api.PingAnswer":{"_command":{"hostType":"Routing","hostId":0,"wait":0},"result":true,"wait":0}}]
> > > }
> > > 2016-03-31 10:09:49,272 DEBUG [kvm.resource.LibvirtComputingResource]
> > > (UgentTask-5:null) Executing:
> > > /usr/share/cloudstack-common/scripts/vm/network/security_group.py
> > > get_rule_logs_for_vms
> > > 2016-03-31 10:09:49,345 DEBUG [kvm.resource.LibvirtComputingResource]
> > > (UgentTask-5:null) Execution is successful.
> > > 2016-03-31 10:09:49,347 DEBUG [cloud.agent.Agent] (UgentTask-5:null)
> > > Sending ping: Seq 0-13:  { Cmd , MgmtId: -1, via: 0, Ver: v1, Flags: 11,
> > > [{"com.cloud.agent.api.PingRoutingWithNwGroupsCommand":{"newGroupStates":{},"newStates":{},"_gatewayAccessible":true,"_vnetAccessible":true,"hostType":"Routing","hostId":0,"wait":0}}]
> > > }
> > > 2016-03-31 10:09:49,398 DEBUG [cloud.agent.Agent] (Agent-Handler-4:null)
> > > Received response: Seq 0-13:  { Ans: , MgmtId: 161342671900, via: 75, Ver:
> > > v1, Flags: 100010,
> > > [{"com.cloud.agent.api.PingAnswer":{"_command":{"hostType":"Routing","hostId":0,"wait":0},"result":true,"wait":0}}]
> > > }
> > > ====
> > >
> > > On the existing hypervisor hosts, the agent would normally get stuck at
> > > this stage, and in the CloudStack GUI we don't see the agent in the
> > > "Connecting" state; it is in either the "Disconnected" or "Alert" state.
> > >
> > > ====
> > > 2016-03-31 07:37:09,819 DEBUG [utils.script.Script] (main:null) Executing:
> > > /bin/bash -c uname -r
> > > 2016-03-31 07:37:09,829 DEBUG [utils.script.Script] (main:null) Execution
> > > is successful.
> > > 2016-03-31 07:37:09,832 DEBUG [cloud.agent.Agent] (main:null) Adding
> > > shutdown hook
> > > 2016-03-31 07:37:09,833 INFO  [cloud.agent.Agent] (main:null) Agent [id =
> > > 73 : type = LibvirtComputingResource : zone = 6 : pod = 6 : workers = 5 :
> > > host = 10.x.x.x : port = 8250
> > > 2016-03-31 07:37:09,856 INFO  [utils.nio.NioClient] (Agent-Selector:null)
> > > Connecting to 10.x.x.x:8250
> > > 2016-03-31 07:37:10,178 INFO  [utils.nio.NioClient] (Agent-Selector:null)
> > > SSL: Handshake done
> > > 2016-03-31 07:37:10,179 INFO  [utils.nio.NioClient] (Agent-Selector:null)
> > > Connected to 10.x.x.x:8250
> > > ====
> > >
> > > We found no other significant or useful entries in either the agent or
> > > management server logs.
> > >
> > > Can anyone give a clue on what the problem could be? We have been
> > > trying to reconnect for the past couple of hours without any success.
> > > Any help is greatly appreciated.
> > >
> > > Looking forward to your reply, thank you.
> > >
> > > Cheers.
> > >
> > > -ip-
> >
>
>
>
> DISCLAIMER
> ==========
> This e-mail may contain privileged and confidential information which is
> the property of Accelerite, a Persistent Systems business. It is intended
> only for the use of the individual or entity to which it is addressed. If
> you are not the intended recipient, you are not authorized to read, retain,
> copy, print, distribute or use this message. If you have received this
> communication in error, please notify the sender and delete all copies of
> this message. Accelerite, a Persistent Systems business does not accept any
> liability for virus infected mails.
>



-- 
Rafael Weingärtner

RE: URGENT - CloudStack agent not able to connect to management server

Posted by Sanjeev Neelarapu <sa...@accelerite.com>.
Adding an additional management server would definitely help.

Best Regards,
Sanjeev N
Chief Product Engineer, Accelerite
Off: +91 40 6722 9368 | EMail: sanjeev.neelarapu@accelerite.com 






Re: URGENT - CloudStack agent not able to connect to management server

Posted by Indra Pramana <in...@sg.or.id>.
Hi Somesh,

Thanks for your reply.

> Instead of rebooting the KVM hosts, you may want to try stopping the agent on
> all the hosts and then starting the agent service one by one.

We have done this; in fact, this is what we do every time we want to
reconnect a CloudStack agent (in the Alert or Disconnected state) to the
management server. We are running Ubuntu 12.04 LTS on the agent hosts as
well as on the management server.

service cloudstack-agent stop
(optional:) killall jsvc
(optional:) service libvirt-bin restart
service cloudstack-agent start

But it didn't work on that particular occasion, and we are not sure why. We
haven't had any further disconnection issues since that incident, so I don't
know whether the problem would still occur if a host got disconnected now.
It would be very disruptive to have to reboot the hypervisor host
(sacrificing all running VMs in the process) every time a host gets
disconnected for any reason.
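
As a rough sketch, the restart sequence listed above could be wrapped in a small shell helper. The function names `run` and `restart_agent`, the `DRY_RUN` variable and the `full` argument are illustrative assumptions, not part of CloudStack; only the four commands themselves come from this thread.

```shell
#!/bin/sh
# Sketch: restart the CloudStack agent using the sequence from this thread.
# DRY_RUN=1 (the default here) only prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# $1 = "full" to include the two optional steps (kill jsvc, restart libvirt).
restart_agent() {
    run service cloudstack-agent stop
    if [ "$1" = "full" ]; then
        run killall jsvc
        run service libvirt-bin restart
    fi
    run service cloudstack-agent start
}
```

Calling `restart_agent` prints the two mandatory steps and `restart_agent full` prints all four. Setting `DRY_RUN=0` in the environment before running the script would execute the commands for real, which needs root on the hypervisor host.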

Thank you.


On Wed, Apr 6, 2016 at 8:53 PM, Somesh Naidu <So...@citrix.com>
wrote:

> > Eventually, we could only connect back the host after we rebooted it,
> > which means sacrificing all the VMs which were still up and running during
> > the disconnection.
>
> Instead of rebooting the KVM hosts, you may want to try stopping the agent on
> all the hosts and then starting the agent service one by one.
>
> > Will adding new management server be able to resolve the problem?
>
> That really depends on whether your existing management servers are
> optimally tuned and their resources are still getting maxed out; if not,
> adding another server will be more of an overhead than a benefit.
>
> Regards,
> Somesh
>

RE: URGENT - CloudStack agent not able to connect to management server

Posted by Somesh Naidu <So...@citrix.com>.
> Eventually, we could only connect back the host after we rebooted it, which means sacrificing all the VMs which were still up and running during the disconnection.

Instead of rebooting the KVM hosts, you may want to try stopping the agent on all the hosts and then starting the agent service one by one.
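
The suggestion above (stop every agent first, then bring them back one at a time) might be scripted roughly as follows. This is only a sketch: the host names, root SSH access, the `PAUSE` delay and the helper names `remote` and `staggered_restart` are assumptions for illustration; only the `service cloudstack-agent stop` and `start` commands appear in this thread.

```shell
#!/bin/sh
# Sketch: stop the agent on every host first, then start it host by host.
# Host names, SSH access as root and the pause length are all assumptions.
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print what would be done
PAUSE="${PAUSE:-60}"      # seconds to wait between starts (arbitrary value)

# $1 = host, remaining arguments = command to run on that host.
remote() {
    host="$1"
    shift
    if [ "$DRY_RUN" = "1" ]; then
        echo "[$host] $*"
    else
        ssh "root@$host" "$@"
    fi
}

# Arguments: the list of hypervisor hosts.
staggered_restart() {
    # Phase 1: stop all agents so none of them reconnects early.
    for h in "$@"; do
        remote "$h" service cloudstack-agent stop
    done
    # Phase 2: start the agents one by one, pausing between hosts.
    for h in "$@"; do
        remote "$h" service cloudstack-agent start
        [ "$DRY_RUN" = "1" ] || sleep "$PAUSE"
    done
}
```

For example, `staggered_restart kvm01 kvm02` (hypothetical host names) prints both stop commands followed by both start commands. Pausing between starts is meant to give the management server time to process each reconnect before the next one arrives; the right delay would have to be found by experiment.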

> Will adding new management server be able to resolve the problem?

That really depends on whether your existing management servers are optimally tuned and their resources are still getting maxed out; if not, adding another server will be more of an overhead than a benefit.

Regards,
Somesh
