Posted to users@cloudstack.apache.org by li jerry <di...@hotmail.com> on 2019/07/18 03:00:46 UTC

Re: Agent LB for CloudStack failed

Hi Nicolas

test-ceph-node01

[root@test-ceph-node01 ~]# cat /etc/cloudstack/agent/agent.properties
#Storage
#Wed Jul 17 10:39:18 CST 2019
workers=5
guest.network.device=br0
private.network.device=br0
port=8250
resource=com.cloud.hypervisor.kvm.resource.LibvirtComputingResource
pod=1
zone=1
hypervisor.type=kvm
guid=88ca642a-e319-3369-b2c9-39c2b2bddc7c
public.network.device=br0
cluster=1
local.storage.uuid=ec28176f-a3db-4383-90c8-6dcdbc45c3e0
keystore.passphrase=O8VdcZqBwWMMxwk2
domr.scripts.dir=scripts/network/domr/kvm
LibvirtComputingResource.id=1
host=172.17.1.141,172.17.1.142@roundrobin

test-ceph-node02

[root@test-ceph-node02 ~]# cat /etc/cloudstack/agent/agent.properties
#Storage
#Wed Jul 17 10:58:23 CST 2019
guest.network.device=br0
workers=5
private.network.device=br0
port=8250
resource=com.cloud.hypervisor.kvm.resource.LibvirtComputingResource
pod=1
zone=1
guid=649cbe62-dcac-36ae-a62c-699f0e0b8af1
hypervisor.type=kvm
cluster=1
public.network.device=br0
local.storage.uuid=2fc2f796-0614-40cf-bfdf-37a9429520fb
domr.scripts.dir=scripts/network/domr/kvm
keystore.passphrase=vB48rgCk58vNJC6N
host=172.17.1.142,172.17.1.141@roundrobin
LibvirtComputingResource.id=4

test-ceph-node03

[root@test-ceph-node03 ~]# cat /etc/cloudstack/agent/agent.properties
#Storage
#Wed Jul 17 10:39:18 CST 2019
guest.network.device=br0
workers=5
private.network.device=br0
port=8250
resource=com.cloud.hypervisor.kvm.resource.LibvirtComputingResource
pod=1
zone=1
hypervisor.type=kvm
guid=4d3742c4-8678-3f21-a841-c1ffa32d0a8d
public.network.device=br0
cluster=1
local.storage.uuid=31ee15cf-b3b2-4387-b081-7c47971b9e68
keystore.passphrase=ACgs24DnBgYkORvh
domr.scripts.dir=scripts/network/domr/kvm
LibvirtComputingResource.id=5
host=172.17.1.141,172.17.1.142@roundrobin

test-ceph-node04

[root@test-ceph-node04 ~]# cat /etc/cloudstack/agent/agent.properties
#Storage
#Wed Jul 17 10:58:22 CST 2019
guest.network.device=br0
workers=5
private.network.device=br0
port=8250
resource=com.cloud.hypervisor.kvm.resource.LibvirtComputingResource
pod=1
zone=1
hypervisor.type=kvm
guid=bfd4b7ba-fd5f-365d-b4d8-a6e8e7c78c0c
public.network.device=br0
cluster=1
local.storage.uuid=2d5004ff-37b1-4f66-bff0-e71ac211f1da
keystore.passphrase=r3D4upcAOdWbwE9p
domr.scripts.dir=scripts/network/domr/kvm
LibvirtComputingResource.id=6
host=172.17.1.142,172.17.1.141@roundrobin
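Each file above stores the agent's management-server list in the `host` property as a comma-separated list with an `@algorithm` suffix (for example `host=172.17.1.141,172.17.1.142@roundrobin`). A minimal sketch of reading that value, assuming the format is exactly `ip[,ip...]@algorithm`; `HostProperty` and its methods are hypothetical names, not the CloudStack agent's own parser:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper (not CloudStack's parser): splits an agent.properties
// "host" value such as "172.17.1.141,172.17.1.142@roundrobin" into an ordered
// list of management servers and an optional algorithm name.
final class HostProperty {
    final List<String> hosts;  // management servers in preference order
    final String algorithm;    // text after '@', or null if there is none

    private HostProperty(List<String> hosts, String algorithm) {
        this.hosts = hosts;
        this.algorithm = algorithm;
    }

    static HostProperty parse(String value) {
        if (value == null) {
            throw new IllegalArgumentException("host property is missing");
        }
        String list = value.trim();
        String algorithm = null;
        int at = list.lastIndexOf('@');
        if (at >= 0) {
            algorithm = list.substring(at + 1).trim();
            list = list.substring(0, at);
        }
        List<String> hosts = new ArrayList<>();
        for (String entry : list.split(",")) {
            String host = entry.trim();
            if (!host.isEmpty()) {
                hosts.add(host);
            }
        }
        if (hosts.isEmpty()) {
            throw new IllegalArgumentException("no hosts in: " + value);
        }
        return new HostProperty(Collections.unmodifiableList(hosts), algorithm);
    }

    // The first entry is the management server the agent tries first.
    String preferred() {
        return hosts.get(0);
    }
}
```

With the four files above, node01 and node03 list 172.17.1.141 first, while node02 and node04 list 172.17.1.142 first.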

From: Nicolas Vazquez<ma...@shapeblue.com>
Sent: July 18, 2019 10:56
To: users@cloudstack.apache.org<ma...@cloudstack.apache.org>; dev@cloudstack.apache.org<ma...@cloudstack.apache.org>
Subject: Re: Agent LB for CloudStack failed

Hi Jerry,

I need some additional information. Could you provide the value of the 'host' property in agent.properties on each KVM host? I suspect that the global setting has not been propagated to the agents, because the agent keeps trying to reconnect to the same management server instead of connecting to the next one once it is down.
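The expected behaviour described here is that, once the current management server is down, the agent moves on to the next entry in its host list rather than retrying the same address (the logs in this thread instead show `Reconnecting to host:172.17.1.142`). A minimal sketch of that selection step, assuming a simple cyclic walk over the list; `Failover.nextHost` is a hypothetical name, not a CloudStack method:

```java
import java.util.List;

// Hypothetical illustration of "connect to the next management server":
// given the ordered host list and the host that just failed, return the
// entry after it, wrapping around to the start of the list.
final class Failover {
    static String nextHost(List<String> hosts, String failed) {
        if (hosts.isEmpty()) {
            throw new IllegalArgumentException("no management servers configured");
        }
        int idx = hosts.indexOf(failed);
        // A host that is not in the list falls back to the first (preferred) entry.
        if (idx < 0) {
            return hosts.get(0);
        }
        return hosts.get((idx + 1) % hosts.size());
    }
}
```

For an agent with `host=172.17.1.142,172.17.1.141`, losing 172.17.1.142 would select 172.17.1.141 on the first retry.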


Regards,

Nicolas Vazquez

________________________________
From: li jerry <di...@hotmail.com>
Sent: Monday, July 15, 2019 10:20 PM
To: users@cloudstack.apache.org <us...@cloudstack.apache.org>; dev@cloudstack.apache.org <de...@cloudstack.apache.org>
Subject: Agent LB for CloudStack failed

Hello everyone,

Agent LB for my KVM agents fails on 4.11.2/4.11.3. When the preferred management node is forcibly powered off, the agent does not immediately connect to the second management node. Only after about 15 minutes does the agent log a "No route to host" error and connect to the second management node.

Management nodes:
acs-mn01,172.17.1.141
acs-mn02,172.17.1.142

MySQL DB node:
acs-db01

KVM agent nodes:
test-ceph-node01
test-ceph-node02
test-ceph-node03
test-ceph-node04


Global settings:

host=172.17.1.142,172.17.1.141
indirect.agent.lb.algorithm=roundrobin
indirect.agent.lb.check.interval=60
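With `indirect.agent.lb.algorithm=roundrobin`, agents receive the management-server list in rotated orders so that their preferred servers are spread across the management servers (the per-host agent.properties files in this thread alternate between the two orders). The sketch below shows one simple way to produce such orderings, rotating the list by the agent's position; it is a hypothetical illustration of the idea, and `RoundRobinOrder` is not CloudStack's actual round-robin implementation:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical round-robin ordering: the agent at position `index` (0-based)
// gets the management server list rotated left by index % size, so
// neighbouring agents prefer different servers.
final class RoundRobinOrder {
    static List<String> forAgent(List<String> msList, int index) {
        if (index < 0) {
            throw new IllegalArgumentException("index must be non-negative");
        }
        if (msList.isEmpty()) {
            return Collections.emptyList();
        }
        List<String> rotated = new ArrayList<>(msList);
        // Collections.rotate with a negative distance rotates left.
        Collections.rotate(rotated, -(index % msList.size()));
        return rotated;
    }
}
```

With two management servers and four agents, this scheme makes the preferred server alternate from one agent to the next.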


Partial agent logs:

2019-07-15 23:22:39,340 DEBUG [cloud.agent.Agent] (UgentTask-5:null) (logid:) Sending ping: Seq 1-19: { Cmd , MgmtId: -1, via: 1, Ver : v1, Flags: 11, [{"com.cloud.agent.api.PingRoutingWithNwGroupsCommand":{"newGroupStates":{},"_hostVmStateReport":{},"_gatewayAccessible":true,"_vnetAccessible":true,"hostType ":"Routing","hostId":1,"wait":0}}] }
2019-07-15 23:23:09,960 DEBUG [utils.nio.NioConnection] (Agent-NioConnectionHandler-1:null) (logid:) Location 1: Socket Socket[addr=/172.17.1.142,port=8250,localport=34854] closed on read. Probably -1 returned: No route to host
2019-07-15 23:23:09,960 DEBUG [utils.nio.NioConnection] (Agent-NioConnectionHandler-1:null) (logid:) Closing socket Socket[addr=/172.17.1.142,port=8250,localport=34854]
2019-07-15 23:23:09,961 DEBUG [cloud.agent.Agent] (Agent-Handler-4:null) (logid:a4e4de49) Clearing watch list: 2
2019-07-15 23:23:09,962 INFO [cloud.agent.Agent] (Agent-Handler-4:null) (logid:a4e4de49) Lost connection to host: 172.17.1.142. Attempting reconnection while we still have 0 commands in progress.
2019-07-15 23:23:09,963 INFO [utils.nio.NioClient] (Agent-Handler-4:null) (logid:a4e4de49) NioClient connection closed
2019-07-15 23:23:09,964 INFO [cloud.agent.Agent] (Agent-Handler-4:null) (logid:a4e4de49) Reconnecting to host:172.17.1.142
2019-07-15 23:23:09,964 INFO [utils.nio.NioClient] (Agent-Handler-4:null) (logid:a4e4de49) Connecting to 172.17.1.142:8250
2019-07-15 23:23:12,972 ERROR [utils.nio.NioConnection] (Agent-Handler-4:null) (logid:a4e4de49) Unable to initialize the threads.
java.net.NoRouteToHostException: No route to host
      at sun.nio.ch.Net.connect0(Native Method)
      at sun.nio.ch.Net.connect(Net.java:454)
      at sun.nio.ch.Net.connect(Net.java:446)
      at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:648)
      at com.cloud.utils.nio.NioClient.init(NioClient.java:56)
      at com.cloud.utils.nio.NioConnection.start(NioConnection.java:95)
      at com.cloud.agent.Agent.reconnect(Agent.java:517)
      at com.cloud.agent.Agent$ServerHandler.doTask(Agent.java:1091)
      at com.clo

Nicolas.Vazquez@shapeblue.com
www.shapeblue.com<http://www.shapeblue.com>
Amadeus House, Floral Street, London WC2E 9DP, UK
@shapeblue




Re: Agent LB for CloudStack failed

Posted by li jerry <di...@hotmail.com>.
To make this easier to track, I have created a new issue:



https://github.com/apache/cloudstack/issues/3505



Any updates are welcome. Thank you.







________________________________
From: li jerry <di...@hotmail.com>
Sent: Friday, July 19, 2019 8:25:27 PM
To: dev@cloudstack.apache.org <de...@cloudstack.apache.org>; users@cloudstack.apache.org <us...@cloudstack.apache.org>
Subject: Re: Agent LB for CloudStack failed

Thank you, I look forward to your reply.






________________________________
From: Nicolas Vazquez <Ni...@shapeblue.com>
Sent: Friday, July 19, 2019 8:16:40 PM
To: users@cloudstack.apache.org <us...@cloudstack.apache.org>; dev@cloudstack.apache.org <de...@cloudstack.apache.org>
Subject: Re: Agent LB for CloudStack failed

Ok, I'll try replicating and get back to you.


Regards,

Nicolas Vazquez

________________________________
From: li jerry <di...@hotmail.com>
Sent: Thursday, July 18, 2019 4:41 AM
To: users@cloudstack.apache.org <us...@cloudstack.apache.org>; dev@cloudstack.apache.org <de...@cloudstack.apache.org>
Subject: Re: Agent LB for CloudStack failed


I added host.lb.check.interval=0 to agent.properties on every KVM host and restarted cloudstack-agent.


After the restart, the agents were connected to the management servers as follows:

mysql> select host.id ,host.name,host.mgmt_server_id,host.status,mshost.name from host,mshost where host.mgmt_server_id=mshost.msid;
+----+------------------------------------+----------------+--------+----------+
| id | name                               | mgmt_server_id | status | name     |
+----+------------------------------------+----------------+--------+----------+
|  1 | test-ceph-node01.cs2cloud.internal |  2200502468634 | Up     | acs-mn01 |
|  3 | s-8-VM                             |  2200502468634 | Up     | acs-mn01 |
|  5 | test-ceph-node03.cs2cloud.internal |  2200502468634 | Up     | acs-mn01 |
|  2 | v-9-VM                             |  2199950196764 | Up     | acs-mn02 |
|  4 | test-ceph-node02.cs2cloud.internal |  2199950196764 | Up     | acs-mn02 |
|  6 | test-ceph-node04.cs2cloud.internal |  2199950196764 | Up     | acs-mn02 |
+----+------------------------------------+----------------+--------+----------+
6 rows in set (0.00 sec)

2019-07-18 15:10: acs-mn02 was forcibly powered off.

Waiting...

About 15 minutes later (2019-07-18 15:26:23), the agent detected that the management node had failed and began to switch over.
So adding host.lb.check.interval=0 to agent.properties does not solve the problem.

Below is the log:




2019-07-18 15:26:23,414 DEBUG [utils.nio.NioConnection] (Agent-NioConnectionHandler-1:null) (logid:) Location 1: Socket Socket[addr=/172.17.1.142,port=8250,localport=33190] closed on read.  Probably -1 returned: No route to host
2019-07-18 15:26:23,416 DEBUG [utils.nio.NioConnection] (Agent-NioConnectionHandler-1:null) (logid:) Closing socket Socket[addr=/172.17.1.142,port=8250,localport=33190]
2019-07-18 15:26:23,417 DEBUG [cloud.agent.Agent] (Agent-Handler-2:null) (logid:) Clearing watch list: 2
2019-07-18 15:26:23,417 INFO  [cloud.agent.Agent] (Agent-Handler-2:null) (logid:) Lost connection to host: 172.17.1.142. Attempting reconnection while we still have 0 commands in progress.
2019-07-18 15:26:23,420 INFO  [utils.nio.NioClient] (Agent-Handler-2:null) (logid:) NioClient connection closed
2019-07-18 15:26:23,420 INFO  [cloud.agent.Agent] (Agent-Handler-2:null) (logid:) Reconnecting to host:172.17.1.142
2019-07-18 15:26:23,420 INFO  [utils.nio.NioClient] (Agent-Handler-2:null) (logid:) Connecting to 172.17.1.142:8250
2019-07-18 15:26:26,427 ERROR [utils.nio.NioConnection] (Agent-Handler-2:null) (logid:) Unable to initialize the threads.
java.net.NoRouteToHostException: No route to host
      at sun.nio.ch.Net.connect0(Native Method)
      at sun.nio.ch.Net.connect(Net.java:454)
      at sun.nio.ch.Net.connect(Net.java:446)
      at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:648)
      at com.cloud.utils.nio.NioClient.init(NioClient.java:56)
      at com.cloud.utils.nio.NioConnection.start(NioConnection.java:95)
      at com.cloud.agent.Agent.reconnect(Agent.java:517)
      at com.cloud.agent.Agent$ServerHandler.doTask(Agent.java:1091)
      at com.cloud.utils.nio.Task.call(Task.java:83)
      at com.cloud.utils.nio.Task.call(Task.java:29)
      at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      at java.lang.Thread.run(Thread.java:748)
2019-07-18 15:26:26,432 INFO  [utils.exception.CSExceptionErrorCode] (Agent-Handler-2:null) (logid:) Could not find exception: com.cloud.utils.exception.NioConnectionException in error code list for exceptions
2019-07-18 15:26:26,432 WARN  [cloud.agent.Agent] (Agent-Handler-2:null) (logid:) NIO Connection Exception  com.cloud.utils.exception.NioConnectionException: No route to host
2019-07-18 15:26:26,432 INFO  [cloud.agent.Agent] (Agent-Handler-2:null) (logid:) Attempted to connect to the server, but received an unexpected exception, trying again...
2019-07-18 15:26:26,432 INFO  [utils.nio.NioClient] (Agent-Handler-2:null) (logid:) NioClient connection closed
2019-07-18 15:26:31,433 INFO  [cloud.agent.Agent] (Agent-Handler-2:null) (logid:) Reconnecting to host:172.17.1.141
2019-07-18 15:26:31,434 INFO  [utils.nio.NioClient] (Agent-Handler-2:null) (logid:) Connecting to 172.17.1.141:8250
2019-07-18 15:26:31,435 INFO  [utils.nio.Link] (Agent-Handler-2:null) (logid:) Conf file found: /etc/cloudstack/agent/agent.properties
2019-07-18 15:26:31,545 INFO  [utils.nio.NioClient] (Agent-Handler-2:null) (logid:) SSL: Handshake done
2019-07-18 15:26:31,546 INFO  [utils.nio.NioClient] (Agent-Handler-2:null) (logid:) Connected to 172.17.1.141:8250
2019-07-18 15:26:31,564 DEBUG [kvm.resource.LibvirtConnection] (Agent-Handler-1:null) (logid:) Looking for libvirtd connection at: qemu:///system
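The timestamps in this log separate the two phases: the delay between the power-off of acs-mn02 and the first "closed on read" message, and the switch-over from "Lost connection" to "Connected to 172.17.1.141:8250". A small sketch using `java.time` to compute both from the log's `yyyy-MM-dd HH:mm:ss,SSS` timestamps; the power-off time is assumed to be 15:10:00,000 because only the minute is reported above, and `LogTiming` is a hypothetical helper:

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Computes elapsed time between two agent-log timestamps written in the
// "yyyy-MM-dd HH:mm:ss,SSS" format used by the agent log above.
final class LogTiming {
    private static final DateTimeFormatter LOG_TS =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    static Duration between(String start, String end) {
        return Duration.between(LocalDateTime.parse(start, LOG_TS),
                                LocalDateTime.parse(end, LOG_TS));
    }

    public static void main(String[] args) {
        // Assumed power-off time (seconds not reported) to the first "closed on read" line.
        Duration detection = between("2019-07-18 15:10:00,000", "2019-07-18 15:26:23,414");
        // "Lost connection to host: 172.17.1.142" to "Connected to 172.17.1.141:8250".
        Duration switchOver = between("2019-07-18 15:26:23,417", "2019-07-18 15:26:31,546");
        System.out.println("detection: " + detection + ", switch-over: " + switchOver);
    }
}
```

Running it prints `detection: PT16M23.414S, switch-over: PT8.129S`: once the lost connection is noticed, the agent reaches 172.17.1.141 within seconds, so almost all of the delay comes before the connection is declared lost.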

From: Nicolas Vazquez<ma...@shapeblue.com>
Sent: July 18, 2019 12:48
To: dev@cloudstack.apache.org<ma...@cloudstack.apache.org>; users@cloudstack.apache.org<ma...@cloudstack.apache.org>
Subject: Re: Agent LB for CloudStack failed

Thanks,

I suspect the culprit is the background task trying to reconnect to the preferred host (which runs every 60 seconds).

I would suggest disabling the background task by setting the interval to 0. Since you do not want to change the 'host' global setting (which would propagate a new list to the agents), you can do it this way:

- Add this line to agent.properties: host.lb.check.interval=0
- Restart the agent

Please let me know if this fixes your issue.
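This suggestion relies on the agent skipping its periodic preferred-host check when `host.lb.check.interval` is 0. A minimal sketch of that pattern with a `ScheduledExecutorService`, assuming the interval is read in seconds from agent.properties and that 0, a negative value, or a missing key means "do not schedule"; `PreferredHostCheck` and the check task are hypothetical, not the agent's real code:

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: schedule a periodic "reconnect to preferred host" check
// only when host.lb.check.interval is a positive number of seconds.
final class PreferredHostCheck {
    static long intervalSeconds(Properties props) {
        String raw = props.getProperty("host.lb.check.interval");
        if (raw == null) {
            return 0L; // assumption for this sketch: a missing key means disabled
        }
        try {
            return Long.parseLong(raw.trim());
        } catch (NumberFormatException e) {
            return 0L; // unparsable values are also treated as disabled here
        }
    }

    // Returns null when the check is disabled, otherwise the scheduled task.
    static ScheduledFuture<?> schedule(ScheduledExecutorService executor,
                                       Properties props, Runnable check) {
        long interval = intervalSeconds(props);
        if (interval <= 0) {
            return null; // interval 0 disables the background task
        }
        return executor.scheduleAtFixedRate(check, interval, interval, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader("host.lb.check.interval=0\n"));
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        ScheduledFuture<?> task = schedule(executor, props,
                () -> System.out.println("checking preferred host"));
        System.out.println(task == null ? "background check disabled" : "background check scheduled");
        executor.shutdownNow();
    }
}
```

Note that, per the test reported earlier in this thread, disabling this check alone did not shorten the roughly 15-minute delay.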


Regards,

Nicolas Vazquez

答复: Agent LB for CloudStack failed

Posted by li jerry <di...@hotmail.com>.
In order to facilitate tracking, I created a new issue.



https://github.com/apache/cloudstack/issues/3505



Welcome to update, thank you







________________________________
发件人: li jerry <di...@hotmail.com>
发送时间: Friday, July 19, 2019 8:25:27 PM
收件人: dev@cloudstack.apache.org <de...@cloudstack.apache.org>; users@cloudstack.apache.org <us...@cloudstack.apache.org>
主题: 答复: Agent LB for CloudStack failed

Thank you, look forward to your reply.



发送自 Windows 10 版邮件<https://go.microsoft.com/fwlink/?LinkId=550986>应用



________________________________
发件人: Nicolas Vazquez <Ni...@shapeblue.com>
发送时间: Friday, July 19, 2019 8:16:40 PM
收件人: users@cloudstack.apache.org <us...@cloudstack.apache.org>; dev@cloudstack.apache.org <de...@cloudstack.apache.org>
主题: Re: Agent LB for CloudStack failed

Ok, I'll try replicating and get back to you.


Regards,

Nicolas Vazquez

________________________________
From: li jerry <di...@hotmail.com>
Sent: Thursday, July 18, 2019 4:41 AM
To: users@cloudstack.apache.org <us...@cloudstack.apache.org>; dev@cloudstack.apache.org <de...@cloudstack.apache.org>
Subject: 答复: Agent LB for CloudStack failed


I added host.lb.check.interval = 0 to all agent.properties and restarted the cloudstack-agent


The following is the connection status of the agent after reboot.

mysql> select host.id ,host.name,host.mgmt_server_id,host.status,mshost.name from host,mshost where host.mgmt_server_id=mshost.msid;
+----+------------------------------------+----------------+--------+----------+
| id | name                               | mgmt_server_id | status | name     |
+----+------------------------------------+----------------+--------+----------+
|  1 | test-ceph-node01.cs2cloud.internal |  2200502468634 | Up     | acs-mn01 |
|  3 | s-8-VM                             |  2200502468634 | Up     | acs-mn01 |
|  5 | test-ceph-node03.cs2cloud.internal |  2200502468634 | Up     | acs-mn01 |
|  2 | v-9-VM                             |  2199950196764 | Up     | acs-mn02 |
|  4 | test-ceph-node02.cs2cloud.internal |  2199950196764 | Up     | acs-mn02 |
|  6 | test-ceph-node04.cs2cloud.internal |  2199950196764 | Up     | acs-mn02 |
+----+------------------------------------+----------------+--------+----------+
6 rows in set (0.00 sec)

2019-07-18 15:10 Forced power off to close acs-mn02

wait....................................

After the 15th minute (2019-07-18 15:26:23), the agent found that the management node failed and began to switch.
So, add host.lb.check.interval=0 to agent. properties doesn't solve the problem.

Below is the log




2019-07-18 15:26:23,414 DEBUG [utils.nio.NioConnection] (Agent-NioConnectionHandler-1:null) (logid:) Location 1: Socket Socket[addr=/172.17.1.142,port=8250,localport=33190] closed on read.  Probably -1 returned: No route to host
2019-07-18 15:26:23,416 DEBUG [utils.nio.NioConnection] (Agent-NioConnectionHandler-1:null) (logid:) Closing socket Socket[addr=/172.17.1.142,port=8250,localport=33190]
2019-07-18 15:26:23,417 DEBUG [cloud.agent.Agent] (Agent-Handler-2:null) (logid:) Clearing watch list: 2
2019-07-18 15:26:23,417 INFO  [cloud.agent.Agent] (Agent-Handler-2:null) (logid:) Lost connection to host: 172.17.1.142. Attempting reconnection while we still have 0 commands in progress.
2019-07-18 15:26:23,420 INFO  [utils.nio.NioClient] (Agent-Handler-2:null) (logid:) NioClient connection closed
2019-07-18 15:26:23,420 INFO  [cloud.agent.Agent] (Agent-Handler-2:null) (logid:) Reconnecting to host:172.17.1.142
2019-07-18 15:26:23,420 INFO  [utils.nio.NioClient] (Agent-Handler-2:null) (logid:) Connecting to 172.17.1.142:8250
2019-07-18 15:26:26,427 ERROR [utils.nio.NioConnection] (Agent-Handler-2:null) (logid:) Unable to initialize the threads.
java.net.NoRouteToHostException: No route to host
      at sun.nio.ch.Net.connect0(Native Method)
      at sun.nio.ch.Net.connect(Net.java:454)
      at sun.nio.ch.Net.connect(Net.java:446)
      at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:648)
      at com.cloud.utils.nio.NioClient.init(NioClient.java:56)
      at com.cloud.utils.nio.NioConnection.start(NioConnection.java:95)
      at com.cloud.agent.Agent.reconnect(Agent.java:517)
      at com.cloud.agent.Agent$ServerHandler.doTask(Agent.java:1091)
      at com.cloud.utils.nio.Task.call(Task.java:83)
      at com.cloud.utils.nio.Task.call(Task.java:29)
      at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      at java.lang.Thread.run(Thread.java:748)
2019-07-18 15:26:26,432 INFO  [utils.exception.CSExceptionErrorCode] (Agent-Handler-2:null) (logid:) Could not find exception: com.cloud.utils.exception.NioConnectionException in error code list for exceptions
2019-07-18 15:26:26,432 WARN  [cloud.agent.Agent] (Agent-Handler-2:null) (logid:) NIO Connection Exception  com.cloud.utils.exception.NioConnectionException: No route to host
2019-07-18 15:26:26,432 INFO  [cloud.agent.Agent] (Agent-Handler-2:null) (logid:) Attempted to connect to the server, but received an unexpected exception, trying again...
2019-07-18 15:26:26,432 INFO  [utils.nio.NioClient] (Agent-Handler-2:null) (logid:) NioClient connection closed
2019-07-18 15:26:31,433 INFO  [cloud.agent.Agent] (Agent-Handler-2:null) (logid:) Reconnecting to host:172.17.1.141
2019-07-18 15:26:31,434 INFO  [utils.nio.NioClient] (Agent-Handler-2:null) (logid:) Connecting to 172.17.1.141:8250
2019-07-18 15:26:31,435 INFO  [utils.nio.Link] (Agent-Handler-2:null) (logid:) Conf file found: /etc/cloudstack/agent/agent.properties
2019-07-18 15:26:31,545 INFO  [utils.nio.NioClient] (Agent-Handler-2:null) (logid:) SSL: Handshake done
2019-07-18 15:26:31,546 INFO  [utils.nio.NioClient] (Agent-Handler-2:null) (logid:) Connected to 172.17.1.141:8250
2019-07-18 15:26:31,564 DEBUG [kvm.resource.LibvirtConnection] (Agent-Handler-1:null) (logid:) Looking for libvirtd connection at: qemu:///system

发件人: Nicolas Vazquez<ma...@shapeblue.com>
发送时间: 2019年7月18日 12:48
收件人: dev@cloudstack.apache.org<ma...@cloudstack.apache.org>; users@cloudstack.apache.org<ma...@cloudstack.apache.org>
主题: Re: Agent LB for CloudStack failed

Thanks,

I suspect the culprit is the background task trying to reconnect to the preferred host (which runs every 60 seconds).

I would suggest disabling the background task by setting the interval to 0. As you do not want to change your 'host' global configuration to propagate a new list to the agents, you should do it this way:

- Add this line to agent.properties: host.lb.check.interval=0
- Restart the agent

Please let me know if this fixes your issue.


Regards,

Nicolas Vazquez

________________________________
From: li jerry <di...@hotmail.com>
Sent: Thursday, July 18, 2019 12:00 AM
To: dev@cloudstack.apache.org <de...@cloudstack.apache.org>; users@cloudstack.apache.org <us...@cloudstack.apache.org>
Subject: 答复: Agent LB for CloudStack failed

Hi Nicolas

test-ceph-node01

[root@test-ceph-node01 ~]# cat /etc/cloudstack/agent/agent.properties
#Storage
#Wed Jul 17 10:39:18 CST 2019
workers=5
guest.network.device=br0
private.network.device=br0
port=8250
resource=com.cloud.hypervisor.kvm.resource.LibvirtComputingResource
pod=1
zone=1
hypervisor.type=kvm
guid=88ca642a-e319-3369-b2c9-39c2b2bddc7c
public.network.device=br0
cluster=1
local.storage.uuid=ec28176f-a3db-4383-90c8-6dcdbc45c3e0
keystore.passphrase=O8VdcZqBwWMMxwk2
domr.scripts.dir=scripts/network/domr/kvm
LibvirtComputingResource.id=1
host=172.17.1.141,172.17.1.142@roundrobin

this is test-ceph-node02

[root@test-ceph-node02 ~]# cat /etc/cloudstack/agent/agent.properties
#Storage
#Wed Jul 17 10:58:23 CST 2019
guest.network.device=br0
workers=5
private.network.device=br0
port=8250
resource=com.cloud.hypervisor.kvm.resource.LibvirtComputingResource
pod=1
zone=1
guid=649cbe62-dcac-36ae-a62c-699f0e0b8af1
hypervisor.type=kvm
cluster=1
public.network.device=br0
local.storage.uuid=2fc2f796-0614-40cf-bfdf-37a9429520fb
domr.scripts.dir=scripts/network/domr/kvm
keystore.passphrase=vB48rgCk58vNJC6N
host=172.17.1.142,172.17.1.141@roundrobin
LibvirtComputingResource.id=4

test-ceph-node03

[root@test-ceph-node03 ~]# cat /etc/cloudstack/agent/agent.properties
#Storage
#Wed Jul 17 10:39:18 CST 2019
guest.network.device=br0
workers=5
private.network.device=br0
port=8250
resource=com.cloud.hypervisor.kvm.resource.LibvirtComputingResource
pod=1
zone=1
hypervisor.type=kvm
guid=4d3742c4-8678-3f21-a841-c1ffa32d0a8d
public.network.device=br0
cluster=1
local.storage.uuid=31ee15cf-b3b2-4387-b081-7c47971b9e68
keystore.passphrase=ACgs24DnBgYkORvh
domr.scripts.dir=scripts/network/domr/kvm
LibvirtComputingResource.id=5
host=172.17.1.141,172.17.1.142@roundrobin

test-ceph-node04
[root@test-ceph-node04 ~]# cat /etc/cloudstack/agent/agent.properties
#Storage
#Wed Jul 17 10:58:22 CST 2019
guest.network.device=br0
workers=5
private.network.device=br0
port=8250
resource=com.cloud.hypervisor.kvm.resource.LibvirtComputingResource
pod=1
zone=1
hypervisor.type=kvm
guid=bfd4b7ba-fd5f-365d-b4d8-a6e8e7c78c0c
public.network.device=br0
cluster=1
local.storage.uuid=2d5004ff-37b1-4f66-bff0-e71ac211f1da
keystore.passphrase=r3D4upcAOdWbwE9p
domr.scripts.dir=scripts/network/domr/kvm
LibvirtComputingResource.id=6
host=172.17.1.142,172.17.1.141@roundrobin

发件人: Nicolas Vazquez<ma...@shapeblue.com>
发送时间: 2019年7月18日 10:56
收件人: users@cloudstack.apache.org<ma...@cloudstack.apache.org>; dev@cloudstack.apache.org<ma...@cloudstack.apache.org>
主题: Re: Agent LB for CloudStack failed

Hi Jerry,

I'll request some additional information. Can you provide me with the value stored on agent.properties for 'host' property on each KVM host? I suspect that the global setting has not been propagated to the agents, as it is trying to reconnect instead of connecting to the next management server once it is down.


Regards,

Nicolas Vazquez

________________________________
From: li jerry <di...@hotmail.com>
Sent: Monday, July 15, 2019 10:20 PM
To: users@cloudstack.apache.org <us...@cloudstack.apache.org>; dev@cloudstack.apache.org <de...@cloudstack.apache.org>
Subject: Agent LB for CloudStack failed

Hello everyone

My kvm Agent LB on 4.11.2/4.11.3 failed. When the preferred managment node is forced to power off, the agent will not immediately connect to the second management node.After 15 minutes, the agent issues a "No route to host" error and connects to the second management node.

management node:
acs-mn01,172.17.1.141
acs-mn02,172.17.1.142

mysql db node:
acs-db01

kvmm agent node:
test-ceph-node01
test-ceph-node02
test-ceph-node03
test-ceph-node04


global seting

host=172.17.1.142,172.17.1.141
indirect.agent.lb.algorithm=roundrobin
indirect.agent.lb.check.interval=60


Partial agnet logs:

2019-07-15 23:22:39,340 DEBUG [cloud.agent.Agent] (UgentTask-5:null) (logid:) Sending ping: Seq 1-19: { Cmd , MgmtId: -1, via: 1, Ver : v1, Flags: 11, [{"com.cloud.agent.api.PingRoutingWithNwGroupsCommand":{"newGroupStates":{},"_hostVmStateReport":{},"_gatewayAccessible":true,"_vnetAccessible":true,"hostType ":"Routing","hostId":1,"wait":0}}] }
2019-07-15 23:23:09,960 DEBUG [utils.nio.NioConnection] (Agent-NioConnectionHandler-1:null) (logid:) Location 1: Socket Socket[addr=/172.17.1.142,port=8250,localport= 34854] closed on read. Probably -1 returned: No route to host
2019-07-15 23:23:09,960 DEBUG [utils.nio.NioConnection] (Agent-NioConnectionHandler-1:null) (logid:) Closing socket Socket[addr=/172.17.1.142,port=8250,localport=34854]
2019-07-15 23:23:09,961 DEBUG [cloud.agent.Agent] (Agent-Handler-4:null) (logid:a4e4de49) Clearing watch list: 2
2019-07-15 23:23:09,962 INFO [cloud.agent.Agent] (Agent-Handler-4:null) (logid:a4e4de49) Lost connection to host: 172.17.1.142. Attempting reconnection while we still have 0 commands in Progress.
2019-07-15 23:23:09,963 INFO [utils.nio.NioClient] (Agent-Handler-4:null) (logid:a4e4de49) NioClient connection closed
2019-07-15 23:23:09,964 INFO [cloud.agent.Agent] (Agent-Handler-4:null) (logid:a4e4de49) Reconnecting to host:172.17.1.142
2019-07-15 23:23:09,964 INFO [utils.nio.NioClient] (Agent-Handler-4:null) (logid:a4e4de49) Connecting to 172.17.1.142:8250
2019-07-15 23:23:12,972 ERROR [utils.nio.NioConnection] (Agent-Handler-4:null) (logid:a4e4de49) Unable to initialize the threads.
java.net.NoRouteToHostException: No route to host
      at sun.nio.ch.Net.connect0(Native Method)
      at sun.nio.ch.Net.connect(Net.java:454)
      at sun.nio.ch.Net.connect(Net.java:446)
      at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:648)
      at com.cloud.utils.nio.NioClient.init(NioClient.java:56)
      at com.cloud.utils.nio.NioConnection.start(NioConnection.java:95)
      at com.cloud.agent.Agent.reconnect(Agent.java:517)
      at com.cloud.agent.Agent$ServerHandler.doTask(Agent.java:1091)
      at com.clo

Nicolas.Vazquez@shapeblue.com
www.shapeblue.com<http://www.shapeblue.com>
Amadeus House, Floral Street, London  WC2E 9DPUK
@shapeblue




Re: Agent LB for CloudStack failed

Posted by li jerry <di...@hotmail.com>.
Thank you, look forward to your reply.



Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10



________________________________
From: Nicolas Vazquez <Ni...@shapeblue.com>
Sent: Friday, July 19, 2019 8:16:40 PM
To: users@cloudstack.apache.org <us...@cloudstack.apache.org>; dev@cloudstack.apache.org <de...@cloudstack.apache.org>
Subject: Re: Agent LB for CloudStack failed

Ok, I'll try replicating and get back to you.


Regards,

Nicolas Vazquez

________________________________
From: li jerry <di...@hotmail.com>
Sent: Thursday, July 18, 2019 4:41 AM
To: users@cloudstack.apache.org <us...@cloudstack.apache.org>; dev@cloudstack.apache.org <de...@cloudstack.apache.org>
Subject: Re: Agent LB for CloudStack failed


I added host.lb.check.interval=0 to agent.properties on all hosts and restarted cloudstack-agent.


The following is the connection status of the agents after the restart:

mysql> select host.id ,host.name,host.mgmt_server_id,host.status,mshost.name from host,mshost where host.mgmt_server_id=mshost.msid;
+----+------------------------------------+----------------+--------+----------+
| id | name                               | mgmt_server_id | status | name     |
+----+------------------------------------+----------------+--------+----------+
|  1 | test-ceph-node01.cs2cloud.internal |  2200502468634 | Up     | acs-mn01 |
|  3 | s-8-VM                             |  2200502468634 | Up     | acs-mn01 |
|  5 | test-ceph-node03.cs2cloud.internal |  2200502468634 | Up     | acs-mn01 |
|  2 | v-9-VM                             |  2199950196764 | Up     | acs-mn02 |
|  4 | test-ceph-node02.cs2cloud.internal |  2199950196764 | Up     | acs-mn02 |
|  6 | test-ceph-node04.cs2cloud.internal |  2199950196764 | Up     | acs-mn02 |
+----+------------------------------------+----------------+--------+----------+
6 rows in set (0.00 sec)

2019-07-18 15:10: forcibly powered off acs-mn02.

wait....................................

About 15 minutes later (2019-07-18 15:26:23), the agent detected that the management node had failed and began to switch over.
So adding host.lb.check.interval=0 to agent.properties does not solve the problem.

Below is the log:




2019-07-18 15:26:23,414 DEBUG [utils.nio.NioConnection] (Agent-NioConnectionHandler-1:null) (logid:) Location 1: Socket Socket[addr=/172.17.1.142,port=8250,localport=33190] closed on read.  Probably -1 returned: No route to host
2019-07-18 15:26:23,416 DEBUG [utils.nio.NioConnection] (Agent-NioConnectionHandler-1:null) (logid:) Closing socket Socket[addr=/172.17.1.142,port=8250,localport=33190]
2019-07-18 15:26:23,417 DEBUG [cloud.agent.Agent] (Agent-Handler-2:null) (logid:) Clearing watch list: 2
2019-07-18 15:26:23,417 INFO  [cloud.agent.Agent] (Agent-Handler-2:null) (logid:) Lost connection to host: 172.17.1.142. Attempting reconnection while we still have 0 commands in progress.
2019-07-18 15:26:23,420 INFO  [utils.nio.NioClient] (Agent-Handler-2:null) (logid:) NioClient connection closed
2019-07-18 15:26:23,420 INFO  [cloud.agent.Agent] (Agent-Handler-2:null) (logid:) Reconnecting to host:172.17.1.142
2019-07-18 15:26:23,420 INFO  [utils.nio.NioClient] (Agent-Handler-2:null) (logid:) Connecting to 172.17.1.142:8250
2019-07-18 15:26:26,427 ERROR [utils.nio.NioConnection] (Agent-Handler-2:null) (logid:) Unable to initialize the threads.
java.net.NoRouteToHostException: No route to host
      at sun.nio.ch.Net.connect0(Native Method)
      at sun.nio.ch.Net.connect(Net.java:454)
      at sun.nio.ch.Net.connect(Net.java:446)
      at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:648)
      at com.cloud.utils.nio.NioClient.init(NioClient.java:56)
      at com.cloud.utils.nio.NioConnection.start(NioConnection.java:95)
      at com.cloud.agent.Agent.reconnect(Agent.java:517)
      at com.cloud.agent.Agent$ServerHandler.doTask(Agent.java:1091)
      at com.cloud.utils.nio.Task.call(Task.java:83)
      at com.cloud.utils.nio.Task.call(Task.java:29)
      at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      at java.lang.Thread.run(Thread.java:748)
2019-07-18 15:26:26,432 INFO  [utils.exception.CSExceptionErrorCode] (Agent-Handler-2:null) (logid:) Could not find exception: com.cloud.utils.exception.NioConnectionException in error code list for exceptions
2019-07-18 15:26:26,432 WARN  [cloud.agent.Agent] (Agent-Handler-2:null) (logid:) NIO Connection Exception  com.cloud.utils.exception.NioConnectionException: No route to host
2019-07-18 15:26:26,432 INFO  [cloud.agent.Agent] (Agent-Handler-2:null) (logid:) Attempted to connect to the server, but received an unexpected exception, trying again...
2019-07-18 15:26:26,432 INFO  [utils.nio.NioClient] (Agent-Handler-2:null) (logid:) NioClient connection closed
2019-07-18 15:26:31,433 INFO  [cloud.agent.Agent] (Agent-Handler-2:null) (logid:) Reconnecting to host:172.17.1.141
2019-07-18 15:26:31,434 INFO  [utils.nio.NioClient] (Agent-Handler-2:null) (logid:) Connecting to 172.17.1.141:8250
2019-07-18 15:26:31,435 INFO  [utils.nio.Link] (Agent-Handler-2:null) (logid:) Conf file found: /etc/cloudstack/agent/agent.properties
2019-07-18 15:26:31,545 INFO  [utils.nio.NioClient] (Agent-Handler-2:null) (logid:) SSL: Handshake done
2019-07-18 15:26:31,546 INFO  [utils.nio.NioClient] (Agent-Handler-2:null) (logid:) Connected to 172.17.1.141:8250
2019-07-18 15:26:31,564 DEBUG [kvm.resource.LibvirtConnection] (Agent-Handler-1:null) (logid:) Looking for libvirtd connection at: qemu:///system
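The timings can be read straight off the report and the log above; the Python sketch below just does the arithmetic, with the power-off time known only to the minute. The last part is a hedged hypothesis that nobody in this thread has confirmed: on Linux, an established TCP connection whose peer vanishes without resetting it is only reported as failed after the kernel exhausts its retransmissions (net.ipv4.tcp_retries2, default 15), which the kernel's ip-sysctl documentation puts at roughly 924.6 seconds, about 15 minutes.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S,%f"


def ts(text):
    """Parse a CloudStack agent log timestamp such as '2019-07-18 15:26:23,414'."""
    return datetime.strptime(text, FMT)


# Power-off time is only known to the minute from the report above.
power_off = datetime(2019, 7, 18, 15, 10)
lost = ts("2019-07-18 15:26:23,414")          # socket to 172.17.1.142 closed on read
retry_failed = ts("2019-07-18 15:26:26,427")  # reconnect to .142: No route to host
connected = ts("2019-07-18 15:26:31,546")     # connected to 172.17.1.141

detection_delay = (lost - power_off).total_seconds()
failover_time = (connected - lost).total_seconds()
print(f"detection delay: ~{detection_delay:.0f} s")  # about 16 minutes
print(f"failover once detected: {failover_time:.1f} s")

# Hypothesis only: Linux default tcp_retries2=15 means 16 transmissions with
# exponential backoff starting at 200 ms and capped at 120 s per interval.
rto = [min(0.2 * 2 ** i, 120.0) for i in range(16)]
print(f"default tcp_retries2 budget: ~{sum(rto):.1f} s")
```

Once the failure is noticed, the switch to 172.17.1.141 takes only a few seconds; nearly all of the delay is spent before the agent notices that 172.17.1.142 is gone.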

From: Nicolas Vazquez<ma...@shapeblue.com>
Sent: July 18, 2019 12:48
To: dev@cloudstack.apache.org<ma...@cloudstack.apache.org>; users@cloudstack.apache.org<ma...@cloudstack.apache.org>
Subject: Re: Agent LB for CloudStack failed

Thanks,

I suspect the culprit is the background task trying to reconnect to the preferred host (which runs every 60 seconds).
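The background task mentioned here can be pictured with the following Python sketch. It is a simplified, hypothetical model of the behaviour described in this message, not CloudStack's code: on each check, an agent that is connected to a server other than the first entry of its "host=" list tries to move back to that preferred server, and an interval of 0 switches the check off.

```python
def should_return_to_preferred(interval_s, connected_to, host_list, elapsed_s):
    """Hypothetical model of the periodic 'back to preferred host' check.

    interval_s:   check interval in seconds (0 or less disables the task)
    connected_to: management server the agent is connected to right now
    host_list:    ordered servers from the agent's 'host=' property
    elapsed_s:    seconds since the check last ran
    """
    if interval_s <= 0:
        return False  # task disabled, e.g. host.lb.check.interval=0
    if not host_list or connected_to == host_list[0]:
        return False  # already on the preferred server
    return elapsed_s >= interval_s


hosts = ["172.17.1.142", "172.17.1.141"]
print(should_return_to_preferred(60, "172.17.1.141", hosts, 60))  # True
print(should_return_to_preferred(0, "172.17.1.141", hosts, 600))  # False
```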

I would suggest disabling the background task by setting the interval to 0. As you do not want to change your 'host' global configuration to propagate a new list to the agents, you should do it this way:

- Add this line to agent.properties: host.lb.check.interval=0
- Restart the agent
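On each KVM host these two steps amount to editing /etc/cloudstack/agent/agent.properties and then restarting the cloudstack-agent service (for example with "systemctl restart cloudstack-agent" on systemd-based hosts). Below is a minimal Python sketch of the edit, assuming simple key=value lines. The helper name is hypothetical, and it does not handle every syntax Java properties files allow, such as ':' separators or line continuations.

```python
def set_property(text, key, value):
    """Return agent.properties content with exactly one 'key=value' line.

    Existing definitions of the key (spaces around '=' allowed) are replaced
    by a single line; if the key is absent, it is appended at the end.
    Comment lines starting with '#' or '!' are left untouched.
    """
    out, found = [], False
    for line in text.splitlines():
        stripped = line.strip()
        is_entry = "=" in stripped and not stripped.startswith(("#", "!"))
        if is_entry and stripped.split("=", 1)[0].strip() == key:
            if not found:
                out.append(f"{key}={value}")
                found = True
            continue  # drop any further definitions of the same key
        out.append(line)
    if not found:
        out.append(f"{key}={value}")
    return "\n".join(out) + "\n"


sample = "workers=5\nport=8250\nhost=172.17.1.141,172.17.1.142@roundrobin\n"
print(set_property(sample, "host.lb.check.interval", "0"), end="")
```

Running it twice on the same content gives the same result, so it is safe to apply to hosts that already have the line.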

Please let me know if this fixes your issue.


Regards,

Nicolas Vazquez

________________________________
From: li jerry <di...@hotmail.com>
Sent: Thursday, July 18, 2019 12:00 AM
To: dev@cloudstack.apache.org <de...@cloudstack.apache.org>; users@cloudstack.apache.org <us...@cloudstack.apache.org>
Subject: Re: Agent LB for CloudStack failed

Hi Nicolas

test-ceph-node01

[root@test-ceph-node01 ~]# cat /etc/cloudstack/agent/agent.properties
#Storage
#Wed Jul 17 10:39:18 CST 2019
workers=5
guest.network.device=br0
private.network.device=br0
port=8250
resource=com.cloud.hypervisor.kvm.resource.LibvirtComputingResource
pod=1
zone=1
hypervisor.type=kvm
guid=88ca642a-e319-3369-b2c9-39c2b2bddc7c
public.network.device=br0
cluster=1
local.storage.uuid=ec28176f-a3db-4383-90c8-6dcdbc45c3e0
keystore.passphrase=O8VdcZqBwWMMxwk2
domr.scripts.dir=scripts/network/domr/kvm
LibvirtComputingResource.id=1
host=172.17.1.141,172.17.1.142@roundrobin

test-ceph-node02

[root@test-ceph-node02 ~]# cat /etc/cloudstack/agent/agent.properties
#Storage
#Wed Jul 17 10:58:23 CST 2019
guest.network.device=br0
workers=5
private.network.device=br0
port=8250
resource=com.cloud.hypervisor.kvm.resource.LibvirtComputingResource
pod=1
zone=1
guid=649cbe62-dcac-36ae-a62c-699f0e0b8af1
hypervisor.type=kvm
cluster=1
public.network.device=br0
local.storage.uuid=2fc2f796-0614-40cf-bfdf-37a9429520fb
domr.scripts.dir=scripts/network/domr/kvm
keystore.passphrase=vB48rgCk58vNJC6N
host=172.17.1.142,172.17.1.141@roundrobin
LibvirtComputingResource.id=4

test-ceph-node03

[root@test-ceph-node03 ~]# cat /etc/cloudstack/agent/agent.properties
#Storage
#Wed Jul 17 10:39:18 CST 2019
guest.network.device=br0
workers=5
private.network.device=br0
port=8250
resource=com.cloud.hypervisor.kvm.resource.LibvirtComputingResource
pod=1
zone=1
hypervisor.type=kvm
guid=4d3742c4-8678-3f21-a841-c1ffa32d0a8d
public.network.device=br0
cluster=1
local.storage.uuid=31ee15cf-b3b2-4387-b081-7c47971b9e68
keystore.passphrase=ACgs24DnBgYkORvh
domr.scripts.dir=scripts/network/domr/kvm
LibvirtComputingResource.id=5
host=172.17.1.141,172.17.1.142@roundrobin

test-ceph-node04
[root@test-ceph-node04 ~]# cat /etc/cloudstack/agent/agent.properties
#Storage
#Wed Jul 17 10:58:22 CST 2019
guest.network.device=br0
workers=5
private.network.device=br0
port=8250
resource=com.cloud.hypervisor.kvm.resource.LibvirtComputingResource
pod=1
zone=1
hypervisor.type=kvm
guid=bfd4b7ba-fd5f-365d-b4d8-a6e8e7c78c0c
public.network.device=br0
cluster=1
local.storage.uuid=2d5004ff-37b1-4f66-bff0-e71ac211f1da
keystore.passphrase=r3D4upcAOdWbwE9p
domr.scripts.dir=scripts/network/domr/kvm
LibvirtComputingResource.id=6
host=172.17.1.142,172.17.1.141@roundrobin
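To compare the four files at a glance, this Python sketch (hypothetical helper; values copied from the "host=" lines above) splits each value into its server list and algorithm and prints the first server, assumed here to be the preferred one. node01 and node03 list 172.17.1.141 (acs-mn01) first and node02 and node04 list 172.17.1.142 (acs-mn02) first, which agrees with the host/mshost query output in this thread.

```python
def parse_host_property(value):
    """Split an agent.properties 'host' value such as
    '172.17.1.141,172.17.1.142@roundrobin' into (servers, algorithm)."""
    servers_part, sep, algorithm = value.partition("@")
    servers = [s.strip() for s in servers_part.split(",") if s.strip()]
    # No '@' suffix means no algorithm was recorded in the file.
    return servers, (algorithm if sep else None)


observed = {
    "test-ceph-node01": "172.17.1.141,172.17.1.142@roundrobin",
    "test-ceph-node02": "172.17.1.142,172.17.1.141@roundrobin",
    "test-ceph-node03": "172.17.1.141,172.17.1.142@roundrobin",
    "test-ceph-node04": "172.17.1.142,172.17.1.141@roundrobin",
}

for node, value in observed.items():
    servers, algorithm = parse_host_property(value)
    print(f"{node}: preferred={servers[0]} fallback={servers[1:]} algorithm={algorithm}")
```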

From: Nicolas Vazquez<ma...@shapeblue.com>
Sent: July 18, 2019 10:56
To: users@cloudstack.apache.org<ma...@cloudstack.apache.org>; dev@cloudstack.apache.org<ma...@cloudstack.apache.org>
Subject: Re: Agent LB for CloudStack failed

Hi Jerry,

I'd like to request some additional information. Can you provide the value of the 'host' property stored in agent.properties on each KVM host? I suspect that the global setting has not been propagated to the agents, since the agent is trying to reconnect to the same management server instead of connecting to the next one once it is down.
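The behaviour expected here, moving on to the next management server once the current one is down, can be sketched in Python as follows. This is an illustrative assumption, not CloudStack's code: after a failure the agent walks its list starting with the entry after the failed server and tries the failed one last. In the agent logs in this thread, the agent instead first retries 172.17.1.142 and only then connects to 172.17.1.141.

```python
def failover_order(host_list, failed):
    """Hypothetical order in which to try servers after 'failed' goes down:
    the entries after it first (wrapping around), the failed server last."""
    if failed not in host_list:
        return list(host_list)
    i = host_list.index(failed)
    return host_list[i + 1:] + host_list[:i + 1]


# 'host=' list of test-ceph-node02/04, with 172.17.1.142 powered off
print(failover_order(["172.17.1.142", "172.17.1.141"], "172.17.1.142"))
```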


Regards,

Nicolas Vazquez

________________________________
From: li jerry <di...@hotmail.com>
Sent: Monday, July 15, 2019 10:20 PM
To: users@cloudstack.apache.org <us...@cloudstack.apache.org>; dev@cloudstack.apache.org <de...@cloudstack.apache.org>
Subject: Agent LB for CloudStack failed

Hello everyone

My KVM agent LB on 4.11.2/4.11.3 fails. When the preferred management node is forcibly powered off, the agent does not immediately connect to the second management node. Only after about 15 minutes does the agent log a "No route to host" error and connect to the second management node.

Management nodes:
acs-mn01,172.17.1.141
acs-mn02,172.17.1.142

MySQL DB node:
acs-db01

KVM agent nodes:
test-ceph-node01
test-ceph-node02
test-ceph-node03
test-ceph-node04


Global settings:

host=172.17.1.142,172.17.1.141
indirect.agent.lb.algorithm=roundrobin
indirect.agent.lb.check.interval=60


Partial agent logs:

2019-07-15 23:22:39,340 DEBUG [cloud.agent.Agent] (UgentTask-5:null) (logid:) Sending ping: Seq 1-19: { Cmd , MgmtId: -1, via: 1, Ver : v1, Flags: 11, [{"com.cloud.agent.api.PingRoutingWithNwGroupsCommand":{"newGroupStates":{},"_hostVmStateReport":{},"_gatewayAccessible":true,"_vnetAccessible":true,"hostType ":"Routing","hostId":1,"wait":0}}] }
2019-07-15 23:23:09,960 DEBUG [utils.nio.NioConnection] (Agent-NioConnectionHandler-1:null) (logid:) Location 1: Socket Socket[addr=/172.17.1.142,port=8250,localport= 34854] closed on read. Probably -1 returned: No route to host
2019-07-15 23:23:09,960 DEBUG [utils.nio.NioConnection] (Agent-NioConnectionHandler-1:null) (logid:) Closing socket Socket[addr=/172.17.1.142,port=8250,localport=34854]
2019-07-15 23:23:09,961 DEBUG [cloud.agent.Agent] (Agent-Handler-4:null) (logid:a4e4de49) Clearing watch list: 2
2019-07-15 23:23:09,962 INFO [cloud.agent.Agent] (Agent-Handler-4:null) (logid:a4e4de49) Lost connection to host: 172.17.1.142. Attempting reconnection while we still have 0 commands in Progress.
2019-07-15 23:23:09,963 INFO [utils.nio.NioClient] (Agent-Handler-4:null) (logid:a4e4de49) NioClient connection closed
2019-07-15 23:23:09,964 INFO [cloud.agent.Agent] (Agent-Handler-4:null) (logid:a4e4de49) Reconnecting to host:172.17.1.142
2019-07-15 23:23:09,964 INFO [utils.nio.NioClient] (Agent-Handler-4:null) (logid:a4e4de49) Connecting to 172.17.1.142:8250
2019-07-15 23:23:12,972 ERROR [utils.nio.NioConnection] (Agent-Handler-4:null) (logid:a4e4de49) Unable to initialize the threads.
java.net.NoRouteToHostException: No route to host
      at sun.nio.ch.Net.connect0(Native Method)
      at sun.nio.ch.Net.connect(Net.java:454)
      at sun.nio.ch.Net.connect(Net.java:446)
      at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:648)
      at com.cloud.utils.nio.NioClient.init(NioClient.java:56)
      at com.cloud.utils.nio.NioConnection.start(NioConnection.java:95)
      at com.cloud.agent.Agent.reconnect(Agent.java:517)
      at com.cloud.agent.Agent$ServerHandler.doTask(Agent.java:1091)
      at com.clo


Re: Agent LB for CloudStack failed

Posted by Nicolas Vazquez <Ni...@shapeblue.com>.
Ok, I'll try replicating and get back to you.


Regards,

Nicolas Vazquez

________________________________
From: li jerry <di...@hotmail.com>
Sent: Thursday, July 18, 2019 4:41 AM
To: users@cloudstack.apache.org <us...@cloudstack.apache.org>; dev@cloudstack.apache.org <de...@cloudstack.apache.org>
Subject: Re: Agent LB for CloudStack failed


I added host.lb.check.interval=0 to agent.properties on all hosts and restarted cloudstack-agent.


The following is the connection status of the agents after the restart:

mysql> select host.id ,host.name,host.mgmt_server_id,host.status,mshost.name from host,mshost where host.mgmt_server_id=mshost.msid;
+----+------------------------------------+----------------+--------+----------+
| id | name                               | mgmt_server_id | status | name     |
+----+------------------------------------+----------------+--------+----------+
|  1 | test-ceph-node01.cs2cloud.internal |  2200502468634 | Up     | acs-mn01 |
|  3 | s-8-VM                             |  2200502468634 | Up     | acs-mn01 |
|  5 | test-ceph-node03.cs2cloud.internal |  2200502468634 | Up     | acs-mn01 |
|  2 | v-9-VM                             |  2199950196764 | Up     | acs-mn02 |
|  4 | test-ceph-node02.cs2cloud.internal |  2199950196764 | Up     | acs-mn02 |
|  6 | test-ceph-node04.cs2cloud.internal |  2199950196764 | Up     | acs-mn02 |
+----+------------------------------------+----------------+--------+----------+
6 rows in set (0.00 sec)

2019-07-18 15:10: forcibly powered off acs-mn02.

wait....................................

About 15 minutes later (2019-07-18 15:26:23), the agent detected that the management node had failed and began to switch over.
So adding host.lb.check.interval=0 to agent.properties does not solve the problem.

Below is the log:




2019-07-18 15:26:23,414 DEBUG [utils.nio.NioConnection] (Agent-NioConnectionHandler-1:null) (logid:) Location 1: Socket Socket[addr=/172.17.1.142,port=8250,localport=33190] closed on read.  Probably -1 returned: No route to host
2019-07-18 15:26:23,416 DEBUG [utils.nio.NioConnection] (Agent-NioConnectionHandler-1:null) (logid:) Closing socket Socket[addr=/172.17.1.142,port=8250,localport=33190]
2019-07-18 15:26:23,417 DEBUG [cloud.agent.Agent] (Agent-Handler-2:null) (logid:) Clearing watch list: 2
2019-07-18 15:26:23,417 INFO  [cloud.agent.Agent] (Agent-Handler-2:null) (logid:) Lost connection to host: 172.17.1.142. Attempting reconnection while we still have 0 commands in progress.
2019-07-18 15:26:23,420 INFO  [utils.nio.NioClient] (Agent-Handler-2:null) (logid:) NioClient connection closed
2019-07-18 15:26:23,420 INFO  [cloud.agent.Agent] (Agent-Handler-2:null) (logid:) Reconnecting to host:172.17.1.142
2019-07-18 15:26:23,420 INFO  [utils.nio.NioClient] (Agent-Handler-2:null) (logid:) Connecting to 172.17.1.142:8250
2019-07-18 15:26:26,427 ERROR [utils.nio.NioConnection] (Agent-Handler-2:null) (logid:) Unable to initialize the threads.
java.net.NoRouteToHostException: No route to host
      at sun.nio.ch.Net.connect0(Native Method)
      at sun.nio.ch.Net.connect(Net.java:454)
      at sun.nio.ch.Net.connect(Net.java:446)
      at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:648)
      at com.cloud.utils.nio.NioClient.init(NioClient.java:56)
      at com.cloud.utils.nio.NioConnection.start(NioConnection.java:95)
      at com.cloud.agent.Agent.reconnect(Agent.java:517)
      at com.cloud.agent.Agent$ServerHandler.doTask(Agent.java:1091)
      at com.cloud.utils.nio.Task.call(Task.java:83)
      at com.cloud.utils.nio.Task.call(Task.java:29)
      at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      at java.lang.Thread.run(Thread.java:748)
2019-07-18 15:26:26,432 INFO  [utils.exception.CSExceptionErrorCode] (Agent-Handler-2:null) (logid:) Could not find exception: com.cloud.utils.exception.NioConnectionException in error code list for exceptions
2019-07-18 15:26:26,432 WARN  [cloud.agent.Agent] (Agent-Handler-2:null) (logid:) NIO Connection Exception  com.cloud.utils.exception.NioConnectionException: No route to host
2019-07-18 15:26:26,432 INFO  [cloud.agent.Agent] (Agent-Handler-2:null) (logid:) Attempted to connect to the server, but received an unexpected exception, trying again...
2019-07-18 15:26:26,432 INFO  [utils.nio.NioClient] (Agent-Handler-2:null) (logid:) NioClient connection closed
2019-07-18 15:26:31,433 INFO  [cloud.agent.Agent] (Agent-Handler-2:null) (logid:) Reconnecting to host:172.17.1.141
2019-07-18 15:26:31,434 INFO  [utils.nio.NioClient] (Agent-Handler-2:null) (logid:) Connecting to 172.17.1.141:8250
2019-07-18 15:26:31,435 INFO  [utils.nio.Link] (Agent-Handler-2:null) (logid:) Conf file found: /etc/cloudstack/agent/agent.properties
2019-07-18 15:26:31,545 INFO  [utils.nio.NioClient] (Agent-Handler-2:null) (logid:) SSL: Handshake done
2019-07-18 15:26:31,546 INFO  [utils.nio.NioClient] (Agent-Handler-2:null) (logid:) Connected to 172.17.1.141:8250
2019-07-18 15:26:31,564 DEBUG [kvm.resource.LibvirtConnection] (Agent-Handler-1:null) (logid:) Looking for libvirtd connection at: qemu:///system

From: Nicolas Vazquez<ma...@shapeblue.com>
Sent: July 18, 2019 12:48
To: dev@cloudstack.apache.org<ma...@cloudstack.apache.org>; users@cloudstack.apache.org<ma...@cloudstack.apache.org>
Subject: Re: Agent LB for CloudStack failed

Thanks,

I suspect the culprit is the background task trying to reconnect to the preferred host (which runs every 60 seconds).
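To make the suspected mechanism concrete, here is a toy model of one tick of a periodic preferred-host check. It is a hypothetical sketch, not CloudStack's code. It assumes that the first entry of the agent's `host` list is the preferred server, that a tick on a non-preferred server triggers a move back, and that an interval of 0 means the task is never scheduled, as suggested in this reply. The name `preferred_host_check` is made up for illustration.

```python
from typing import List, Optional


def preferred_host_check(current: str, hosts: List[str], interval: int) -> Optional[str]:
    """Hypothetical model of one tick of a background preferred-host check.

    Returns the server the agent would try to move back to, or None.
    """
    if interval <= 0 or not hosts:
        return None  # assumption: interval 0 disables the task entirely
    preferred = hosts[0]  # assumption: the first list entry is the preferred server
    if current != preferred:
        return preferred  # on a fallback server: try to go back
    return None  # already on the preferred server


if __name__ == "__main__":
    hosts = ["172.17.1.142", "172.17.1.141"]  # node02/node04 list from this thread
    print(preferred_host_check("172.17.1.141", hosts, 60))  # 172.17.1.142
    print(preferred_host_check("172.17.1.141", hosts, 0))   # None
```

In this model, while the preferred server is down, every tick steers the agent back towards an unreachable address, which is the kind of interference the reply suspects.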

I would suggest disabling the background task by setting the interval to 0. As you do not want to change your 'host' global configuration to propagate a new list to the agents, you should do it this way:

- Add this line to agent.properties: host.lb.check.interval=0
- Restart the agent
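Applying the first step on several hosts can be scripted. A minimal sketch, assuming plain `key=value` lines as in the agent.properties files quoted in this thread; `set_property` is a hypothetical helper that replaces an existing entry or appends a new one, so running it twice is harmless.

```python
import re


def set_property(text: str, key: str, value: str) -> str:
    """Set key=value in properties-style text, replacing any existing entry."""
    # [ \t]* rather than \s* so the match never swallows a preceding newline.
    pattern = re.compile(r"^[ \t]*" + re.escape(key) + r"[ \t]*[=:].*$", re.MULTILINE)
    line = f"{key}={value}"
    if pattern.search(text):
        # A function replacement keeps backslashes in `value` from being interpreted.
        return pattern.sub(lambda _m: line, text, count=1)
    if text and not text.endswith("\n"):
        text += "\n"
    return text + line + "\n"


if __name__ == "__main__":
    sample = "port=8250\nhost=172.17.1.141,172.17.1.142@roundrobin\n"
    print(set_property(sample, "host.lb.check.interval", "0"), end="")
```

On a real host, read /etc/cloudstack/agent/agent.properties, pass its contents through `set_property`, write the result back, and then restart the agent (for example with `systemctl restart cloudstack-agent` on a systemd-based host).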

Please let me know if this fixes your issue.


Regards,

Nicolas Vazquez

________________________________
From: li jerry <di...@hotmail.com>
Sent: Thursday, July 18, 2019 12:00 AM
To: dev@cloudstack.apache.org <de...@cloudstack.apache.org>; users@cloudstack.apache.org <us...@cloudstack.apache.org>
Subject: Re: Agent LB for CloudStack failed

Hi Nicolas

test-ceph-node01

[root@test-ceph-node01 ~]# cat /etc/cloudstack/agent/agent.properties
#Storage
#Wed Jul 17 10:39:18 CST 2019
workers=5
guest.network.device=br0
private.network.device=br0
port=8250
resource=com.cloud.hypervisor.kvm.resource.LibvirtComputingResource
pod=1
zone=1
hypervisor.type=kvm
guid=88ca642a-e319-3369-b2c9-39c2b2bddc7c
public.network.device=br0
cluster=1
local.storage.uuid=ec28176f-a3db-4383-90c8-6dcdbc45c3e0
keystore.passphrase=O8VdcZqBwWMMxwk2
domr.scripts.dir=scripts/network/domr/kvm
LibvirtComputingResource.id=1
host=172.17.1.141,172.17.1.142@roundrobin

This is test-ceph-node02

[root@test-ceph-node02 ~]# cat /etc/cloudstack/agent/agent.properties
#Storage
#Wed Jul 17 10:58:23 CST 2019
guest.network.device=br0
workers=5
private.network.device=br0
port=8250
resource=com.cloud.hypervisor.kvm.resource.LibvirtComputingResource
pod=1
zone=1
guid=649cbe62-dcac-36ae-a62c-699f0e0b8af1
hypervisor.type=kvm
cluster=1
public.network.device=br0
local.storage.uuid=2fc2f796-0614-40cf-bfdf-37a9429520fb
domr.scripts.dir=scripts/network/domr/kvm
keystore.passphrase=vB48rgCk58vNJC6N
host=172.17.1.142,172.17.1.141@roundrobin
LibvirtComputingResource.id=4

test-ceph-node03

[root@test-ceph-node03 ~]# cat /etc/cloudstack/agent/agent.properties
#Storage
#Wed Jul 17 10:39:18 CST 2019
guest.network.device=br0
workers=5
private.network.device=br0
port=8250
resource=com.cloud.hypervisor.kvm.resource.LibvirtComputingResource
pod=1
zone=1
hypervisor.type=kvm
guid=4d3742c4-8678-3f21-a841-c1ffa32d0a8d
public.network.device=br0
cluster=1
local.storage.uuid=31ee15cf-b3b2-4387-b081-7c47971b9e68
keystore.passphrase=ACgs24DnBgYkORvh
domr.scripts.dir=scripts/network/domr/kvm
LibvirtComputingResource.id=5
host=172.17.1.141,172.17.1.142@roundrobin

test-ceph-node04

[root@test-ceph-node04 ~]# cat /etc/cloudstack/agent/agent.properties
#Storage
#Wed Jul 17 10:58:22 CST 2019
guest.network.device=br0
workers=5
private.network.device=br0
port=8250
resource=com.cloud.hypervisor.kvm.resource.LibvirtComputingResource
pod=1
zone=1
hypervisor.type=kvm
guid=bfd4b7ba-fd5f-365d-b4d8-a6e8e7c78c0c
public.network.device=br0
cluster=1
local.storage.uuid=2d5004ff-37b1-4f66-bff0-e71ac211f1da
keystore.passphrase=r3D4upcAOdWbwE9p
domr.scripts.dir=scripts/network/domr/kvm
LibvirtComputingResource.id=6
host=172.17.1.142,172.17.1.141@roundrobin
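Nicolas's question below is whether the global `host` setting reached the agents. A quick check is sketched here, assuming the `servers@algorithm` format visible in these files: parse each agent's value and test whether its server list is a rotation of the global list `172.17.1.142,172.17.1.141` given in the original post. The helper names are illustrative, not CloudStack functions.

```python
from typing import Dict, List, Tuple


def parse_host_property(value: str) -> Tuple[List[str], str]:
    """Split e.g. '172.17.1.141,172.17.1.142@roundrobin' into (servers, algorithm)."""
    servers_part, _, algorithm = value.partition("@")
    servers = [s.strip() for s in servers_part.split(",") if s.strip()]
    return servers, algorithm


def is_rotation(candidate: List[str], reference: List[str]) -> bool:
    """True if candidate is a cyclic rotation of reference."""
    n = len(reference)
    if len(candidate) != n:
        return False
    doubled = reference + reference
    return n == 0 or any(doubled[i:i + n] == candidate for i in range(n))


GLOBAL_HOSTS = ["172.17.1.142", "172.17.1.141"]  # global 'host' setting in this thread
AGENT_HOST_VALUES: Dict[str, str] = {  # 'host=' lines from the files above
    "test-ceph-node01": "172.17.1.141,172.17.1.142@roundrobin",
    "test-ceph-node02": "172.17.1.142,172.17.1.141@roundrobin",
    "test-ceph-node03": "172.17.1.141,172.17.1.142@roundrobin",
    "test-ceph-node04": "172.17.1.142,172.17.1.141@roundrobin",
}

if __name__ == "__main__":
    for name, value in AGENT_HOST_VALUES.items():
        servers, algorithm = parse_host_property(value)
        print(name, servers, algorithm, is_rotation(servers, GLOBAL_HOSTS))
```

With only two servers every ordering is a rotation, so here the check mainly confirms that both addresses and the `roundrobin` suffix are present on every agent.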

From: Nicolas Vazquez<ma...@shapeblue.com>
Sent: July 18, 2019 10:56
To: users@cloudstack.apache.org<ma...@cloudstack.apache.org>; dev@cloudstack.apache.org<ma...@cloudstack.apache.org>
Subject: Re: Agent LB for CloudStack failed

Hi Jerry,

I'll request some additional information. Can you provide me with the value stored on agent.properties for 'host' property on each KVM host? I suspect that the global setting has not been propagated to the agents, as it is trying to reconnect instead of connecting to the next management server once it is down.


Regards,

Nicolas Vazquez

________________________________
From: li jerry <di...@hotmail.com>
Sent: Monday, July 15, 2019 10:20 PM
To: users@cloudstack.apache.org <us...@cloudstack.apache.org>; dev@cloudstack.apache.org <de...@cloudstack.apache.org>
Subject: Agent LB for CloudStack failed

Hello everyone

KVM agent LB fails for me on CloudStack 4.11.2/4.11.3. When the preferred management node is forcibly powered off, the agent does not immediately connect to the second management node. Only after about 15 minutes does the agent report a "No route to host" error and connect to the second management node.
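For reference, the behaviour expected here is a simple ordered failover. This is a hypothetical sketch, not the agent's implementation: `try_connect` stands in for a real connection attempt, and the loop retries the current server once before moving to the next entry in the list, which is the order the later agent log at 15:26 shows (172.17.1.142 first, then 172.17.1.141).

```python
from typing import Callable, List, Optional


def connect_with_failover(hosts: List[str], current: int,
                          try_connect: Callable[[str], bool]) -> Optional[str]:
    """Try hosts[current] first, then the remaining entries in list order (wrapping)."""
    n = len(hosts)
    for step in range(n):
        candidate = hosts[(current + step) % n]
        if try_connect(candidate):
            return candidate
    return None  # nothing reachable; a real agent would wait and loop again


if __name__ == "__main__":
    reachable = {"172.17.1.141"}  # acs-mn02 (172.17.1.142) is powered off
    attempts: List[str] = []

    def fake_connect(host: str) -> bool:
        attempts.append(host)
        return host in reachable

    chosen = connect_with_failover(["172.17.1.142", "172.17.1.141"], 0, fake_connect)
    print(chosen, attempts)  # 172.17.1.141 ['172.17.1.142', '172.17.1.141']
```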

Management nodes:
acs-mn01,172.17.1.141
acs-mn02,172.17.1.142

MySQL DB node:
acs-db01

KVM agent nodes:
test-ceph-node01
test-ceph-node02
test-ceph-node03
test-ceph-node04


Global settings:

host=172.17.1.142,172.17.1.141
indirect.agent.lb.algorithm=roundrobin
indirect.agent.lb.check.interval=60
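How `roundrobin` turns this global list into the per-agent lists is not spelled out in the thread. One rule that reproduces all four `host=` values quoted earlier is to rotate the global list left by the host id modulo the number of servers (ids 1, 4, 5 and 6, from `LibvirtComputingResource.id` and the host table later in the thread). The sketch below is only a guess consistent with that data, not CloudStack's documented algorithm; `rotate_for_host` is a hypothetical name.

```python
from typing import List


def rotate_for_host(ms_list: List[str], host_id: int) -> List[str]:
    """Guessed round-robin rule: rotate the global list left by host_id mod len(list)."""
    if not ms_list:
        return []
    k = host_id % len(ms_list)
    return ms_list[k:] + ms_list[:k]


GLOBAL_HOSTS = ["172.17.1.142", "172.17.1.141"]  # host=172.17.1.142,172.17.1.141

if __name__ == "__main__":
    for name, host_id in [("test-ceph-node01", 1), ("test-ceph-node02", 4),
                          ("test-ceph-node03", 5), ("test-ceph-node04", 6)]:
        print(name, ",".join(rotate_for_host(GLOBAL_HOSTS, host_id)) + "@roundrobin")
```

With two servers this amounts to alternating by odd/even id, so the data cannot distinguish it from other plausible rules.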


Partial agent logs:

2019-07-15 23:22:39,340 DEBUG [cloud.agent.Agent] (UgentTask-5:null) (logid:) Sending ping: Seq 1-19: { Cmd , MgmtId: -1, via: 1, Ver : v1, Flags: 11, [{"com.cloud.agent.api.PingRoutingWithNwGroupsCommand":{"newGroupStates":{},"_hostVmStateReport":{},"_gatewayAccessible":true,"_vnetAccessible":true,"hostType ":"Routing","hostId":1,"wait":0}}] }
2019-07-15 23:23:09,960 DEBUG [utils.nio.NioConnection] (Agent-NioConnectionHandler-1:null) (logid:) Location 1: Socket Socket[addr=/172.17.1.142,port=8250,localport= 34854] closed on read. Probably -1 returned: No route to host
2019-07-15 23:23:09,960 DEBUG [utils.nio.NioConnection] (Agent-NioConnectionHandler-1:null) (logid:) Closing socket Socket[addr=/172.17.1.142,port=8250,localport=34854]
2019-07-15 23:23:09,961 DEBUG [cloud.agent.Agent] (Agent-Handler-4:null) (logid:a4e4de49) Clearing watch list: 2
2019-07-15 23:23:09,962 INFO [cloud.agent.Agent] (Agent-Handler-4:null) (logid:a4e4de49) Lost connection to host: 172.17.1.142. Attempting reconnection while we still have 0 commands in Progress.
2019-07-15 23:23:09,963 INFO [utils.nio.NioClient] (Agent-Handler-4:null) (logid:a4e4de49) NioClient connection closed
2019-07-15 23:23:09,964 INFO [cloud.agent.Agent] (Agent-Handler-4:null) (logid:a4e4de49) Reconnecting to host:172.17.1.142
2019-07-15 23:23:09,964 INFO [utils.nio.NioClient] (Agent-Handler-4:null) (logid:a4e4de49) Connecting to 172.17.1.142:8250
2019-07-15 23:23:12,972 ERROR [utils.nio.NioConnection] (Agent-Handler-4:null) (logid:a4e4de49) Unable to initialize the threads.
java.net.NoRouteToHostException: No route to host
      at sun.nio.ch.Net.connect0(Native Method)
      at sun.nio.ch.Net.connect(Net.java:454)
      at sun.nio.ch.Net.connect(Net.java:446)
      at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:648)
      at com.cloud.utils.nio.NioClient.init(NioClient.java:56)
      at com.cloud.utils.nio.NioConnection.start(NioConnection.java:95)
      at com.cloud.agent.Agent.reconnect(Agent.java:517)
      at com.cloud.agent.Agent$ServerHandler.doTask(Agent.java:1091)
      at com.clo

Nicolas.Vazquez@shapeblue.com
www.shapeblue.com<http://www.shapeblue.com>
Amadeus House, Floral Street, London WC2E 9DP, UK
@shapeblue


Re: Agent LB for CloudStack failed

Posted by Nicolas Vazquez <Ni...@shapeblue.com>.
Ok, I'll try replicating and get back to you.


Regards,

Nicolas Vazquez

________________________________
From: li jerry <di...@hotmail.com>
Sent: Thursday, July 18, 2019 4:41 AM
To: users@cloudstack.apache.org <us...@cloudstack.apache.org>; dev@cloudstack.apache.org <de...@cloudstack.apache.org>
Subject: Re: Agent LB for CloudStack failed


I added host.lb.check.interval=0 to agent.properties on all agents and restarted cloudstack-agent.


The following is the connection status of the agents after the restart:

mysql> select host.id ,host.name,host.mgmt_server_id,host.status,mshost.name from host,mshost where host.mgmt_server_id=mshost.msid;
+----+------------------------------------+----------------+--------+----------+
| id | name                               | mgmt_server_id | status | name     |
+----+------------------------------------+----------------+--------+----------+
|  1 | test-ceph-node01.cs2cloud.internal |  2200502468634 | Up     | acs-mn01 |
|  3 | s-8-VM                             |  2200502468634 | Up     | acs-mn01 |
|  5 | test-ceph-node03.cs2cloud.internal |  2200502468634 | Up     | acs-mn01 |
|  2 | v-9-VM                             |  2199950196764 | Up     | acs-mn02 |
|  4 | test-ceph-node02.cs2cloud.internal |  2199950196764 | Up     | acs-mn02 |
|  6 | test-ceph-node04.cs2cloud.internal |  2199950196764 | Up     | acs-mn02 |
+----+------------------------------------+----------------+--------+----------+
6 rows in set (0.00 sec)
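The same query can be written with an explicit JOIN and extended to list the agents that were attached to acs-mn02 and therefore had to fail over when it was powered off. To stay self-contained, the sketch runs against an in-memory SQLite database holding only the columns and rows shown above; these two tables are simplified stand-ins, not the real CloudStack schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-ins with only the columns used by the query above.
conn.execute("CREATE TABLE mshost (msid INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE host (id INTEGER PRIMARY KEY, name TEXT,"
             " mgmt_server_id INTEGER, status TEXT)")
conn.executemany("INSERT INTO mshost VALUES (?, ?)",
                 [(2200502468634, "acs-mn01"), (2199950196764, "acs-mn02")])
conn.executemany("INSERT INTO host VALUES (?, ?, ?, ?)", [
    (1, "test-ceph-node01.cs2cloud.internal", 2200502468634, "Up"),
    (3, "s-8-VM", 2200502468634, "Up"),
    (5, "test-ceph-node03.cs2cloud.internal", 2200502468634, "Up"),
    (2, "v-9-VM", 2199950196764, "Up"),
    (4, "test-ceph-node02.cs2cloud.internal", 2199950196764, "Up"),
    (6, "test-ceph-node04.cs2cloud.internal", 2199950196764, "Up"),
])

# Number of connected hosts per management server.
per_ms = conn.execute("""
    SELECT ms.name, COUNT(*) FROM host AS h
    JOIN mshost AS ms ON h.mgmt_server_id = ms.msid
    GROUP BY ms.name ORDER BY ms.name
""").fetchall()

# Hosts that depend on acs-mn02 and must reconnect when it goes down.
affected = conn.execute("""
    SELECT h.id, h.name FROM host AS h
    JOIN mshost AS ms ON h.mgmt_server_id = ms.msid
    WHERE ms.name = ? ORDER BY h.id
""", ("acs-mn02",)).fetchall()

print(per_ms)    # [('acs-mn01', 3), ('acs-mn02', 3)]
print(affected)
```

On the real database the same SELECT statements can be run from the mysql client with the literal 'acs-mn02' in place of the `?` placeholder.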

2019-07-18 15:10: forced power-off of acs-mn02.

Waiting...

Only after more than 15 minutes (at 2019-07-18 15:26:23) did the agent detect that the management node had failed and begin to switch over.
So adding host.lb.check.interval=0 to agent.properties does not solve the problem.
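The delay can be bounded from the two timestamps in this message. Because the power-off time is given only to the minute, the sketch assumes it happened somewhere between 15:10:00 and 15:10:59 and takes the first "closed on read" log line (15:26:23,414) as the moment of detection.

```python
from datetime import datetime

detected = datetime(2019, 7, 18, 15, 26, 23, 414000)  # first "closed on read" line
earliest_off = datetime(2019, 7, 18, 15, 10, 0)        # power-off reported as "15:10"
latest_off = datetime(2019, 7, 18, 15, 10, 59)

shortest = detected - latest_off
longest = detected - earliest_off
print(shortest, longest)  # 0:15:24.414000 0:16:23.414000
```

Either way, the agent needed well over 15 minutes to notice the failure, which matches the roughly 15 minutes reported in the first message of the thread.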

Below is the log:




2019-07-18 15:26:23,414 DEBUG [utils.nio.NioConnection] (Agent-NioConnectionHandler-1:null) (logid:) Location 1: Socket Socket[addr=/172.17.1.142,port=8250,localport=33190] closed on read.  Probably -1 returned: No route to host
2019-07-18 15:26:23,416 DEBUG [utils.nio.NioConnection] (Agent-NioConnectionHandler-1:null) (logid:) Closing socket Socket[addr=/172.17.1.142,port=8250,localport=33190]
2019-07-18 15:26:23,417 DEBUG [cloud.agent.Agent] (Agent-Handler-2:null) (logid:) Clearing watch list: 2
2019-07-18 15:26:23,417 INFO  [cloud.agent.Agent] (Agent-Handler-2:null) (logid:) Lost connection to host: 172.17.1.142. Attempting reconnection while we still have 0 commands in progress.
2019-07-18 15:26:23,420 INFO  [utils.nio.NioClient] (Agent-Handler-2:null) (logid:) NioClient connection closed
2019-07-18 15:26:23,420 INFO  [cloud.agent.Agent] (Agent-Handler-2:null) (logid:) Reconnecting to host:172.17.1.142
2019-07-18 15:26:23,420 INFO  [utils.nio.NioClient] (Agent-Handler-2:null) (logid:) Connecting to 172.17.1.142:8250
2019-07-18 15:26:26,427 ERROR [utils.nio.NioConnection] (Agent-Handler-2:null) (logid:) Unable to initialize the threads.
java.net.NoRouteToHostException: No route to host
      at sun.nio.ch.Net.connect0(Native Method)
      at sun.nio.ch.Net.connect(Net.java:454)
      at sun.nio.ch.Net.connect(Net.java:446)
      at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:648)
      at com.cloud.utils.nio.NioClient.init(NioClient.java:56)
      at com.cloud.utils.nio.NioConnection.start(NioConnection.java:95)
      at com.cloud.agent.Agent.reconnect(Agent.java:517)
      at com.cloud.agent.Agent$ServerHandler.doTask(Agent.java:1091)
      at com.cloud.utils.nio.Task.call(Task.java:83)
      at com.cloud.utils.nio.Task.call(Task.java:29)
      at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      at java.lang.Thread.run(Thread.java:748)
2019-07-18 15:26:26,432 INFO  [utils.exception.CSExceptionErrorCode] (Agent-Handler-2:null) (logid:) Could not find exception: com.cloud.utils.exception.NioConnectionException in error code list for exceptions
2019-07-18 15:26:26,432 WARN  [cloud.agent.Agent] (Agent-Handler-2:null) (logid:) NIO Connection Exception  com.cloud.utils.exception.NioConnectionException: No route to host
2019-07-18 15:26:26,432 INFO  [cloud.agent.Agent] (Agent-Handler-2:null) (logid:) Attempted to connect to the server, but received an unexpected exception, trying again...
2019-07-18 15:26:26,432 INFO  [utils.nio.NioClient] (Agent-Handler-2:null) (logid:) NioClient connection closed
2019-07-18 15:26:31,433 INFO  [cloud.agent.Agent] (Agent-Handler-2:null) (logid:) Reconnecting to host:172.17.1.141
2019-07-18 15:26:31,434 INFO  [utils.nio.NioClient] (Agent-Handler-2:null) (logid:) Connecting to 172.17.1.141:8250
2019-07-18 15:26:31,435 INFO  [utils.nio.Link] (Agent-Handler-2:null) (logid:) Conf file found: /etc/cloudstack/agent/agent.properties
2019-07-18 15:26:31,545 INFO  [utils.nio.NioClient] (Agent-Handler-2:null) (logid:) SSL: Handshake done
2019-07-18 15:26:31,546 INFO  [utils.nio.NioClient] (Agent-Handler-2:null) (logid:) Connected to 172.17.1.141:8250
2019-07-18 15:26:31,564 DEBUG [kvm.resource.LibvirtConnection] (Agent-Handler-1:null) (logid:) Looking for libvirtd connection at: qemu:///system
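Once the broken connection was noticed, the switchover itself was quick. The sketch below measures it from a few lines copied verbatim from the log above; `first_time` and `connect_targets` are hypothetical helpers that assume every line starts with a `YYYY-MM-DD HH:MM:SS,mmm` timestamp, as these do.

```python
from datetime import datetime
from typing import List

# A few lines copied from the agent log above.
LOG = """\
2019-07-18 15:26:23,417 INFO  [cloud.agent.Agent] (Agent-Handler-2:null) (logid:) Lost connection to host: 172.17.1.142. Attempting reconnection while we still have 0 commands in progress.
2019-07-18 15:26:23,420 INFO  [utils.nio.NioClient] (Agent-Handler-2:null) (logid:) Connecting to 172.17.1.142:8250
2019-07-18 15:26:31,434 INFO  [utils.nio.NioClient] (Agent-Handler-2:null) (logid:) Connecting to 172.17.1.141:8250
2019-07-18 15:26:31,546 INFO  [utils.nio.NioClient] (Agent-Handler-2:null) (logid:) Connected to 172.17.1.141:8250
"""


def first_time(log: str, needle: str) -> datetime:
    """Timestamp of the first line containing `needle`."""
    for line in log.splitlines():
        if needle in line:
            # The first 23 characters are e.g. '2019-07-18 15:26:23,417'.
            return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f")
    raise ValueError(f"no line contains {needle!r}")


def connect_targets(log: str) -> List[str]:
    """Servers the agent tried, in order, taken from the 'Connecting to' lines."""
    return [line.split("Connecting to ", 1)[1]
            for line in log.splitlines() if "Connecting to " in line]


if __name__ == "__main__":
    lost = first_time(LOG, "Lost connection to host")
    connected = first_time(LOG, "Connected to ")
    print(connected - lost)      # 0:00:08.129000
    print(connect_targets(LOG))  # ['172.17.1.142:8250', '172.17.1.141:8250']
```

About 8 s from detection to reconnection (roughly 3 s for the failed attempt to 172.17.1.142 plus a 5 s pause before trying 172.17.1.141) suggests that the 15-minute gap is spent before the agent notices the dead connection, rather than in the reconnect loop itself.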



Re: Agent LB for CloudStack failed

Posted by li jerry <di...@hotmail.com>.
I added host.lb.check.interval=0 to agent.properties on all agents and restarted cloudstack-agent.


The following is the connection status of the agents after the restart:

mysql> select host.id ,host.name,host.mgmt_server_id,host.status,mshost.name from host,mshost where host.mgmt_server_id=mshost.msid;
+----+------------------------------------+----------------+--------+----------+
| id | name                               | mgmt_server_id | status | name     |
+----+------------------------------------+----------------+--------+----------+
|  1 | test-ceph-node01.cs2cloud.internal |  2200502468634 | Up     | acs-mn01 |
|  3 | s-8-VM                             |  2200502468634 | Up     | acs-mn01 |
|  5 | test-ceph-node03.cs2cloud.internal |  2200502468634 | Up     | acs-mn01 |
|  2 | v-9-VM                             |  2199950196764 | Up     | acs-mn02 |
|  4 | test-ceph-node02.cs2cloud.internal |  2199950196764 | Up     | acs-mn02 |
|  6 | test-ceph-node04.cs2cloud.internal |  2199950196764 | Up     | acs-mn02 |
+----+------------------------------------+----------------+--------+----------+
6 rows in set (0.00 sec)

2019-07-18 15:10: forced power-off of acs-mn02.

Waiting...

Only after more than 15 minutes (at 2019-07-18 15:26:23) did the agent detect that the management node had failed and begin to switch over.
So adding host.lb.check.interval=0 to agent.properties does not solve the problem.

Below is the log:




2019-07-18 15:26:23,414 DEBUG [utils.nio.NioConnection] (Agent-NioConnectionHandler-1:null) (logid:) Location 1: Socket Socket[addr=/172.17.1.142,port=8250,localport=33190] closed on read.  Probably -1 returned: No route to host
2019-07-18 15:26:23,416 DEBUG [utils.nio.NioConnection] (Agent-NioConnectionHandler-1:null) (logid:) Closing socket Socket[addr=/172.17.1.142,port=8250,localport=33190]
2019-07-18 15:26:23,417 DEBUG [cloud.agent.Agent] (Agent-Handler-2:null) (logid:) Clearing watch list: 2
2019-07-18 15:26:23,417 INFO  [cloud.agent.Agent] (Agent-Handler-2:null) (logid:) Lost connection to host: 172.17.1.142. Attempting reconnection while we still have 0 commands in progress.
2019-07-18 15:26:23,420 INFO  [utils.nio.NioClient] (Agent-Handler-2:null) (logid:) NioClient connection closed
2019-07-18 15:26:23,420 INFO  [cloud.agent.Agent] (Agent-Handler-2:null) (logid:) Reconnecting to host:172.17.1.142
2019-07-18 15:26:23,420 INFO  [utils.nio.NioClient] (Agent-Handler-2:null) (logid:) Connecting to 172.17.1.142:8250
2019-07-18 15:26:26,427 ERROR [utils.nio.NioConnection] (Agent-Handler-2:null) (logid:) Unable to initialize the threads.
java.net.NoRouteToHostException: No route to host
      at sun.nio.ch.Net.connect0(Native Method)
      at sun.nio.ch.Net.connect(Net.java:454)
      at sun.nio.ch.Net.connect(Net.java:446)
      at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:648)
      at com.cloud.utils.nio.NioClient.init(NioClient.java:56)
      at com.cloud.utils.nio.NioConnection.start(NioConnection.java:95)
      at com.cloud.agent.Agent.reconnect(Agent.java:517)
      at com.cloud.agent.Agent$ServerHandler.doTask(Agent.java:1091)
      at com.cloud.utils.nio.Task.call(Task.java:83)
      at com.cloud.utils.nio.Task.call(Task.java:29)
      at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      at java.lang.Thread.run(Thread.java:748)
2019-07-18 15:26:26,432 INFO  [utils.exception.CSExceptionErrorCode] (Agent-Handler-2:null) (logid:) Could not find exception: com.cloud.utils.exception.NioConnectionException in error code list for exceptions
2019-07-18 15:26:26,432 WARN  [cloud.agent.Agent] (Agent-Handler-2:null) (logid:) NIO Connection Exception  com.cloud.utils.exception.NioConnectionException: No route to host
2019-07-18 15:26:26,432 INFO  [cloud.agent.Agent] (Agent-Handler-2:null) (logid:) Attempted to connect to the server, but received an unexpected exception, trying again...
2019-07-18 15:26:26,432 INFO  [utils.nio.NioClient] (Agent-Handler-2:null) (logid:) NioClient connection closed
2019-07-18 15:26:31,433 INFO  [cloud.agent.Agent] (Agent-Handler-2:null) (logid:) Reconnecting to host:172.17.1.141
2019-07-18 15:26:31,434 INFO  [utils.nio.NioClient] (Agent-Handler-2:null) (logid:) Connecting to 172.17.1.141:8250
2019-07-18 15:26:31,435 INFO  [utils.nio.Link] (Agent-Handler-2:null) (logid:) Conf file found: /etc/cloudstack/agent/agent.properties
2019-07-18 15:26:31,545 INFO  [utils.nio.NioClient] (Agent-Handler-2:null) (logid:) SSL: Handshake done
2019-07-18 15:26:31,546 INFO  [utils.nio.NioClient] (Agent-Handler-2:null) (logid:) Connected to 172.17.1.141:8250
2019-07-18 15:26:31,564 DEBUG [kvm.resource.LibvirtConnection] (Agent-Handler-1:null) (logid:) Looking for libvirtd connection at: qemu:///system



Re: Agent LB for CloudStack failed

Posted by li jerry <di...@hotmail.com>.
I added host.lb.check.interval=0 to agent.properties on all hosts and restarted cloudstack-agent.


The following is the connection status of the agents after the restart:

mysql> select host.id ,host.name,host.mgmt_server_id,host.status,mshost.name from host,mshost where host.mgmt_server_id=mshost.msid;
+----+------------------------------------+----------------+--------+----------+
| id | name                               | mgmt_server_id | status | name     |
+----+------------------------------------+----------------+--------+----------+
|  1 | test-ceph-node01.cs2cloud.internal |  2200502468634 | Up     | acs-mn01 |
|  3 | s-8-VM                             |  2200502468634 | Up     | acs-mn01 |
|  5 | test-ceph-node03.cs2cloud.internal |  2200502468634 | Up     | acs-mn01 |
|  2 | v-9-VM                             |  2199950196764 | Up     | acs-mn02 |
|  4 | test-ceph-node02.cs2cloud.internal |  2199950196764 | Up     | acs-mn02 |
|  6 | test-ceph-node04.cs2cloud.internal |  2199950196764 | Up     | acs-mn02 |
+----+------------------------------------+----------------+--------+----------+
6 rows in set (0.00 sec)

At 2019-07-18 15:10, I forced acs-mn02 to power off.

Waiting....................................

Only after more than 15 minutes (at 2019-07-18 15:26:23) did the agent detect that the management server had failed and start switching over.
So adding host.lb.check.interval=0 to agent.properties does not solve the problem.

Below is the agent log:




2019-07-18 15:26:23,414 DEBUG [utils.nio.NioConnection] (Agent-NioConnectionHandler-1:null) (logid:) Location 1: Socket Socket[addr=/172.17.1.142,port=8250,localport=33190] closed on read.  Probably -1 returned: No route to host
2019-07-18 15:26:23,416 DEBUG [utils.nio.NioConnection] (Agent-NioConnectionHandler-1:null) (logid:) Closing socket Socket[addr=/172.17.1.142,port=8250,localport=33190]
2019-07-18 15:26:23,417 DEBUG [cloud.agent.Agent] (Agent-Handler-2:null) (logid:) Clearing watch list: 2
2019-07-18 15:26:23,417 INFO  [cloud.agent.Agent] (Agent-Handler-2:null) (logid:) Lost connection to host: 172.17.1.142. Attempting reconnection while we still have 0 commands in progress.
2019-07-18 15:26:23,420 INFO  [utils.nio.NioClient] (Agent-Handler-2:null) (logid:) NioClient connection closed
2019-07-18 15:26:23,420 INFO  [cloud.agent.Agent] (Agent-Handler-2:null) (logid:) Reconnecting to host:172.17.1.142
2019-07-18 15:26:23,420 INFO  [utils.nio.NioClient] (Agent-Handler-2:null) (logid:) Connecting to 172.17.1.142:8250
2019-07-18 15:26:26,427 ERROR [utils.nio.NioConnection] (Agent-Handler-2:null) (logid:) Unable to initialize the threads.
java.net.NoRouteToHostException: No route to host
      at sun.nio.ch.Net.connect0(Native Method)
      at sun.nio.ch.Net.connect(Net.java:454)
      at sun.nio.ch.Net.connect(Net.java:446)
      at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:648)
      at com.cloud.utils.nio.NioClient.init(NioClient.java:56)
      at com.cloud.utils.nio.NioConnection.start(NioConnection.java:95)
      at com.cloud.agent.Agent.reconnect(Agent.java:517)
      at com.cloud.agent.Agent$ServerHandler.doTask(Agent.java:1091)
      at com.cloud.utils.nio.Task.call(Task.java:83)
      at com.cloud.utils.nio.Task.call(Task.java:29)
      at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      at java.lang.Thread.run(Thread.java:748)
2019-07-18 15:26:26,432 INFO  [utils.exception.CSExceptionErrorCode] (Agent-Handler-2:null) (logid:) Could not find exception: com.cloud.utils.exception.NioConnectionException in error code list for exceptions
2019-07-18 15:26:26,432 WARN  [cloud.agent.Agent] (Agent-Handler-2:null) (logid:) NIO Connection Exception  com.cloud.utils.exception.NioConnectionException: No route to host
2019-07-18 15:26:26,432 INFO  [cloud.agent.Agent] (Agent-Handler-2:null) (logid:) Attempted to connect to the server, but received an unexpected exception, trying again...
2019-07-18 15:26:26,432 INFO  [utils.nio.NioClient] (Agent-Handler-2:null) (logid:) NioClient connection closed
2019-07-18 15:26:31,433 INFO  [cloud.agent.Agent] (Agent-Handler-2:null) (logid:) Reconnecting to host:172.17.1.141
2019-07-18 15:26:31,434 INFO  [utils.nio.NioClient] (Agent-Handler-2:null) (logid:) Connecting to 172.17.1.141:8250
2019-07-18 15:26:31,435 INFO  [utils.nio.Link] (Agent-Handler-2:null) (logid:) Conf file found: /etc/cloudstack/agent/agent.properties
2019-07-18 15:26:31,545 INFO  [utils.nio.NioClient] (Agent-Handler-2:null) (logid:) SSL: Handshake done
2019-07-18 15:26:31,546 INFO  [utils.nio.NioClient] (Agent-Handler-2:null) (logid:) Connected to 172.17.1.141:8250
2019-07-18 15:26:31,564 DEBUG [kvm.resource.LibvirtConnection] (Agent-Handler-1:null) (logid:) Looking for libvirtd connection at: qemu:///system
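The log above shows the order of events once the socket to 172.17.1.142 is finally reported closed: one more attempt against 172.17.1.142 fails with NoRouteToHostException about 3 seconds later, and about 5 seconds after that the agent connects to 172.17.1.141. The Python sketch below models only that failover sequence. It is not CloudStack's agent code: the names reconnect and tcp_connect are hypothetical, and the 5-second delay is simply read off the timestamps. It also does not model the roughly 15 minutes that passed before the closed socket was detected, which is the actual problem reported here.

```python
import socket
import time
from typing import Callable, List, Optional


def tcp_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Hypothetical probe: True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers "No route to host", refusals and timeouts
        return False


def reconnect(hosts: List[str], current: int, port: int,
              try_connect: Callable[[str, int], bool] = tcp_connect,
              retry_delay: float = 5.0,
              max_attempts: int = 10) -> Optional[str]:
    """Hypothetical failover loop: retry the server that was lost first,
    then walk the ordered list, sleeping retry_delay after each failure."""
    if not hosts:
        return None
    for attempt in range(max_attempts):
        host = hosts[(current + attempt) % len(hosts)]
        if try_connect(host, port):
            return host
        time.sleep(retry_delay)
    return None


# Simulate the outage in the log: acs-mn02 (172.17.1.142) is down.
down = {"172.17.1.142"}
chosen = reconnect(["172.17.1.142", "172.17.1.141"], 0, 8250,
                   try_connect=lambda h, p: h not in down, retry_delay=0)
print(chosen)  # 172.17.1.141
```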

From: Nicolas Vazquez<ma...@shapeblue.com>
Sent: July 18, 2019 12:48
To: dev@cloudstack.apache.org<ma...@cloudstack.apache.org>; users@cloudstack.apache.org<ma...@cloudstack.apache.org>
Subject: Re: Agent LB for CloudStack failed

Thanks,

I suspect the culprit is the background task trying to reconnect to the preferred host (which runs every 60 seconds).

I would suggest disabling the background task by setting the interval to 0. As you do not want to change your 'host' global configuration to propagate a new list to the agents, you should do it this way:

- Add this line to agent.properties: host.lb.check.interval=0
- Restart the agent

Please let me know if this fixes your issue.


Regards,

Nicolas Vazquez

________________________________
From: li jerry <di...@hotmail.com>
Sent: Thursday, July 18, 2019 12:00 AM
To: dev@cloudstack.apache.org <de...@cloudstack.apache.org>; users@cloudstack.apache.org <us...@cloudstack.apache.org>
Subject: Re: Agent LB for CloudStack failed

Hi Nicolas

test-ceph-node01

[root@test-ceph-node01 ~]# cat /etc/cloudstack/agent/agent.properties
#Storage
#Wed Jul 17 10:39:18 CST 2019
workers=5
guest.network.device=br0
private.network.device=br0
port=8250
resource=com.cloud.hypervisor.kvm.resource.LibvirtComputingResource
pod=1
zone=1
hypervisor.type=kvm
guid=88ca642a-e319-3369-b2c9-39c2b2bddc7c
public.network.device=br0
cluster=1
local.storage.uuid=ec28176f-a3db-4383-90c8-6dcdbc45c3e0
keystore.passphrase=O8VdcZqBwWMMxwk2
domr.scripts.dir=scripts/network/domr/kvm
LibvirtComputingResource.id=1
host=172.17.1.141,172.17.1.142@roundrobin

this is test-ceph-node02

[root@test-ceph-node02 ~]# cat /etc/cloudstack/agent/agent.properties
#Storage
#Wed Jul 17 10:58:23 CST 2019
guest.network.device=br0
workers=5
private.network.device=br0
port=8250
resource=com.cloud.hypervisor.kvm.resource.LibvirtComputingResource
pod=1
zone=1
guid=649cbe62-dcac-36ae-a62c-699f0e0b8af1
hypervisor.type=kvm
cluster=1
public.network.device=br0
local.storage.uuid=2fc2f796-0614-40cf-bfdf-37a9429520fb
domr.scripts.dir=scripts/network/domr/kvm
keystore.passphrase=vB48rgCk58vNJC6N
host=172.17.1.142,172.17.1.141@roundrobin
LibvirtComputingResource.id=4

test-ceph-node03

[root@test-ceph-node03 ~]# cat /etc/cloudstack/agent/agent.properties
#Storage
#Wed Jul 17 10:39:18 CST 2019
guest.network.device=br0
workers=5
private.network.device=br0
port=8250
resource=com.cloud.hypervisor.kvm.resource.LibvirtComputingResource
pod=1
zone=1
hypervisor.type=kvm
guid=4d3742c4-8678-3f21-a841-c1ffa32d0a8d
public.network.device=br0
cluster=1
local.storage.uuid=31ee15cf-b3b2-4387-b081-7c47971b9e68
keystore.passphrase=ACgs24DnBgYkORvh
domr.scripts.dir=scripts/network/domr/kvm
LibvirtComputingResource.id=5
host=172.17.1.141,172.17.1.142@roundrobin

test-ceph-node04
[root@test-ceph-node04 ~]# cat /etc/cloudstack/agent/agent.properties
#Storage
#Wed Jul 17 10:58:22 CST 2019
guest.network.device=br0
workers=5
private.network.device=br0
port=8250
resource=com.cloud.hypervisor.kvm.resource.LibvirtComputingResource
pod=1
zone=1
hypervisor.type=kvm
guid=bfd4b7ba-fd5f-365d-b4d8-a6e8e7c78c0c
public.network.device=br0
cluster=1
local.storage.uuid=2d5004ff-37b1-4f66-bff0-e71ac211f1da
keystore.passphrase=r3D4upcAOdWbwE9p
domr.scripts.dir=scripts/network/domr/kvm
LibvirtComputingResource.id=6
host=172.17.1.142,172.17.1.141@roundrobin
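Each 'host' value above is an ordered, comma-separated list of management servers followed by an '@' suffix naming the LB algorithm (format inferred from these four files). A quick way to compare the values across hosts is a small Python sketch like the one below. The helper names are hypothetical, and read_properties is a simplified key=value reader: it ignores the escapes, ':' separators and line continuations that the full Java .properties format allows.

```python
from typing import Dict, List, Optional, Tuple


def read_properties(text: str) -> Dict[str, str]:
    """Simplified reader for key=value lines; '#' and '!' lines are comments."""
    props: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def parse_host_property(value: str) -> Tuple[List[str], Optional[str]]:
    """Split 'a,b@algo' into (['a', 'b'], 'algo'); algorithm is None if absent."""
    hosts_part, sep, algorithm = value.partition("@")
    hosts = [h.strip() for h in hosts_part.split(",") if h.strip()]
    algo = algorithm.strip() if sep and algorithm.strip() else None
    return hosts, algo


# Values copied from test-ceph-node01 and test-ceph-node02 above.
node01 = read_properties("#Storage\nport=8250\nhost=172.17.1.141,172.17.1.142@roundrobin\n")
node02 = read_properties("#Storage\nport=8250\nhost=172.17.1.142,172.17.1.141@roundrobin\n")
print(parse_host_property(node01["host"]))  # (['172.17.1.141', '172.17.1.142'], 'roundrobin')
print(parse_host_property(node02["host"]))  # (['172.17.1.142', '172.17.1.141'], 'roundrobin')
```

On a real host the text would come from open("/etc/cloudstack/agent/agent.properties").read().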

From: Nicolas Vazquez<ma...@shapeblue.com>
Sent: July 18, 2019 10:56
To: users@cloudstack.apache.org<ma...@cloudstack.apache.org>; dev@cloudstack.apache.org<ma...@cloudstack.apache.org>
Subject: Re: Agent LB for CloudStack failed

Hi Jerry,

I need some additional information. Can you provide the value of the 'host' property in agent.properties on each KVM host? I suspect that the global setting has not been propagated to the agents, since the agent keeps trying to reconnect to the same management server instead of connecting to the next one once it is down.


Regards,

Nicolas Vazquez

________________________________
From: li jerry <di...@hotmail.com>
Sent: Monday, July 15, 2019 10:20 PM
To: users@cloudstack.apache.org <us...@cloudstack.apache.org>; dev@cloudstack.apache.org <de...@cloudstack.apache.org>
Subject: Agent LB for CloudStack failed

Hello everyone

Agent LB for my KVM hosts does not work on 4.11.2/4.11.3. When the preferred management node is forcibly powered off, the agent does not immediately connect to the second management node. Only after about 15 minutes does the agent log a "No route to host" error and connect to the second management node.

management nodes:
acs-mn01,172.17.1.141
acs-mn02,172.17.1.142

MySQL DB node:
acs-db01

KVM agent nodes:
test-ceph-node01
test-ceph-node02
test-ceph-node03
test-ceph-node04


global settings:

host=172.17.1.142,172.17.1.141
indirect.agent.lb.algorithm=roundrobin
indirect.agent.lb.check.interval=60
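With indirect.agent.lb.algorithm=roundrobin, the agent.properties files in this thread contain both orderings of the two servers: test-ceph-node01 and test-ceph-node03 list 172.17.1.141 first, while test-ceph-node02 and test-ceph-node04 list 172.17.1.142 first. One plausible way to produce such a spread is to rotate the global list by each agent's position in some ordering of the hosts. The Python sketch below shows only that rotation idea; it is an assumption, not CloudStack's implementation, and round_robin_order is a hypothetical name.

```python
from typing import List


def round_robin_order(ms_list: List[str], position: int) -> List[str]:
    """Hypothetical: rotate the management-server list by the agent's
    position so that consecutive agents prefer different servers."""
    if not ms_list:
        return []
    start = position % len(ms_list)
    return ms_list[start:] + ms_list[:start]


global_hosts = ["172.17.1.142", "172.17.1.141"]  # the 'host' global setting
for pos in range(4):
    print(pos, round_robin_order(global_hosts, pos))
# 0 ['172.17.1.142', '172.17.1.141']
# 1 ['172.17.1.141', '172.17.1.142']
# 2 ['172.17.1.142', '172.17.1.141']
# 3 ['172.17.1.141', '172.17.1.142']
```

Whatever the ordering, the first entry is the agent's preferred server and the remaining entries are the servers it should move on to when the preferred one is down.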


Partial agent logs:

2019-07-15 23:22:39,340 DEBUG [cloud.agent.Agent] (UgentTask-5:null) (logid:) Sending ping: Seq 1-19: { Cmd , MgmtId: -1, via: 1, Ver : v1, Flags: 11, [{"com.cloud.agent.api.PingRoutingWithNwGroupsCommand":{"newGroupStates":{},"_hostVmStateReport":{},"_gatewayAccessible":true,"_vnetAccessible":true,"hostType ":"Routing","hostId":1,"wait":0}}] }
2019-07-15 23:23:09,960 DEBUG [utils.nio.NioConnection] (Agent-NioConnectionHandler-1:null) (logid:) Location 1: Socket Socket[addr=/172.17.1.142,port=8250,localport= 34854] closed on read. Probably -1 returned: No route to host
2019-07-15 23:23:09,960 DEBUG [utils.nio.NioConnection] (Agent-NioConnectionHandler-1:null) (logid:) Closing socket Socket[addr=/172.17.1.142,port=8250,localport=34854]
2019-07-15 23:23:09,961 DEBUG [cloud.agent.Agent] (Agent-Handler-4:null) (logid:a4e4de49) Clearing watch list: 2
2019-07-15 23:23:09,962 INFO [cloud.agent.Agent] (Agent-Handler-4:null) (logid:a4e4de49) Lost connection to host: 172.17.1.142. Attempting reconnection while we still have 0 commands in Progress.
2019-07-15 23:23:09,963 INFO [utils.nio.NioClient] (Agent-Handler-4:null) (logid:a4e4de49) NioClient connection closed
2019-07-15 23:23:09,964 INFO [cloud.agent.Agent] (Agent-Handler-4:null) (logid:a4e4de49) Reconnecting to host:172.17.1.142
2019-07-15 23:23:09,964 INFO [utils.nio.NioClient] (Agent-Handler-4:null) (logid:a4e4de49) Connecting to 172.17.1.142:8250
2019-07-15 23:23:12,972 ERROR [utils.nio.NioConnection] (Agent-Handler-4:null) (logid:a4e4de49) Unable to initialize the threads.
java.net.NoRouteToHostException: No route to host
      at sun.nio.ch.Net.connect0(Native Method)
      at sun.nio.ch.Net.connect(Net.java:454)
      at sun.nio.ch.Net.connect(Net.java:446)
      at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:648)
      at com.cloud.utils.nio.NioClient.init(NioClient.java:56)
      at com.cloud.utils.nio.NioConnection.start(NioConnection.java:95)
      at com.cloud.agent.Agent.reconnect(Agent.java:517)
      at com.cloud.agent.Agent$ServerHandler.doTask(Agent.java:1091)
      at com.clo

Nicolas.Vazquez@shapeblue.com
www.shapeblue.com<http://www.shapeblue.com>
Amadeus House, Floral Street, London  WC2E 9DPUK
@shapeblue





Re: Agent LB for CloudStack failed

Posted by Nicolas Vazquez <Ni...@shapeblue.com>.
Thanks,

I suspect the culprit is the background task trying to reconnect to the preferred host (which runs every 60 seconds).

I would suggest disabling the background task by setting the interval to 0. As you do not want to change your 'host' global configuration to propagate a new list to the agents, you should do it this way:

- Add this line to agent.properties: host.lb.check.interval=0
- Restart the agent

Please let me know if this fixes your issue.


Regards,

Nicolas Vazquez
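The suggestion above relies on an interval of 0 turning the periodic preferred-host check off entirely, while a positive value (60 seconds here, matching indirect.agent.lb.check.interval=60) schedules it repeatedly. A minimal Python sketch of that on/off behaviour, assuming a simple timer thread; start_preferred_host_check and the check callback are hypothetical and do not reflect how the CloudStack agent is actually written.

```python
import threading
from typing import Callable, Optional


def start_preferred_host_check(interval_s: float,
                               check: Callable[[], None]) -> Optional[threading.Event]:
    """Hypothetical: call check() every interval_s seconds on a daemon thread.

    An interval of 0 (or less) means the task is never scheduled, which is
    the effect host.lb.check.interval=0 is meant to have. Returns an Event
    that stops the loop when set, or None if the task is disabled.
    """
    if interval_s <= 0:
        return None
    stop = threading.Event()

    def loop() -> None:
        # Event.wait() returns False on timeout and True once stop is set.
        while not stop.wait(interval_s):
            check()

    threading.Thread(target=loop, daemon=True).start()
    return stop


disabled = start_preferred_host_check(0, lambda: print("checking preferred host"))
print(disabled)  # None: nothing was scheduled
enabled = start_preferred_host_check(60, lambda: print("checking preferred host"))
enabled.set()    # stop the loop before its first run
```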
