You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@cloudstack.apache.org by "Sowmya Krishnan (JIRA)" <ji...@apache.org> on 2013/06/10 11:01:24 UTC

[jira] [Created] (CLOUDSTACK-2918) In a scaled up environment, hosts fail to come up after Management server restart in clustered set up

Sowmya Krishnan created CLOUDSTACK-2918:
-------------------------------------------

             Summary: In a scaled up environment, hosts fail to come up after Management server restart in clustered set up
                 Key: CLOUDSTACK-2918
                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-2918
             Project: CloudStack
          Issue Type: Bug
      Security Level: Public (Anyone can view this level - this is the default.)
          Components: Management Server
    Affects Versions: 4.2.0
         Environment: Load Test environment with simulator
            Reporter: Sowmya Krishnan
             Fix For: 4.2.0


Scaled setup with 20K simulated hosts and as many VMs, Basic Zone, Clustered setup with 3 management servers.

After restarting one or all the Management servers, re balancing hosts takes very long time, and the rebalanced hosts don't come to Up state either.
Following is the error message from logs when the MS was starting up:

2013-05-31 00:00:22,361 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Notification-1:null) Joining node, IP: 10.223.48.2, msid: 206915885094132
2013-05-31 00:00:22,361 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Notification-1:null) Joining node, IP: 10.223.48.66, msid: 206915885097283
2013-05-31 00:00:22,362 DEBUG [cloud.alert.ClusterAlertAdapter] (Cluster-Notification-1:null) Receive cluster alert, EventArgs: com.cloud.cluster.ClusterNodeJoinEventArgs
2013-05-31 00:00:22,362 DEBUG [cloud.alert.ClusterAlertAdapter] (Cluster-Notification-1:null) Handle cluster node join alert, joined node: 10.223.48.2, msidL: 206915885094132
2013-05-31 00:00:22,362 DEBUG [cloud.alert.ClusterAlertAdapter] (Cluster-Notification-1:null) Handle cluster node join alert, joined node: 10.223.48.66, msidL: 206915885097283
2013-05-31 00:00:22,362 DEBUG [cloud.alert.ClusterAlertAdapter] (Cluster-Notification-1:null) Management server node 10.223.48.66 is up, send alert
2013-05-31 00:00:22,539 WARN [cloud.cluster.ClusterManagerImpl] (Cluster-Notification-1:null) Notifying management server join event took 178 ms
2013-05-31 00:00:25,825 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to finish. profiler: Done. Duration: 5583ms, profilerH
eartbeatUpdate: Done. Duration: 187ms, profilerPeerScan: Done. Duration: 560ms, profilerAgentLB: Done. Duration: 4836ms
2013-05-31 00:00:25,963 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null) Detected management node joined, id:3, nodeIP:10.223.48.130
2013-05-31 00:00:25,964 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Notification-1:null) Notify management server node join to listeners.
2013-05-31 00:00:25,964 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Notification-1:null) Joining node, IP: 10.223.48.130, msid: 206915885093830
2013-05-31 00:00:25,964 DEBUG [cloud.alert.ClusterAlertAdapter] (Cluster-Notification-1:null) Receive cluster alert, EventArgs: com.cloud.cluster.ClusterNodeJoinEventArgs
2013-05-31 00:00:25,964 DEBUG [cloud.alert.ClusterAlertAdapter] (Cluster-Notification-1:null) Handle cluster node join alert, joined node: 10.223.48.130, msidL: 206915885093830
2013-05-31 00:00:25,964 WARN [cloud.cluster.ClusterManagerImpl] (Cluster-Notification-1:null) Notifying management server join event took 0 ms
2013-05-31 00:00:27,501 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to finish. profiler: Done. Duration: 1675ms, profilerH
eartbeatUpdate: Done. Duration: 135ms, profilerPeerScan: Done. Duration: 179ms, profilerAgentLB: Done. Duration: 1361ms
2013-05-31 00:00:29,389 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to finish. profiler: Done. Duration: 1888ms, profilerH
eartbeatUpdate: Done. Duration: 462ms, profilerPeerScan: Done. Duration: 2ms, profilerAgentLB: Done. Duration: 1424ms
2013-05-31 00:00:33,912 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to finish. profiler: Done. Duration: 1643ms, profilerH
eartbeatUpdate: Done. Duration: 141ms, profilerPeerScan: Done. Duration: 1ms, profilerAgentLB: Done. Duration: 1501ms
2013-05-31 00:00:41,328 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to finish. profiler: Done. Duration: 1675ms, profilerH
eartbeatUpdate: Done. Duration: 178ms, profilerPeerScan: Done. Duration: 2ms, profilerAgentLB: Done. Duration: 1495ms
2013-05-31 00:00:48,799 DEBUG [cloud.alert.AlertManagerImpl] (CapacityChecker:null) Running Capacity Checker ...
2013-05-31 00:00:48,805 DEBUG [cloud.alert.AlertManagerImpl] (CapacityChecker:null) recalculating system capacity
2013-05-31 00:00:48,805 DEBUG [cloud.alert.AlertManagerImpl] (CapacityChecker:null) Executing cpu/ram capacity update
2013-05-31 00:00:48,883 DEBUG [network.router.VirtualNetworkApplianceManagerImpl] (RouterStatusMonitor-1:null) Found 0 routers to update status.
2013-05-31 00:00:48,901 DEBUG [network.router.VirtualNetworkApplianceManagerImpl] (RouterStatusMonitor-1:null) Found 0 networks to update RvR status.
2013-05-31 00:00:50,031 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to finish. profiler: Done. Duration: 1596ms, profilerHeartbeatUpdate: Done. Duration: 179ms, profilerPeerScan: Done. Duration: 1ms, profilerAgentLB: Done. Duration: 1416ms
2013-05-31 00:00:50,388 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:null) No need to calibrate cpu capacity, host:2 usedCpu: 500 reservedCpu: 0
2013-05-31 00:00:50,388 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:null) No need to calibrate memory capacity, host:2 usedMem: 268435456 reservedMem: 0
2013-05-31 00:00:50,397 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:null) No need to calibrate cpu capacity, host:3 usedCpu: 500 reservedCpu: 0
2013-05-31 00:00:50,397 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:null) No need to calibrate memory capacity, host:3 usedMem: 1073741824 reservedMem: 0
2013-05-31 00:00:50,406 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:null) No need to calibrate cpu capacity, host:5 usedCpu: 0 reservedCpu: 0
...
...
...

2013-05-31 00:00:50,388 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:null) No need to calibrate memory capacity, host:2 usedMem: 268435456 reservedMem: 0
2013-05-31 00:00:50,397 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:null) No need to calibrate cpu capacity, host:3 usedCpu: 500 reservedCpu: 0
2013-05-31 00:00:50,397 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:null) No need to calibrate memory capacity, host:3 usedMem: 1073741824 reservedMem: 0
2013-05-31 00:00:50,406 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:null) No need to calibrate cpu capacity, host:5 usedCpu: 0 reservedCpu: 0
2013-05-31 00:00:50,406 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:null) No need to calibrate memory capacity, host:5 usedMem: 0 reservedMem: 0
2013-05-31 00:00:50,414 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:null) No need to calibrate cpu capacity, host:6 usedCpu: 0 reservedCpu: 0
...
...
...

2013-05-31 00:03:05,249 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:null) No need to calibrate memory capacity, host:20002 usedMem: 0 reservedMem: 0
2013-05-31 00:03:05,249 DEBUG [cloud.alert.AlertManagerImpl] (CapacityChecker:null) Done executing cpu/ram capacity update
2013-05-31 00:03:05,250 DEBUG [cloud.alert.AlertManagerImpl] (CapacityChecker:null) Executing storage capacity update
2013-05-31 00:03:05,890 DEBUG [cloud.storage.StorageManagerImpl] (CapacityChecker:null) Successfully set Capacity - 1099511627776 for capacity type - 9 , DataCenterId - 1, HostOrPoolId - 20
0, PodId 1
2013-05-31 00:03:06,080 DEBUG [cloud.storage.StorageManagerImpl] (CapacityChecker:null) Successfully set Capacity - 1099511627776 for capacity type - 9 , DataCenterId - 1, HostOrPoolId - 20
1, PodId 1


2013-05-31 00:04:32,680 DEBUG [cloud.alert.AlertManagerImpl] (CapacityChecker:null) Done executing storage capacity update
2013-05-31 00:04:32,680 DEBUG [cloud.alert.AlertManagerImpl] (CapacityChecker:null) Executing capacity updates for public ip and Vlans
2013-05-31 00:04:33,488 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to finish. profiler: Done. Duration: 1565ms, profilerHeartbeatUpdate: Done. Duration: 122ms, profilerPeerScan: Done. Duration: 2ms, profilerAgentLB: Done. Duration: 1441ms
2013-05-31 00:04:35,070 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to finish. profiler: Done. Duration: 1582ms, profilerHeartbeatUpdate: Done. Duration: 172ms, profilerPeerScan: Done. Duration: 2ms, profilerAgentLB: Done. Duration: 1408ms
2013-05-31 00:04:35,206 DEBUG [cloud.alert.AlertManagerImpl] (CapacityChecker:null) Done capacity updates for public ip and Vlans
2013-05-31 00:04:35,206 DEBUG [cloud.alert.AlertManagerImpl] (CapacityChecker:null) Executing capacity updates for private ip
2013-05-31 00:04:37,155 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to finish. profiler: Done. Duration: 2085ms, profilerHeartbeatUpdate: Done. Duration: 158ms, profilerPeerScan: Done. Duration: 2ms, profilerAgentLB: Done. Duration: 1925ms


...
...
...


2013-05-31 01:56:10,384 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to finish. profiler: Done. Duration: 2105ms, profilerHeartbeatUpdate: Done. Duration: 141ms, profilerPeerScan: Done. Duration: 1ms, profilerAgentLB: Done. Duration: 1963ms
2013-05-31 01:56:12,043 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to finish. profiler: Done. Duration: 1659ms, profilerHeartbeatUpdate: Done. Duration: 120ms, profilerPeerScan: Done. Duration: 2ms, profilerAgentLB: Done. Duration: 1537ms
2013-05-31 01:56:14,172 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to finish. profiler: Done. Duration: 2129ms, profilerHeartbeatUpdate: Done. Duration: 96ms, profilerPeerScan: Done. Duration: 2ms, profilerAgentLB: Done. Duration: 2031ms

All 3 Management servers have the same above message and the hosts continue to be in disconnected state

mysql> select count(*), status, type, mgmt_server_id from host group by mgmt_server_id, status, type;
+----------+--------------+--------------------+-----------------+
| count(*) | status | type | mgmt_server_id |
+----------+--------------+--------------------+-----------------+
| 1 | Alert | SecondaryStorage | NULL |
| 17854 | Disconnected | Routing | NULL |
| 1 | Disconnected | SecondaryStorageVM | NULL |
| 53 | Disconnected | Routing | 206915885093830 |
| 1485 | Disconnected | Routing | 206915885094132 |
| 608 | Disconnected | Routing | 206915885097283 |
+----------+--------------+--------------------+-----------------+
6 rows in set (0.07 sec)

Observed this issue even with lesser hosts ~5K 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira