Posted to issues@cloudstack.apache.org by "ASF subversion and git services (JIRA)" <ji...@apache.org> on 2017/12/02 15:46:01 UTC

[jira] [Commented] (CLOUDSTACK-7853) Hosts that are temporary Disconnected and get behind on ping (PingTimeout) turn up in permanent state Alert

    [ https://issues.apache.org/jira/browse/CLOUDSTACK-7853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16275620#comment-16275620 ] 

ASF subversion and git services commented on CLOUDSTACK-7853:
-------------------------------------------------------------

Commit 9d6972cb244cc3f659624bbbc35f99fff1c2a44b in cloudstack's branch refs/heads/debian9-systemvmtemplate from [~rohit.yadav@shapeblue.com]
[ https://gitbox.apache.org/repos/asf?p=cloudstack.git;h=9d6972c ]

CLOUDSTACK-7853: Fix ping timeout edge case and refactor code

Refresh InaccurateClock every 10 seconds, refactor code to get ping timeout
and ping interval.

Signed-off-by: Rohit Yadav <ro...@shapeblue.com>


> Hosts that are temporary Disconnected and get behind on ping (PingTimeout) turn up in permanent state Alert
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: CLOUDSTACK-7853
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7853
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the default.) 
>    Affects Versions: 4.3.0, 4.4.0, 4.5.0, 4.3.1, 4.4.1, 4.6.0
>            Reporter: Joris van Lieshout
>
> If for some reason a host that is Disconnected gets behind on ping (PingTimeout), it is transitioned to a permanent state of Alert. I have been unable to determine why, but my suspicion is that the management server is busy processing other agent requests and/or xapi is temporarily unavailable.
> INFO  [c.c.a.m.AgentManagerImpl] (AgentMonitor-1:ctx-9551e174) Found the following agents behind on ping: [421, 427, 425]
> DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-9551e174) Ping timeout for host 421, do invstigation
> DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-9551e174) Transition:[Resource state = Enabled, Agent event = PingTimeout, Host id = 421, name = xxxxxx1]
> DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-9551e174) Agent status update: [id = 421; name = xxxxxx1; old status = Disconnected; event = PingTimeout; new status = Alert; old update count = 111; new update count = 112]
> ----/ next cycle / -----
> INFO  [c.c.a.m.AgentManagerImpl] (AgentMonitor-1:ctx-2a81b9f7) Found the following agents behind on ping: [421, 427, 425]
> DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-2a81b9f7) Ping timeout for host 421, do invstigation
> DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-2a81b9f7) Transition:[Resource state = Enabled, Agent event = PingTimeout, Host id = 421, name = xxxxxx1]
> DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-2a81b9f7) Cannot transit agent status with event PingTimeout for host 421, name=xxxxxx1, mangement server id is 345052370017
> ERROR [c.c.a.m.AgentManagerImpl] (AgentMonitor-1:ctx-2a81b9f7) Caught the following exception: 
> com.cloud.utils.exception.CloudRuntimeException: Cannot transit agent status with event PingTimeout for host 421, mangement server id is 345052370017,Unable to transition to a new state from Alert via PingTimeout
>         at com.cloud.agent.manager.AgentManagerImpl.agentStatusTransitTo(AgentManagerImpl.java:1334)
>         at com.cloud.agent.manager.AgentManagerImpl.disconnectAgent(AgentManagerImpl.java:1349)
>         at com.cloud.agent.manager.AgentManagerImpl.disconnectInternal(AgentManagerImpl.java:1378)
>         at com.cloud.agent.manager.AgentManagerImpl.disconnectWithInvestigation(AgentManagerImpl.java:1384)
>         at com.cloud.agent.manager.AgentManagerImpl$MonitorTask.runInContext(AgentManagerImpl.java:1466)
>         at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49)
>         at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
>         at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
>         at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
>         at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:165)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:267)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:701)
> I think the bug occurs because there is no valid state transition from Alert via PingTimeout to something recoverable.
> Status.java
>         s_fsm.addTransition(Status.Alert, Event.AgentConnected, Status.Connecting);
>         s_fsm.addTransition(Status.Alert, Event.Ping, Status.Up);
>         s_fsm.addTransition(Status.Alert, Event.Remove, Status.Removed);
>         s_fsm.addTransition(Status.Alert, Event.ManagementServerDown, Status.Alert);
>         s_fsm.addTransition(Status.Alert, Event.AgentDisconnected, Status.Alert);
>         s_fsm.addTransition(Status.Alert, Event.ShutdownRequested, Status.Disconnected);
> As a workaround to get out of this situation, we unmanage the cluster, wait 10 minutes, and then manage the cluster again.
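
The missing transition described above can be reproduced with a minimal, self-contained sketch. This is not CloudStack's Status.java: the class HostStatusSketch and its methods are hypothetical, and only the Disconnected-to-Alert transition seen in the log and the six Alert transitions quoted from Status.java are modelled. The last step adds an Alert self-transition on PingTimeout purely as an illustration of one way the lookup could succeed; it is an assumption, not the change made by the commit referenced in this message.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in for the host status state machine; not CloudStack code.
class HostStatusSketch {
    enum Status { Connecting, Up, Disconnected, Alert, Removed }
    enum Event { AgentConnected, Ping, Remove, ManagementServerDown,
                 AgentDisconnected, ShutdownRequested, PingTimeout }

    private final Map<Status, Map<Event, Status>> transitions =
            new EnumMap<Status, Map<Event, Status>>(Status.class);

    void addTransition(Status from, Event event, Status to) {
        transitions
            .computeIfAbsent(from, s -> new EnumMap<Event, Status>(Event.class))
            .put(event, to);
    }

    // Returns the next status, or empty when no transition is registered.
    Optional<Status> next(Status from, Event event) {
        Map<Event, Status> byEvent = transitions.get(from);
        return byEvent == null
                ? Optional.empty()
                : Optional.ofNullable(byEvent.get(event));
    }

    // Builds only the transitions that appear in the issue report.
    static HostStatusSketch reported() {
        HostStatusSketch fsm = new HostStatusSketch();
        // Seen in the log: old status = Disconnected, event = PingTimeout, new status = Alert.
        fsm.addTransition(Status.Disconnected, Event.PingTimeout, Status.Alert);
        // Quoted from Status.java in the report.
        fsm.addTransition(Status.Alert, Event.AgentConnected, Status.Connecting);
        fsm.addTransition(Status.Alert, Event.Ping, Status.Up);
        fsm.addTransition(Status.Alert, Event.Remove, Status.Removed);
        fsm.addTransition(Status.Alert, Event.ManagementServerDown, Status.Alert);
        fsm.addTransition(Status.Alert, Event.AgentDisconnected, Status.Alert);
        fsm.addTransition(Status.Alert, Event.ShutdownRequested, Status.Disconnected);
        return fsm;
    }

    public static void main(String[] args) {
        HostStatusSketch fsm = reported();
        // First monitor cycle: the transition exists.
        System.out.println(fsm.next(Status.Disconnected, Event.PingTimeout));
        // Next cycle: no transition is registered, which is where the report sees the exception.
        System.out.println(fsm.next(Status.Alert, Event.PingTimeout));
        // Assumption for illustration only: tolerate repeated timeouts while in Alert.
        fsm.addTransition(Status.Alert, Event.PingTimeout, Status.Alert);
        System.out.println(fsm.next(Status.Alert, Event.PingTimeout));
    }
}
```

In this sketch a host in Alert can still leave that state through Ping or AgentConnected, so the self-transition only prevents the failed lookup on repeated timeouts; whether that is appropriate for the real state machine is a separate question.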



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)