Posted to dev@slider.apache.org by "Steve Loughran (JIRA)" <ji...@apache.org> on 2014/08/06 15:45:12 UTC

[jira] [Resolved] (SLIDER-276) AgentProvider releases nodes that the AM has already released after detecting heartbeat failure

     [ https://issues.apache.org/jira/browse/SLIDER-276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Loughran resolved SLIDER-276.
-----------------------------------

       Resolution: Fixed
    Fix Version/s: Slider 0.50
         Assignee: Steve Loughran  (was: Ted Yu)

> AgentProvider releases nodes that the AM has already released after detecting heartbeat failure
> -----------------------------------------------------------------------------------------------
>
>                 Key: SLIDER-276
>                 URL: https://issues.apache.org/jira/browse/SLIDER-276
>             Project: Slider
>          Issue Type: Bug
>    Affects Versions: Slider 0.40
>            Reporter: Ted Yu
>            Assignee: Steve Loughran
>            Priority: Minor
>             Fix For: Slider 0.50
>
>          Time Spent: 3h
>  Remaining Estimate: 0h
>
> I issued a flex command to reduce the number of region servers by 1:
> {code}
> 14/08/04 18:14:52 INFO state.AppState: RoleStatus{name='HBASE_REGIONSERVER', key=2, desired=1, actual=2, requested=0, releasing=0, failed=0, started=2, startFailed=0, completed=0, failureMessage=''}
> 14/08/04 18:14:52 INFO state.AppState: HBASE_REGIONSERVER: Asking for 1 fewer node(s) for a total of 1
> 14/08/04 18:14:52 INFO state.AppState: RoleStatus{name='HBASE_MASTER', key=1, desired=1, actual=1, requested=0, releasing=0, failed=0, started=1, startFailed=0, completed=0, failureMessage=''}
> 14/08/04 18:14:52 INFO state.AppState: RoleStatus{name='HBASE_REST', key=3, desired=1, actual=1, requested=0, releasing=0, failed=0, started=1, startFailed=0, completed=0, failureMessage=''}
> 14/08/04 18:14:52 INFO appmaster.SliderAppMaster: onContainersCompleted([1]
> 14/08/04 18:14:52 INFO appmaster.SliderAppMaster: Container Completion for containerID=container_1405721039692_0013_01_000004, state=COMPLETE, exitStatus=-100, diagnostics=Container released by application
> 14/08/04 18:14:52 INFO state.AppState: Container was queued for release
> 14/08/04 18:14:52 INFO state.AppState: decrementing role count for role HBASE_REGIONSERVER
> 14/08/04 18:14:53 INFO state.AppState: RoleStatus{name='HBASE_REGIONSERVER', key=2, desired=1, actual=1, requested=0, releasing=0, failed=0, started=2, startFailed=0, completed=1, failureMessage=''}
> 14/08/04 18:14:53 INFO state.AppState: RoleStatus{name='HBASE_MASTER', key=1, desired=1, actual=1, requested=0, releasing=0, failed=0, started=1, startFailed=0, completed=0, failureMessage=''}
> 14/08/04 18:14:53 INFO state.AppState: RoleStatus{name='HBASE_REST', key=3, desired=1, actual=1, requested=0, releasing=0, failed=0, started=1, startFailed=0, completed=0, failureMessage=''}
> 14/08/04 18:16:18 WARN agent.HeartbeatMonitor: Component container_1405721039692_0013_01_000004___HBASE_REGIONSERVER marked UNHEALTHY. Last heartbeat received at 1407176092207 approx. 86129 ms. back.
> 14/08/04 18:17:18 WARN agent.HeartbeatMonitor: Component container_1405721039692_0013_01_000004___HBASE_REGIONSERVER marked HEARTBEAT_LOST. Last heartbeat received at 1407176092207 approx. 146130 ms. back.
> 14/08/04 18:17:18 INFO appmaster.SliderAppMaster: Refreshing container container_1405721039692_0013_01_000004 per provider request.
> 14/08/04 18:17:18 WARN agent.HeartbeatMonitor: ERROR
> java.lang.AssertionError: no live nodes to release
> 	at org.apache.slider.server.appmaster.state.NodeEntry.release(NodeEntry.java:172)
> 	at org.apache.slider.server.appmaster.state.RoleHistory.onContainerReleaseSubmitted(RoleHistory.java:656)
> 	at org.apache.slider.server.appmaster.state.AppState.containerReleaseSubmitted(AppState.java:919)
> 	at org.apache.slider.server.appmaster.state.AppState.releaseContainer(AppState.java:1491)
> 	at org.apache.slider.server.appmaster.SliderAppMaster.refreshContainer(SliderAppMaster.java:1444)
> 	at org.apache.slider.providers.agent.AgentProviderService.releaseContainer(AgentProviderService.java:391)
> 	at org.apache.slider.providers.agent.HeartbeatMonitor.doWork(HeartbeatMonitor.java:109)
> 	at org.apache.slider.providers.agent.HeartbeatMonitor.run(HeartbeatMonitor.java:69)
> 	at java.lang.Thread.run(Thread.java:722)
> {code}
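> As the log above shows, the NodeEntry#containerCompleted() event was received before NodeEntry#release() was called, so by the time the HeartbeatMonitor requested the release the node had no live container left to release.
> This triggered the following assertion: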
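> {code}
>   public synchronized void release() {
>     // Fails when this node no longer has a live container
>     assert live > 0 : "no live nodes to release";
>     // ... remainder of the method omitted in the report
>   }
> {code}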
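
The ordering problem described above can be reproduced in isolation. The following is a minimal sketch only: NodeEntrySketch, releaseStrict(), releaseIfLive() and replayLoggedSequence() are hypothetical names invented for illustration, and the counters are a simplified assumption about how a per-node entry might track containers. It is not Slider's NodeEntry and not necessarily how SLIDER-276 was fixed. It replays the logged sequence (flex-down release, completion event, late heartbeat-triggered release) and contrasts a strict check with a tolerant one that reports a stale request instead of failing.

```java
/** Hypothetical stand-in for a per-node container counter; not Slider's NodeEntry. */
final class NodeEntrySketch {
    private int live;       // containers currently running on this node
    private int releasing;  // release requests submitted but not yet completed

    NodeEntrySketch(int live) {
        if (live < 0) {
            throw new IllegalArgumentException("live must be >= 0");
        }
        this.live = live;
    }

    /** Completion event from the resource manager: the container is gone. */
    synchronized void containerCompleted() {
        if (releasing > 0) {
            releasing--;
        }
        if (live > 0) {
            live--;
        }
    }

    /** Strict variant: same condition as the quoted assertion, but always enforced. */
    synchronized void releaseStrict() {
        if (live <= 0) {
            throw new AssertionError("no live nodes to release");
        }
        releasing++;
    }

    /** Tolerant variant: returns false for a stale request instead of throwing. */
    synchronized boolean releaseIfLive() {
        if (live <= 0) {
            return false;
        }
        releasing++;
        return true;
    }

    synchronized int live() {
        return live;
    }

    synchronized int releasing() {
        return releasing;
    }

    /** Replays the logged order of events; returns whether the late release was applied. */
    static boolean replayLoggedSequence() {
        NodeEntrySketch node = new NodeEntrySketch(1);
        node.releaseStrict();        // flex-down submits the release
        node.containerCompleted();   // completion event arrives first
        return node.releaseIfLive(); // heartbeat monitor asks again, too late
    }
}
```

In this sketch, replayLoggedSequence() returns false because the container has already completed by the time the second release is requested, whereas calling releaseStrict() at that point throws the same "no live nodes to release" error seen in the stack trace.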



--
This message was sent by Atlassian JIRA
(v6.2#6252)