Posted to dev@slider.apache.org by "ASF subversion and git services (JIRA)" <ji...@apache.org> on 2017/01/25 19:01:26 UTC

[jira] [Commented] (SLIDER-1189) Agent never connects to new AM if AM restart takes too long

    [ https://issues.apache.org/jira/browse/SLIDER-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15838374#comment-15838374 ] 

ASF subversion and git services commented on SLIDER-1189:
---------------------------------------------------------

Commit 5a83421b2291298aef3cd4c99c880b5cb26d29ed in incubator-slider's branch refs/heads/develop from [~billie.rinaldi]
[ https://git-wip-us.apache.org/repos/asf?p=incubator-slider.git;h=5a83421 ]

SLIDER-1189 Agent never connects to new AM if AM restart takes too long


> Agent never connects to new AM if AM restart takes too long
> -----------------------------------------------------------
>
>                 Key: SLIDER-1189
>                 URL: https://issues.apache.org/jira/browse/SLIDER-1189
>             Project: Slider
>          Issue Type: Bug
>          Components: agent
>            Reporter: Billie Rinaldi
>            Assignee: Billie Rinaldi
>            Priority: Critical
>             Fix For: Slider 1.0.0
>
>         Attachments: SLIDER-1189.1.patch, SLIDER-1189.2.patch, SLIDER-1189.3.patch
>
>
> In testing RM and AM failure scenarios, I killed my RM, killed the AM, waited for a bit, then restarted the RM. The AM is restarted, but running agents never connect to the new AM. The agent re-reads the AM data from the ZK registry once, when the heartbeat retry threshold is reached, and then tries to re-register with the AM. However, if the AM data is still stale at that point, the agent never re-reads it from the ZK registry again and keeps retrying registration with the nonexistent AM forever (until the new AM times it out due to heartbeat loss and kills it).
> Note that this happens when the AM restart is delayed by more than about a minute, which can occur if the RM is down, or if the RM is up but busy.
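
The failure mode described above can be illustrated with a small, self-contained simulation. This is only a sketch under stated assumptions, not the Slider agent code and not necessarily what the attached patches do: the class `Agent`, the callbacks `read_registry` and `try_register`, the flag `reread_every_time`, the addresses "am-old" and "am-new", and the tick counts are all hypothetical names and values chosen for illustration. It contrasts an agent that re-reads the registry only the first time the retry threshold is hit with one that re-reads every time the threshold is hit.

```python
class Agent:
    """Toy model of an agent that registers with an AM address read from a registry."""

    def __init__(self, read_registry, try_register, retry_threshold=3,
                 reread_every_time=True):
        self.read_registry = read_registry    # hypothetical: returns the current AM address
        self.try_register = try_register      # hypothetical: returns True on success
        self.retry_threshold = retry_threshold
        self.reread_every_time = reread_every_time
        self.am_address = read_registry()     # initial lookup
        self.failures = 0
        self.rereads = 0

    def step(self):
        """Make one registration attempt; return True once connected."""
        if self.try_register(self.am_address):
            self.failures = 0
            return True
        self.failures += 1
        if self.failures >= self.retry_threshold:
            # Modelled bug: the registry is consulted only the first time
            # the threshold is reached, so stale data is kept forever.
            if self.reread_every_time or self.rereads == 0:
                self.am_address = self.read_registry()
                self.rereads += 1
            self.failures = 0
        return False


def simulate(reread_every_time, am_restart_tick=10, max_ticks=30):
    """Return the tick at which the agent connects, or None if it never does."""
    clock = {"tick": 0}

    def read_registry():
        # The registry holds the old AM address until the new AM has started.
        return "am-new" if clock["tick"] >= am_restart_tick else "am-old"

    def try_register(address):
        # Only the new AM exists, and only after it has restarted.
        return address == "am-new" and clock["tick"] >= am_restart_tick

    agent = Agent(read_registry, try_register, retry_threshold=3,
                  reread_every_time=reread_every_time)
    for tick in range(max_ticks):
        clock["tick"] = tick
        if agent.step():
            return tick
    return None


if __name__ == "__main__":
    print("re-read once:      ", simulate(reread_every_time=False))
    print("re-read every time:", simulate(reread_every_time=True))
```

In this model the AM restart is delayed past the agent's single re-read, so the agent that re-reads only once never connects, while the agent that re-reads on every threshold hit picks up the new address shortly after the AM returns.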



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)