Posted to yarn-issues@hadoop.apache.org by "Karthik Kambatla (JIRA)" <ji...@apache.org> on 2016/10/04 19:01:20 UTC

[jira] [Commented] (YARN-5677) RM can be in active-active state for an extended period

    [ https://issues.apache.org/jira/browse/YARN-5677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15546317#comment-15546317 ] 

Karthik Kambatla commented on YARN-5677:
----------------------------------------

A meaningful implementation of {{enterNeutralMode}} makes a lot of sense. Sorry for not filing a JIRA for the TODO I added years ago.

The patch here makes sense. My one concern is that the outstanding task is allowed to keep running even after the timer is canceled, especially when the cancellation happens as part of becomeActive.
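
To illustrate the concern, here is a minimal sketch that assumes the timer behaves like {{java.util.Timer}}, whose {{cancel()}} discards tasks that have not started yet but does not interrupt a task that is already executing. The class and method names are hypothetical and are not taken from the patch.

```java
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; not part of the YARN code base or the patch.
public class TimerCancelSketch {

  /** Returns true if the in-flight task ran to completion despite cancel(). */
  public static boolean taskFinishesAfterCancel() throws InterruptedException {
    Timer timer = new Timer("sketch-timer", true); // daemon timer thread
    CountDownLatch started = new CountDownLatch(1);
    CountDownLatch cancelled = new CountDownLatch(1);
    CountDownLatch finished = new CountDownLatch(1);

    timer.schedule(new TimerTask() {
      @Override
      public void run() {
        started.countDown();
        try {
          // Stay "in flight" until the caller has cancelled the timer.
          cancelled.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return;
        }
        // Reached even though the timer has already been cancelled.
        finished.countDown();
      }
    }, 0);

    started.await(5, TimeUnit.SECONDS); // the task is now running
    timer.cancel();                     // does not interrupt the running task
    cancelled.countDown();              // let the task proceed
    return finished.await(5, TimeUnit.SECONDS);
  }
}
```

If the real task has side effects, one possible guard is to have it re-check the current state immediately before acting, so that a task that outlives its cancellation becomes a no-op.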

[~templedf] - in an offline conversation, you mentioned running into issues with the VerifyActiveStatusThread being stuck on transition to standby. Is the plan to fix that in this JIRA as well, or to take care of it as a follow-up?


> RM can be in active-active state for an extended period
> -------------------------------------------------------
>
>                 Key: YARN-5677
>                 URL: https://issues.apache.org/jira/browse/YARN-5677
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 3.0.0-alpha1
>            Reporter: Daniel Templeton
>            Assignee: Daniel Templeton
>            Priority: Critical
>         Attachments: YARN-5677.001.patch
>
>
> In trunk, there is no maximum number of retries that I see.  It appears the connection will be retried forever, with the active never figuring out it's no longer active.  In my testing, the active-active state lasted almost 2 hours with no sign of stopping before I killed it.  The solution appears to be to cap the number of retries or amount of time spent retrying.
> This issue is significant because of the asynchronous nature of job submission.  If the active doesn't know it's not active, it will buffer up job submissions until it finally realizes it has become the standby. Then it will fail all the job submissions in bulk. In high-volume workflows, that behavior can cause mass job failures.
> This issue is also important because the node managers will not fail over to the new active until the old active realizes it's the standby.  Workloads submitted after the old active loses contact with ZK will therefore fail to be executed regardless of which RM the clients contact.
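
The fix proposed in the description, capping the number of retries or the amount of time spent retrying, could look roughly like the sketch below. It is a generic illustration only: the class, method, and parameter names are hypothetical, and it does not reflect how the attached patch or the ZooKeeper client actually bounds its retries.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

// Hypothetical helper; not part of the YARN code base or the patch.
public class BoundedRetrySketch {

  /**
   * Calls op until it succeeds, maxAttempts is reached, or maxElapsedMs has
   * passed, whichever comes first. Rethrows the last failure when giving up,
   * so the caller can react (for example, by stepping down to standby).
   */
  public static <T> T retry(Callable<T> op, int maxAttempts,
      long maxElapsedMs, long sleepMs) throws Exception {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(maxElapsedMs);
    Exception last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return op.call();
      } catch (Exception e) {
        last = e; // remember the failure and decide whether to try again
      }
      // Stop when either the attempt budget or the time budget is exhausted.
      if (attempt == maxAttempts || System.nanoTime() - deadline >= 0) {
        break;
      }
      Thread.sleep(sleepMs);
    }
    throw last;
  }
}
```

With either bound in place, the caller gets a definite failure instead of retrying indefinitely, which is the point at which an active RM could conclude that it is no longer active.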


