Posted to yarn-issues@hadoop.apache.org by "Jason Lowe (JIRA)" <ji...@apache.org> on 2015/11/09 21:32:11 UTC

[jira] [Comment Edited] (YARN-4334) Ability to avoid ResourceManager recovery if state store is "too old"

    [ https://issues.apache.org/jira/browse/YARN-4334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14997296#comment-14997296 ] 

Jason Lowe edited comment on YARN-4334 at 11/9/15 8:31 PM:
-----------------------------------------------------------

Thanks for the prototype, Chang!

Ideally, when attempting to recover from an old state, we should still remember the apps but recover them in a completed state (either killed or failed).  It looks like the prototype will cause the RM to completely forget everything, which isn't ideal.  If we skip recovery but leave the old state in the state store, we risk a situation like the following:
# RM restarts late, recovers nothing
# RM updates the store timestamp
# RM restarts 
# RM tries to recover all the old state left from the first instance that wasn't cleaned up in the second
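
To illustrate the alternative suggested above, here is a rough sketch of recovering apps from an expired store in a terminal state instead of dropping them.  AppState, StoredApp and ExpiredStoreRecovery are hypothetical stand-ins for illustration, not the real RM classes or anything in the patch:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified application states (not the real RMAppState enum).
enum AppState { RUNNING, FINISHED, KILLED }

// Hypothetical record of an application as read back from the state store.
final class StoredApp {
  final String id;
  final AppState state;

  StoredApp(String id, AppState state) {
    this.id = id;
    this.state = state;
  }
}

final class ExpiredStoreRecovery {
  // Every stored app is still returned, so nothing is forgotten. When the
  // store is too old, apps that were still running are recovered as KILLED
  // so that no new attempts are launched for them.
  static List<StoredApp> recover(List<StoredApp> stored, boolean storeExpired) {
    List<StoredApp> recovered = new ArrayList<>();
    for (StoredApp app : stored) {
      boolean stillRunning = app.state == AppState.RUNNING;
      AppState state = (storeExpired && stillRunning) ? AppState.KILLED : app.state;
      recovered.add(new StoredApp(app.id, state));
    }
    return recovered;
  }
}
```

Because the apps are recovered rather than skipped, a later restart finds nothing stale left behind in the store to resurrect.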

Was there a reason to use a raw thread and sleeps for the update rather than a Timer?  In either case it needs to be a daemon thread.
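
For reference, a minimal sketch of what the Timer approach could look like.  LivenessUpdater and its fields are made-up names, and the task body only records the time in memory where the real code would write to the state store:

```java
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper that periodically refreshes a liveness timestamp.
final class LivenessUpdater {
  // The second constructor argument marks the timer thread as a daemon,
  // so it cannot keep the JVM alive during shutdown.
  private final Timer timer = new Timer("RM liveness updater", true);

  // Last time the task ran, or -1 if it has not run yet.
  final AtomicLong lastUpdateMs = new AtomicLong(-1L);

  void start(long periodMs) {
    timer.scheduleAtFixedRate(new TimerTask() {
      @Override
      public void run() {
        // Assumption: the real code would store this value in the state store.
        lastUpdateMs.set(System.currentTimeMillis());
      }
    }, 0L, periodMs);
  }

  void stop() {
    timer.cancel();
  }
}
```

This avoids hand-rolled sleep loops and interrupt handling, and cancel() gives a clean way to stop the updates.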

The recovery code should check the version first before doing anything else with the state store.

The conf settings give no hint, either in their names or in any documentation, as to what units to use.  Is it milliseconds?  Minutes?  Hours?  Why a default of 10000?

"RMLivenessKey" should be a static final constant to avoid the chance of typos.

The code has no check for the key missing a value -- db.get will return null if the key is missing.
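
A small sketch covering both the constant and the missing-key check.  The HashMap only stands in for the real store so the example is self-contained, and LivenessStore, NO_TIMESTAMP and the string encoding of the timestamp are assumptions, not what the patch does:

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical wrapper around the liveness entry in the state store.
final class LivenessStore {
  // One shared constant instead of repeating the string literal.
  static final String RM_LIVENESS_KEY = "RMLivenessKey";

  // Sentinel returned when no timestamp has ever been stored.
  static final long NO_TIMESTAMP = -1L;

  // Stand-in for the real database; Map.get also returns null on a missing key.
  private final Map<String, byte[]> db = new HashMap<>();

  void storeTimestamp(long timestampMs) {
    db.put(RM_LIVENESS_KEY, Long.toString(timestampMs).getBytes(StandardCharsets.UTF_8));
  }

  long loadTimestamp() {
    byte[] data = db.get(RM_LIVENESS_KEY);
    if (data == null) {
      // Key is missing, e.g. a store written before this feature existed.
      return NO_TIMESTAMP;
    }
    return Long.parseLong(new String(data, StandardCharsets.UTF_8));
  }
}
```

The caller then has to decide what a missing timestamp means; treating it as "not expired" is probably the safer choice for stores created before the key existed.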

Nit: a setting of zero should be equivalent to a -1 setting.  It makes no sense to configure it so the store is always expired.
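
Roughly what I have in mind for the expiry check, with the unit spelled out in the property name.  The property name and the StoreExpiry class are invented for illustration only:

```java
// Hypothetical helper for deciding whether the state store is too old.
final class StoreExpiry {
  // Made-up property name; the "-ms" suffix documents the unit.
  static final String EXPIRY_MS_KEY = "example.state-store.expiry-interval-ms";

  // Zero or any negative interval disables the check, so 0 and -1 behave
  // the same and the store can never be configured as "always expired".
  static boolean isExpired(long lastAliveMs, long nowMs, long expiryMs) {
    if (expiryMs <= 0) {
      return false;
    }
    return nowMs - lastAliveMs > expiryMs;
  }
}
```

For example, isExpired(0, 10001, 10000) is true, while isExpired(0, 10001, 0) and isExpired(0, 10001, -1) are both false.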







> Ability to avoid ResourceManager recovery if state store is "too old"
> ---------------------------------------------------------------------
>
>                 Key: YARN-4334
>                 URL: https://issues.apache.org/jira/browse/YARN-4334
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>            Reporter: Jason Lowe
>            Assignee: Chang Li
>         Attachments: YARN-4334.wip.patch
>
>
> There are times when a ResourceManager has been down long enough that ApplicationMasters and potentially external client-side monitoring mechanisms have given up completely.  If the ResourceManager starts back up and tries to recover we can get into situations where the RM launches new application attempts for the AMs that gave up, but then the client _also_ launches another instance of the app because it assumed everything was dead.
> It would be nice if the RM could be optionally configured to avoid trying to recover if the state store was "too old."  The RM would come up without any applications recovered, but we would avoid a double-submission situation.


