Posted to yarn-issues@hadoop.apache.org by "Sunil G (JIRA)" <ji...@apache.org> on 2015/09/07 17:10:46 UTC

[jira] [Commented] (YARN-4118) Newly submitted app maybe stuck at saving state if store operation failure is ignored in ZKRMStateStore

    [ https://issues.apache.org/jira/browse/YARN-4118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14733823#comment-14733823 ] 

Sunil G commented on YARN-4118:
-------------------------------

Hi [~jianhe]
This is a potential issue with ZK, especially for RMApp and RMAppAttempt. If a store error is not notified and the RM is not fail-fast, the RMApp may stay in NEW_SAVING. So is it OK to fire a failure event directly to RMApp and RMAppAttempt if any of their store/update/remove operations fail due to a store exception? Such direct error handling would move the app and app attempts into an error state rather than leaving them in limbo. I would like to try this if it's fine. Please share your thoughts.
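A minimal sketch of the proposal, assuming a heavily simplified model: `App`, `AppEvent`, `APP_SAVE_FAILED`, `StoreOperation` and `storeApp` are hypothetical names, not the real RMApp/RMStateStore API; only the NEW_SAVING state comes from the discussion. It shows the idea that the store reports every outcome, so a failed write moves the app to FAILED instead of leaving it in NEW_SAVING. An RMAppAttempt would be handled the same way with its own failure event.

```java
public class StoreFailureNotificationSketch {

    // Hypothetical, heavily simplified app states; only NEW_SAVING is from the comment.
    enum AppState { NEW_SAVING, SUBMITTED, FAILED }

    // Hypothetical events delivered to the app once a store operation completes.
    enum AppEvent { APP_SAVED, APP_SAVE_FAILED }

    static final class App {
        private AppState state = AppState.NEW_SAVING;

        AppState getState() { return state; }

        // Minimal transition logic: only NEW_SAVING reacts to store outcomes.
        void handle(AppEvent event) {
            if (state != AppState.NEW_SAVING) {
                return;
            }
            state = (event == AppEvent.APP_SAVED) ? AppState.SUBMITTED : AppState.FAILED;
        }
    }

    // Hypothetical store operation standing in for a ZK write.
    interface StoreOperation {
        void run() throws Exception;
    }

    // Runs the store operation and always reports the outcome to the app,
    // instead of swallowing the exception (the proposed direct error handling).
    static void storeApp(App app, StoreOperation op) {
        try {
            op.run();
            app.handle(AppEvent.APP_SAVED);
        } catch (Exception e) {
            app.handle(AppEvent.APP_SAVE_FAILED);
        }
    }

    public static void main(String[] args) {
        App ok = new App();
        storeApp(ok, () -> { });
        System.out.println(ok.getState()); // SUBMITTED

        App failed = new App();
        storeApp(failed, () -> { throw new Exception("ZK unavailable"); });
        System.out.println(failed.getState()); // FAILED
    }
}
```

The design choice here is that the failure path is an ordinary state-machine event, so the app ends in a terminal state the client can observe rather than waiting forever.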

> Newly submitted app maybe stuck at saving state if store operation failure is ignored in ZKRMStateStore
> -------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-4118
>                 URL: https://issues.apache.org/jira/browse/YARN-4118
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Jian He
>            Assignee: Sunil G
>
> In YARN-2019, we took a decision to ignore the failure and not fail the RM when ZK is unavailable.
> However, it leaves newly submitted apps stuck in the saving state.
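
To make the failure mode concrete, here is a hypothetical, simplified model of what happens when the store exception is only logged. It is not the actual ZKRMStateStore code; `App`, `StoreOperation` and `storeAppIgnoringFailure` are illustrative names. The app's only exit from NEW_SAVING is the "saved" notification, which never arrives when the write fails.

```java
public class IgnoredStoreFailureSketch {

    // Hypothetical, simplified app states; NEW_SAVING is the state named in the issue.
    enum AppState { NEW_SAVING, SUBMITTED }

    static final class App {
        private AppState state = AppState.NEW_SAVING;

        AppState getState() { return state; }

        // In this model the only way out of NEW_SAVING is a "saved" notification.
        void onSaved() {
            if (state == AppState.NEW_SAVING) {
                state = AppState.SUBMITTED;
            }
        }
    }

    // Hypothetical store operation standing in for a ZK write.
    interface StoreOperation {
        void run() throws Exception;
    }

    // Mirrors the policy described in the issue: a store failure neither fails
    // the RM nor notifies the app; the exception is only logged.
    static void storeAppIgnoringFailure(App app, StoreOperation op) {
        try {
            op.run();
            app.onSaved();
        } catch (Exception e) {
            System.err.println("Ignoring store failure: " + e.getMessage());
        }
    }

    public static void main(String[] args) {
        App app = new App();
        storeAppIgnoringFailure(app, () -> { throw new Exception("ZK unavailable"); });
        // No event ever reaches the app, so it stays in NEW_SAVING indefinitely.
        System.out.println(app.getState()); // NEW_SAVING
    }
}
```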



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)