Posted to yarn-issues@hadoop.apache.org by "Yuliya Feldman (JIRA)" <ji...@apache.org> on 2015/06/15 04:20:00 UTC

[jira] [Commented] (YARN-3803) Application hangs after more then one localization attempt fails on the same NM

    [ https://issues.apache.org/jira/browse/YARN-3803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14585370#comment-14585370 ] 

Yuliya Feldman commented on YARN-3803:
--------------------------------------

The state machine in the LocalizedResource class defines the following transitions:
{code}
    // From INIT (ref == 0, awaiting req)
    .addTransition(ResourceState.INIT, ResourceState.DOWNLOADING,
        ResourceEventType.REQUEST, new FetchResourceTransition())

    // From DOWNLOADING (ref > 0, may be localizing)
    .addTransition(ResourceState.DOWNLOADING, ResourceState.DOWNLOADING,
        ResourceEventType.REQUEST, new DuplicateFetchResourceTransition())
{code}

So the state machine assumes that if both the "from" state and the "to" state are _DOWNLOADING_ and the _ResourceEventType_ is _REQUEST_, then the resource is already being downloaded, and the transition taken is _DuplicateFetchResourceTransition_.
The problem is that "ref" is not greater than 0 here, because the resources were cleaned up during the first attempt, so we end up in a situation where nothing happens until the RM kills the app.
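
To make the failure mode concrete, here is a minimal toy model of the two transitions above. It is only a sketch and not the actual NodeManager code: ToyResource, failAndCleanUpWithoutReset() and the container names are hypothetical, and the assumption that the failed first attempt clears the references while the state stays at DOWNLOADING is my reading of the behaviour described above.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model for illustration only; NOT the Hadoop LocalizedResource class. */
class LocalizationHangSketch {
    enum State { INIT, DOWNLOADING }

    /** Hypothetical stand-in for a localized resource and its "ref" set. */
    static class ToyResource {
        State state = State.INIT;
        final List<String> refs = new ArrayList<>(); // containers waiting on the resource
        int fetchesStarted = 0;

        /** Models a REQUEST event. */
        void request(String containerId) {
            refs.add(containerId);
            if (state == State.INIT) {
                // INIT -> DOWNLOADING: a download is actually started.
                fetchesStarted++;
                state = State.DOWNLOADING;
            }
            // DOWNLOADING -> DOWNLOADING: treated as a duplicate request,
            // so the model assumes a download is in flight and starts nothing.
        }

        /** Assumed failure path: references are dropped, state is not reset. */
        void failAndCleanUpWithoutReset() {
            refs.clear();
        }
    }

    /** Runs two localization attempts and reports how many downloads started. */
    static int fetchesAfterRetry() {
        ToyResource resource = new ToyResource();
        resource.request("container_attempt1");  // first attempt starts a download
        resource.failAndCleanUpWithoutReset();   // first attempt fails, refs cleaned up
        resource.request("container_attempt2");  // second attempt looks like a duplicate
        return resource.fetchesStarted;
    }

    public static void main(String[] args) {
        System.out.println("fetches started: " + fetchesAfterRetry());
    }
}
```

In this model the second request never triggers a download, so the second attempt would wait indefinitely, which matches the hang described here.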


> Application hangs after more then one localization attempt fails on the same NM
> -------------------------------------------------------------------------------
>
>                 Key: YARN-3803
>                 URL: https://issues.apache.org/jira/browse/YARN-3803
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.7.0, 2.5.1
>            Reporter: Yuliya Feldman
>            Assignee: Yuliya Feldman
>            Priority: Minor
>
> In the sandbox (single node) environment with LinuxContainerExecutor, when the first application localization attempt fails, the second attempt cannot proceed, and the application subsequently hangs until the RM kills it as non-responding.


