Posted to yarn-dev@hadoop.apache.org by "Peter Simon (JIRA)" <ji...@apache.org> on 2017/12/28 13:38:00 UTC

[jira] [Created] (YARN-7686) Yarn containers failover if datanode/nodemanager fails

Peter Simon created YARN-7686:
---------------------------------

             Summary: Yarn containers failover if datanode/nodemanager fails
                 Key: YARN-7686
                 URL: https://issues.apache.org/jira/browse/YARN-7686
             Project: Hadoop YARN
          Issue Type: New Feature
          Components: resourcemanager
    Affects Versions: 2.6.0
            Reporter: Peter Simon


While an application was running on YARN, one of the datanodes/nodemanagers went offline due to power issues. The first application attempt failed because of the lost containers. When the second attempt started, no heartbeat interval had yet elapsed for the Namenode, so the second attempt was still offered the failed datanode/nodemanager as a possible worker node for its containers. Because the host was unreachable, those container attempts failed as well, which caused the second application attempt to fail and, with it, the whole application.
There could be a failover process for container attempts: if a new container cannot be brought up on one node, the ResourceManager should try to allocate it on a different node.
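
To illustrate the requested behaviour, here is a minimal sketch in Python. It is not YARN code and does not use any YARN API: the function launch_with_failover, the try_launch callback and the max_attempts limit are all hypothetical names chosen for this example. It only shows the idea of excluding a node after a failed launch and moving on to the next candidate.

```python
from typing import Callable, Iterable, Optional, Set


def launch_with_failover(
    nodes: Iterable[str],
    try_launch: Callable[[str], bool],
    max_attempts: int = 3,
) -> Optional[str]:
    """Try to start one container, moving to a different node after each failure.

    Hypothetical helper, not part of YARN. Returns the node that accepted
    the container, or None if no attempt succeeded.
    """
    failed: Set[str] = set()  # nodes already excluded for this container
    attempts = 0
    for node in nodes:
        if attempts >= max_attempts:
            break  # give up once the attempt budget is used
        if node in failed:
            continue  # never retry a node that already failed
        attempts += 1
        if try_launch(node):
            return node
        failed.add(node)
    return None
```

With such a policy, a single unreachable host would cost one container attempt rather than failing the whole application attempt, provided at least one other candidate node is available.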



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-dev-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-dev-help@hadoop.apache.org