Posted to yarn-issues@hadoop.apache.org by "Rohith (JIRA)" <ji...@apache.org> on 2015/04/01 07:12:53 UTC

[jira] [Commented] (YARN-3416) deadlock in a job between map and reduce cores allocation

    [ https://issues.apache.org/jira/browse/YARN-3416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14389999#comment-14389999 ] 

Rohith commented on YARN-3416:
------------------------------

bq. there are only 4 NodeManagers in cluster, so it is possible all 4 NodeManagers are in the blacklist
In YARN-1680, not all the NMs were in the blacklist; only one NM was blacklisted. This scenario can also happen in larger clusters; I have observed a similar issue in a 25-node cluster.
     The reason I suspect it is the same as YARN-1680: in your cluster, 300 reducers are running and have occupied all 300 cores, which means there is no room left to run mappers. At this point, if any reducer fails to fetch a mapper's output (for whatever reason), that map is marked as failed and its node is blacklisted. The blacklisted nodes still hold resources that could run some containers. In MR, reducer preemption is decided by several factors, one of which is the headroom. But the RM computes the headroom including the blacklisted nodes, which causes MR not to trigger reducer preemption. This is only a suspicion; there could also be a real hidden bug. If you provide the full AM logs, I can help you analyze whether it is the same as YARN-1680 or not.
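To make the suspected interaction concrete, here is a minimal sketch of the headroom check described above. This is not Hadoop's actual RM/AM code; all names and MB figures below are hypothetical, for illustration only.

{code:java}
/**
 * Minimal sketch of the suspected headroom/blacklist interaction.
 * NOT Hadoop's actual code: computeHeadroom, shouldPreemptReducers and
 * the figures below are hypothetical, for illustration only.
 */
public class HeadroomSketch {

    /** Headroom = capacity the app could still be allocated, in MB. */
    static long computeHeadroom(long clusterCapacityMb, long usedMb) {
        return Math.max(0, clusterCapacityMb - usedMb);
    }

    /** The AM preempts reducers only when a pending map does not
     *  fit into the headroom the RM reported. */
    static boolean shouldPreemptReducers(long headroomMb, long mapNeedMb,
                                         int pendingMaps) {
        return pendingMaps > 0 && headroomMb < mapNeedMb;
    }

    public static void main(String[] args) {
        // 4 nodes x 1024 MB; reducers hold 3072 MB on three of them;
        // the fourth node (1024 MB free) is blacklisted for this app.
        long clusterMb = 4 * 1024, usedMb = 3 * 1024, blacklistedFreeMb = 1024;
        long mapNeedMb = 1024;
        int pendingMaps = 1;

        // Suspected bug: RM still counts the blacklisted node's capacity.
        long reported = computeHeadroom(clusterMb, usedMb);                    // 1024
        // What the AM would need: headroom excluding blacklisted capacity.
        long actual = computeHeadroom(clusterMb - blacklistedFreeMb, usedMb);  // 0

        System.out.println("preempt with reported headroom? "
                + shouldPreemptReducers(reported, mapNeedMb, pendingMaps));    // false
        System.out.println("preempt with actual headroom?   "
                + shouldPreemptReducers(actual, mapNeedMb, pendingMaps));      // true
    }
}
{code}

With the inflated headroom the preemption check never fires, even though the only free capacity sits on a node the app cannot use; that matches the stuck state seen in YARN-1680.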


> deadlock in a job between map and reduce cores allocation 
> ----------------------------------------------------------
>
>                 Key: YARN-3416
>                 URL: https://issues.apache.org/jira/browse/YARN-3416
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler
>    Affects Versions: 2.6.0
>            Reporter: mai shurong
>            Priority: Critical
>
> I submitted a big job with 500 maps and 350 reduces to a queue (FairScheduler) with a 300-core maximum. When the job's maps reach 100%, the 300 running reduces occupy all 300 cores in the queue. Then a map fails and is retried, waiting for a core, while the 300 reduces are waiting for the failed map to finish, so a deadlock occurs. As a result, the job is blocked, and later jobs in the queue cannot run because there are no available cores in the queue.
> I think there is a similar issue for the memory limit of a queue.
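The reported scenario reduces to a small core-accounting sketch (the 300-core and single-retry figures come from the description above; the code is illustrative only, not actual FairScheduler logic):

{code:java}
// Illustrative core accounting for the reported scenario; this is not
// FairScheduler code, just the arithmetic of the circular wait.
public class DeadlockSketch {
    public static void main(String[] args) {
        int queueMaxCores = 300;     // queue maximum from the report
        int runningReducers = 300;   // reducers hold every core
        int pendingMapRetries = 1;   // the failed map waiting to rerun

        int freeCores = queueMaxCores - runningReducers;   // = 0

        // The retried map waits for a core; every reducer waits for that
        // map's output, so no core is ever released: a circular wait.
        boolean deadlocked = pendingMapRetries > 0 && freeCores == 0;
        System.out.println("deadlocked: " + deadlocked);   // true
    }
}
{code}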



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)