Posted to mapreduce-issues@hadoop.apache.org by "Nicolas Fraison (JIRA)" <ji...@apache.org> on 2017/11/08 09:59:01 UTC

[jira] [Commented] (MAPREDUCE-6659) Mapreduce App master waits long to kill containers on lost nodes.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-6659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16243637#comment-16243637 ] 

Nicolas Fraison commented on MAPREDUCE-6659:
--------------------------------------------

[~jlowe], MAPREDUCE-5465 is already applied on the hadoop release I use (cdh5.5.0).
I've tested the behaviour on both cdh5.5 and trunk when a nodemanager is lost, and it is the same.
The RM sends a LostNM event to the AM, which then tries to clean up the containers running on that node (on both cdh5.5 and trunk). The attempt fails only after the connection to the lost NM times out.
The main difference between cdh5.5 and trunk is that the timeout is much shorter in trunk (about 3 min instead of at least 30 min).
This is thanks to patches https://issues.apache.org/jira/browse/YARN-4414 and https://issues.apache.org/jira/browse/YARN-3554
Backporting those patches could be considered sufficient; what do you think?

> Mapreduce App master waits long to kill containers on lost nodes.
> -----------------------------------------------------------------
>
>                 Key: MAPREDUCE-6659
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6659
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am
>    Affects Versions: 2.6.0
>            Reporter: Laxman
>            Assignee: Nicolas Fraison
>
> MR Application master waits for very long time to cleanup and relaunch the tasks on lost nodes. Wait time is actually 2.5 hours (ipc.client.connect.max.retries * ipc.client.connect.max.retries.on.timeouts * ipc.client.connect.timeout = 10 * 45 * 20 = 9000 seconds = 2.5 hours)
> A similar issue in the RM-AM RPC protocol was fixed in YARN-3809.
> As fixed in YARN-3809, we may need to introduce new configurations to control this RPC retry behavior.
> Also, I feel this total retry time should be capped at the global task timeout (mapreduce.task.timeout = 600000 ms by default).
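The worst-case wait quoted in the description above can be sketched with a short calculation, using the default values given there (the property names come from the Hadoop IPC client configuration; the variable names here are just for illustration):

```python
# Worst-case time the MR AM can spend retrying the connection to a lost NM,
# using the IPC client defaults quoted in the issue description.
connect_max_retries = 10              # ipc.client.connect.max.retries
max_retries_on_timeouts = 45          # ipc.client.connect.max.retries.on.timeouts
connect_timeout_s = 20                # ipc.client.connect.timeout, in seconds

total_wait_s = connect_max_retries * max_retries_on_timeouts * connect_timeout_s
total_wait_h = total_wait_s / 3600

print(total_wait_s, total_wait_h)  # 9000 seconds = 2.5 hours
```

This is well beyond the 600-second default of mapreduce.task.timeout, which is why capping the retry window at the task timeout (as proposed above) would change the observed behaviour so dramatically.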



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
