Posted to yarn-dev@hadoop.apache.org by Sunil govind <su...@huawei.com> on 2013/12/31 06:25:13 UTC

ResourceManager wasting time allocating many containers that the AM then rejects, under a specific scenario

The TaskImpl class (in the MapReduce ApplicationMaster) has two transitions: RetroactiveKilledTransition and RetroactiveFailureTransition.
In a specific scenario, such as when a node becomes unstable [bad node], or when an external signal is raised to kill a successful, already-completed task,
RetroactiveKilledTransition is invoked. But the killed attempt is not counted in failedAttempts, so that data structure remains empty in this case.
This causes the map to be re-launched as a normal map task and not as a failed map.
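The decision can be sketched roughly as follows. This is a hypothetical simplification, not the actual TaskImpl code; the field name failedAttempts is from the source above, and the priority values 5/10/20 follow the fast-fail-map / reduce / map constants described below (lower number = higher priority):

```java
import java.util.HashSet;
import java.util.Set;

// Simplified sketch: a KILLED successful attempt is not recorded in
// failedAttempts (only RetroactiveFailureTransition records failures),
// so the re-launched map goes out at normal map priority (20) instead
// of fast-fail priority (5).
public class AskPrioritySketch {
    static final int PRIORITY_FAST_FAIL_MAP = 5;   // failed maps
    static final int PRIORITY_REDUCE        = 10;  // reducers
    static final int PRIORITY_MAP           = 20;  // normal maps

    final Set<String> failedAttempts = new HashSet<>();

    void onAttemptCompleted(String attemptId, String finalState) {
        // Mirrors the asymmetry between the two retroactive transitions:
        // only a FAILED attempt is added to failedAttempts.
        if (finalState.equals("FAILED")) {
            failedAttempts.add(attemptId);
        }
    }

    int priorityForRerun() {
        // Empty failedAttempts => killed map re-runs as a normal map.
        return failedAttempts.isEmpty() ? PRIORITY_MAP : PRIORITY_FAST_FAIL_MAP;
    }
}
```

A map killed on a bad node therefore re-enters the ask queue at priority 20, behind any pending priority-10 reducer requests.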

Assume the cluster is fully occupied by reducers, and a successful map is killed by an external command [./mapred kill-task <ID>] or because of a bad node.
The AM then sends a new ask for the map, but that ask has to wait until the RM has processed all the reducer requests in its queue [priority 10].
The new map task's ask goes out at priority 20. If it went out at priority 5, as a failed map, it would be processed immediately.
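The effect can be shown with a toy simulation (a hypothetical model of priority-ordered allocation, not the actual RM scheduler; "Ask" and "serveAll" are made-up names, and lower priority number means served first):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Toy model: outstanding asks are served in priority order, so a
// re-launched map asked at priority 20 waits behind every queued
// reducer ask at priority 10, while a failed-map ask at priority 5
// would jump the whole queue.
public class PriorityDemo {
    record Ask(String name, int priority) {}

    static List<String> serveAll(List<Ask> asks) {
        PriorityQueue<Ask> queue =
            new PriorityQueue<>(Comparator.comparingInt(Ask::priority));
        queue.addAll(asks);
        List<String> order = new ArrayList<>();
        while (!queue.isEmpty()) order.add(queue.poll().name());
        return order;
    }

    public static void main(String[] args) {
        List<Ask> asks = new ArrayList<>();
        for (int i = 0; i < 100; i++) asks.add(new Ask("reduce-" + i, 10));

        // Killed map re-requested at normal map priority: served last.
        asks.add(new Ask("relaunched-map", 20));
        List<String> normal = serveAll(asks);
        System.out.println("at priority 20, map served at position "
            + (normal.indexOf("relaunched-map") + 1));

        // Same ask at fast-fail priority: served first.
        asks.set(asks.size() - 1, new Ask("relaunched-map", 5));
        List<String> fastFail = serveAll(asks);
        System.out.println("at priority 5, map served at position "
            + (fastFail.indexOf("relaunched-map") + 1));
    }
}
```

With 100 queued reducer asks, the priority-20 map is served 101st, while the same ask at priority 5 is served first.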

If there are hundreds of reducers to be processed and the cluster is small, it may take minutes before this map task is scheduled,
and many of the reducer allocations made in the meantime will be rejected by the AM.

Is this the expected behavior? Kindly let me know whether this can be improved.