
Posted to common-user@hadoop.apache.org by Giridharan Anantharaman <gi...@pv.com> on 2012/06/06 19:35:26 UTC

Reduce task does not time out if one of the mapper hosts is not reachable

Hi

I am using version 1.0.1, and the so-called reduce hang problem had to do with my screw-up in the cluster configuration, which I have since fixed, or so I think. However, this raised some other questions, hence this email.

- I have a bunch of MR jobs that run daily, and I noticed that one of them (not always the same one) would hang. In the mapred admin console, it would show map 100% complete and reduce stuck at some percentage in the copy phase. After some digging around in the task tracker logs, I found that the reduce task could not copy map outputs. Here is the exception: 2012-06-06 08:17:15,404 WARN org.apache.hadoop.mapred.ReduceTask: java.net.UnknownHostException:

- So clearly my slave nodes could not see each other. Every time a reduce task was scheduled on one slave node and one of the map tasks had run on another slave node, this problem would occur, because not all slaves were listed in the /etc/hosts file on each slave box. I fixed that and all is well. That said, my question is: shouldn't the reduce task time out after a while when it cannot copy the map output? It just seems to retry continuously. I even let the MR job (which usually completes in 10 or 15 minutes) sit there for up to 8 hours to see if it would time out and fail the job.
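
For anyone who wants to check for the same /etc/hosts gap, here is a minimal sketch of the kind of check that could be run on each slave box. It assumes Python 3 is available on the node. The function name and the hostnames in the usage note are hypothetical and not part of Hadoop; it only asks the local resolver whether each name can be looked up.

```python
import socket


def unresolvable_hosts(hostnames, resolver=socket.gethostbyname):
    """Return the hostnames that the given resolver cannot look up.

    `resolver` defaults to the system resolver (which consults /etc/hosts
    and DNS, depending on the node's configuration). It is a parameter only
    so the check can be exercised without touching the network.
    """
    missing = []
    for name in hostnames:
        try:
            resolver(name)
        except OSError:
            # socket.gaierror and socket.herror are subclasses of OSError,
            # so any lookup failure lands here.
            missing.append(name)
    return missing
```

Calling unresolvable_hosts(["slave1", "slave2"]) on a slave box (with the real hostnames substituted for these made-up ones) returns the names that box cannot resolve; an empty list means every slave is visible from that node.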

-Giri