Posted to user@hadoop.apache.org by Francisco Rodera <tr...@gmail.com> on 2013/09/17 13:55:17 UTC

How does one make a Hadoop task attempt fail after too many data fetch failures?

Hi,

I have a Hadoop reduce task attempt that will never fail or complete
unless I manually fail/kill it.

The problem surfaces when the task tracker node (due to network issues that
I am still investigating) loses connectivity with other task trackers/data
nodes, but not with the job tracker.

Basically, the reduce task cannot fetch the necessary map output data from
other nodes because the connections time out, so it blacklists those hosts.
I have no problem with that; the blacklisting is expected and needed. The
problem is that the attempt keeps retrying the same blacklisted hosts for
hours (honoring what seems to be an exponential back-off algorithm) until I
manually kill it. The latest long-running task attempt had been retrying for
more than 9 hours.
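
For illustration only, the retry timing in my logs looks like a capped
exponential back-off with no overall give-up condition, something like the
sketch below. This is NOT the actual Hadoop shuffle source; the initial
delay and the cap are guesses on my part:

// Sketch of the retry pattern observed in the logs -- NOT the actual
// Hadoop shuffle code. The delay between fetch attempts roughly doubles
// up to a cap, and the loop never gives up on its own.
public class FetchBackoffSketch {
    public static void main(String[] args) throws InterruptedException {
        long delayMs = 1000L;            // assumed initial retry delay
        final long maxDelayMs = 300000L; // assumed back-off cap (5 minutes)
        while (true) {                   // no give-up condition, as observed
            boolean fetched = false;     // stand-in for a fetch that always times out
            if (fetched) {
                break;
            }
            System.out.println("copy failed, retrying in " + delayMs + " ms");
            Thread.sleep(delayMs);
            delayMs = Math.min(delayMs * 2, maxDelayMs); // capped exponential back-off
        }
    }
}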

I see hundreds of messages like these in the log:

2013-09-09 22:34:47,251 WARN org.apache.hadoop.mapred.ReduceTask
(MapOutputCopier attempt_201309091958_0004_r_000044_0.1):
attempt_201309091958_0004_r_000044_0 copy failed:
attempt_201309091958_0004_m_001100_0 from X.X.X.X
2013-09-09 22:34:47,252 WARN org.apache.hadoop.mapred.ReduceTask
(MapOutputCopier attempt_201309091958_0004_r_000044_0.1):
java.net.SocketTimeoutException: connect timed out

Is there any way or setting to specify that after n retries or seconds the
task attempt should fail and be restarted automatically on another task
tracker host?

These are some of the relevant reduce/timeout parameters I have set in my
cluster:

<property>
  <name>mapreduce.reduce.shuffle.connect.timeout</name>
  <value>180000</value>
</property>
<property>
  <name>mapreduce.reduce.shuffle.read.timeout</name>
  <value>180000</value>
</property>
<property>
  <name>mapreduce.reduce.shuffle.maxfetchfailures</name>
  <value>10</value>
</property>

<property>
  <name>mapred.task.timeout</name>
  <value>600000</value>
</property>
<property>
  <name>mapred.jobtracker.blacklist.fault-timeout-window</name>
  <value>180</value>
</property>
<property>
  <name>mapred.healthChecker.script.timeout</name>
  <value>600000</value>
</property>
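
In case it helps clarify what I am after: ideally I could lower the same
shuffle properties per job so the attempt fails fast instead of retrying
forever. A minimal sketch with the old mapred API (the values are
illustrative, and I have not verified that lowering them actually bounds
the total retry time on 0.20.205):

import org.apache.hadoop.mapred.JobConf;

public class ShuffleFailFast {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Report a map output as failed after fewer fetch attempts
        // (my cluster currently uses 10).
        conf.setInt("mapreduce.reduce.shuffle.maxfetchfailures", 3);
        // Give up on each connect/read sooner; values in milliseconds.
        conf.setInt("mapreduce.reduce.shuffle.connect.timeout", 30000);
        conf.setInt("mapreduce.reduce.shuffle.read.timeout", 30000);
        // ...configure mapper/reducer and submit the job as usual.
    }
}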

BTW, this job is running on an AWS EMR cluster (Hadoop version: 0.20.205).

Thanks in advance.

Francisco.