Posted to common-user@hadoop.apache.org by Amar Kamat <am...@yahoo-inc.com> on 2008/06/20 06:56:40 UTC

Re: Too many fetch failures AND Shuffle error

Sayali Kulkarni wrote:
> Hello,
> I have been getting 
> Too many fetch failures (in the map operation)
> and 
> shuffle error (in the reduce operation)
>
>   
Can you post the reducer logs? How many nodes are there in the cluster? 
Are you seeing this for all the maps and reducers? Are the reducers 
progressing at all? Are all the maps the reducer fails to fetch from on a 
remote machine? Are all the failed maps/reducers from the same machine? 
Can you provide some more details?
Amar
> and am unable to complete any job on the cluster.
>
> I have 5 slaves in the cluster. So I have the following values in the hadoop-site.xml file:
>   <name>mapred.map.tasks</name>
>   <value>53</value>
> // 53 = nearest prime to 5*10
>
>   <name>mapred.reduce.tasks</name>
>   <value>7</value>
> // 7 = nearest prime to 5
>
> Please let me know what would be the suggested fix for this.
>
> Hadoop version I am using is hadoop-0.16.3 and it is installed on  Ubuntu.
>
> Thanks!
> --Sayali
>
>
>   


Re: Too many fetch failures AND Shuffle error

Posted by Amar Kamat <am...@yahoo-inc.com>.
Tarandeep Singh wrote:
> I am getting this error as well.
> As Sayali mentioned in his mail, I updated the /etc/hosts file with the
> slave machines' IP addresses, but I am still getting this error.
>
> Amar, which URL were you talking about in your mail -
> "There will be a URL associated with a map that the reducers try to fetch
> (check the reducer logs for this url)"
>
> Please tell me where I should look for it... I will try to access it
> manually to see whether this error is due to a firewall.
>   
One thing you can do is check whether all the maps that failed while 
fetching are on a remote host. Look at the web UI to find out where each 
map task finished, and look at the reduce task logs to find out which 
map fetches failed.

I am not sure if the reduce task logs have it. Try building the URL yourself:
port = tasktracker.http.port (this is set through the conf)
tthost = the tasktracker hostname (the destination tasktracker from which the 
map output needs to be fetched)
jobid = the complete job id, "job_..."
mapid = the task attempt id, "attempt_...", that successfully completed 
the map
reduce-partition-id = the partition number of the reduce task; 
task_..._r_$i_$j has reduce-partition-id as int-value($i).

url = 
http://'$tthost':'$port'/mapOutput?job='$jobid'&map='$mapid'&reduce='$reduce-partition-id'
where each '$var' is what you have to substitute.
Amar
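
For illustration, a minimal Python sketch of that manual fetch, using placeholder values: the host is hypothetical, tasktracker.http.port commonly defaults to 50060 (check your conf), and the job/task ids below are copied from the log later in this thread.

import urllib.request

# All of these values are placeholders -- substitute your own from the
# job tracker web UI and the reducer logs.
tthost = "slave1"                              # tasktracker hosting the map output (hypothetical)
port = 50060                                   # tasktracker.http.port
jobid = "job_200806201106_0001"                # complete job id
mapid = "task_200806201106_0001_m_000011_0"    # successfully completed map attempt
reduce_partition = 2                           # $i from task_..._r_000002_0

url = ("http://%s:%d/mapOutput?job=%s&map=%s&reduce=%d"
       % (tthost, port, jobid, mapid, reduce_partition))
print("Fetching", url)
try:
    data = urllib.request.urlopen(url, timeout=10).read()
    print("OK, fetched %d bytes" % len(data))
except Exception as err:
    # A refused connection or timeout here, when run from the reducer's node,
    # points at the access problem described above (firewall, bad hostname, ...).
    print("Fetch failed:", err)

Run it from the node where the failing reducer is scheduled; if the same URL works from the map's own node but not from the reducer's node, the problem is between the two machines rather than in the job itself.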
> Thanks,
> Taran
>
> On Thu, Jun 19, 2008 at 11:43 PM, Amar Kamat <am...@yahoo-inc.com> wrote:
>
>   
>> Yeah. With 2 nodes the reducers will go up to 16% because they are
>> able to fetch map outputs from the same machine (locally) but fail to copy them
>> from the remote machine. A common reason in such cases is *restricted machine
>> access* (a firewall etc.). The web server on a machine/node hosts the map outputs,
>> which the reducers on the other machine are not able to access. There will
>> be a URL associated with a map that the reducers try to fetch (check the
>> reducer logs for this url). Just try accessing it manually from the
>> reducer's machine/node. Most likely this experiment will also fail. Let us
>> know if this is not the case.
>> Amar
>>
>> Sayali Kulkarni wrote:
>>
>>     
>>> Can you post the reducer logs. How many nodes are there in the cluster?
>>>       
>>>>         
>>> There are 6 nodes in the cluster - 1 master and 5 slaves
>>>  I tried to reduce the number of nodes, and found that the problem is
>>> solved only if there is a single node in the cluster. So I can deduce that
>>> the problem lies somewhere in the configuration.
>>>
>>> Configuration file:
>>> <?xml version="1.0"?>
>>> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>>>
>>> <!-- Put site-specific property overrides in this file. -->
>>>
>>> <configuration>
>>>
>>> <property>
>>>  <name>hadoop.tmp.dir</name>
>>>  <value>/extra/HADOOP/hadoop-0.16.3/tmp/dir/hadoop-${user.name}</value>
>>>  <description>A base for other temporary directories.</description>
>>> </property>
>>>
>>> <property>
>>>  <name>fs.default.name</name>
>>>  <value>hdfs://10.105.41.25:54310</value>
>>>  <description>The name of the default file system.  A URI whose
>>>  scheme and authority determine the FileSystem implementation.  The
>>>  uri's scheme determines the config property (fs.SCHEME.impl) naming
>>>  the FileSystem implementation class.  The uri's authority is used to
>>>  determine the host, port, etc. for a filesystem.</description>
>>> </property>
>>>
>>> <property>
>>>  <name>mapred.job.tracker</name>
>>>  <value>10.105.41.25:54311</value>
>>>  <description>The host and port that the MapReduce job tracker runs
>>>  at.  If "local", then jobs are run in-process as a single map
>>>  and reduce task.
>>>  </description>
>>> </property>
>>>
>>> <property>
>>>  <name>dfs.replication</name>
>>>  <value>2</value>
>>>  <description>Default block replication.
>>>  The actual number of replications can be specified when the file is
>>> created.
>>>  The default is used if replication is not specified in create time.
>>>  </description>
>>> </property>
>>>
>>>
>>> <property>
>>>  <name>mapred.child.java.opts</name>
>>>  <value>-Xmx1048M</value>
>>> </property>
>>>
>>> <property>
>>>        <name>mapred.local.dir</name>
>>>        <value>/extra/HADOOP/hadoop-0.16.3/tmp/mapred</value>
>>> </property>
>>>
>>> <property>
>>>  <name>mapred.map.tasks</name>
>>>  <value>53</value>
>>>  <description>The default number of map tasks per job.  Typically set
>>>  to a prime several times greater than number of available hosts.
>>>  Ignored when mapred.job.tracker is "local".
>>>  </description>
>>> </property>
>>>
>>> <property>
>>>  <name>mapred.reduce.tasks</name>
>>>  <value>7</value>
>>>  <description>The default number of reduce tasks per job.  Typically set
>>>  to a prime close to the number of available hosts.  Ignored when
>>>  mapred.job.tracker is "local".
>>>  </description>
>>> </property>
>>>
>>> </configuration>
>>>
>>>
>>> ============
>>> This is the output that I get when running the tasks with 2 nodes in the
>>> cluster:
>>>
>>> 08/06/20 11:07:45 INFO mapred.FileInputFormat: Total input paths to
>>> process : 1
>>> 08/06/20 11:07:45 INFO mapred.JobClient: Running job:
>>> job_200806201106_0001
>>> 08/06/20 11:07:46 INFO mapred.JobClient:  map 0% reduce 0%
>>> 08/06/20 11:07:53 INFO mapred.JobClient:  map 8% reduce 0%
>>> 08/06/20 11:07:55 INFO mapred.JobClient:  map 17% reduce 0%
>>> 08/06/20 11:07:57 INFO mapred.JobClient:  map 26% reduce 0%
>>> 08/06/20 11:08:00 INFO mapred.JobClient:  map 34% reduce 0%
>>> 08/06/20 11:08:01 INFO mapred.JobClient:  map 43% reduce 0%
>>> 08/06/20 11:08:04 INFO mapred.JobClient:  map 47% reduce 0%
>>> 08/06/20 11:08:05 INFO mapred.JobClient:  map 52% reduce 0%
>>> 08/06/20 11:08:08 INFO mapred.JobClient:  map 60% reduce 0%
>>> 08/06/20 11:08:09 INFO mapred.JobClient:  map 69% reduce 0%
>>> 08/06/20 11:08:10 INFO mapred.JobClient:  map 73% reduce 0%
>>> 08/06/20 11:08:12 INFO mapred.JobClient:  map 78% reduce 0%
>>> 08/06/20 11:08:13 INFO mapred.JobClient:  map 82% reduce 0%
>>> 08/06/20 11:08:15 INFO mapred.JobClient:  map 91% reduce 1%
>>> 08/06/20 11:08:16 INFO mapred.JobClient:  map 95% reduce 1%
>>> 08/06/20 11:08:18 INFO mapred.JobClient:  map 99% reduce 3%
>>> 08/06/20 11:08:23 INFO mapred.JobClient:  map 100% reduce 3%
>>> 08/06/20 11:08:25 INFO mapred.JobClient:  map 100% reduce 7%
>>> 08/06/20 11:08:28 INFO mapred.JobClient:  map 100% reduce 10%
>>> 08/06/20 11:08:30 INFO mapred.JobClient:  map 100% reduce 11%
>>> 08/06/20 11:08:33 INFO mapred.JobClient:  map 100% reduce 12%
>>> 08/06/20 11:08:35 INFO mapred.JobClient:  map 100% reduce 14%
>>> 08/06/20 11:08:38 INFO mapred.JobClient:  map 100% reduce 15%
>>> 08/06/20 11:09:54 INFO mapred.JobClient:  map 100% reduce 13%
>>> 08/06/20 11:09:54 INFO mapred.JobClient: Task Id :
>>> task_200806201106_0001_r_000002_0, Status : FAILED
>>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>>> 08/06/20 11:09:56 INFO mapred.JobClient:  map 100% reduce 9%
>>> 08/06/20 11:09:56 INFO mapred.JobClient: Task Id :
>>> task_200806201106_0001_r_000003_0, Status : FAILED
>>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>>> 08/06/20 11:09:56 INFO mapred.JobClient: Task Id :
>>> task_200806201106_0001_m_000011_0, Status : FAILED
>>> Too many fetch-failures
>>> 08/06/20 11:09:57 INFO mapred.JobClient:  map 95% reduce 9%
>>> 08/06/20 11:09:59 INFO mapred.JobClient:  map 100% reduce 9%
>>> 08/06/20 11:10:04 INFO mapred.JobClient:  map 100% reduce 10%
>>> 08/06/20 11:10:07 INFO mapred.JobClient:  map 100% reduce 11%
>>> 08/06/20 11:10:09 INFO mapred.JobClient:  map 100% reduce 13%
>>> 08/06/20 11:10:12 INFO mapred.JobClient:  map 100% reduce 14%
>>> 08/06/20 11:10:14 INFO mapred.JobClient:  map 100% reduce 15%
>>> 08/06/20 11:10:17 INFO mapred.JobClient:  map 100% reduce 16%
>>> 08/06/20 11:10:24 INFO mapred.JobClient:  map 100% reduce 13%
>>> 08/06/20 11:10:24 INFO mapred.JobClient: Task Id :
>>> task_200806201106_0001_r_000000_0, Status : FAILED
>>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>>> 08/06/20 11:10:29 INFO mapred.JobClient:  map 100% reduce 11%
>>> 08/06/20 11:10:29 INFO mapred.JobClient: Task Id :
>>> task_200806201106_0001_r_000001_0, Status : FAILED
>>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>>> 08/06/20 11:10:29 INFO mapred.JobClient: Task Id :
>>> task_200806201106_0001_m_000003_0, Status : FAILED
>>> Too many fetch-failures
>>> 08/06/20 11:10:32 INFO mapred.JobClient:  map 100% reduce 12%
>>> 08/06/20 11:10:37 INFO mapred.JobClient:  map 100% reduce 13%
>>> 08/06/20 11:10:42 INFO mapred.JobClient:  map 100% reduce 14%
>>> 08/06/20 11:10:47 INFO mapred.JobClient:  map 100% reduce 16%
>>> 08/06/20 11:10:52 INFO mapred.JobClient:  map 95% reduce 16%
>>> 08/06/20 11:10:52 INFO mapred.JobClient: Task Id :
>>> task_200806201106_0001_m_000020_0, Status : FAILED
>>> Too many fetch-failures
>>> 08/06/20 11:10:54 INFO mapred.JobClient:  map 100% reduce 16%
>>> 08/06/20 11:11:02 INFO mapred.JobClient: Task Id :
>>> task_200806201106_0001_m_000017_0, Status : FAILED
>>> Too many fetch-failures
>>> 08/06/20 11:11:09 INFO mapred.JobClient:  map 100% reduce 17%
>>> 08/06/20 11:11:24 INFO mapred.JobClient:  map 95% reduce 17%
>>> 08/06/20 11:11:24 INFO mapred.JobClient: Task Id :
>>> task_200806201106_0001_m_000007_0, Status : FAILED
>>> Too many fetch-failures
>>> 08/06/20 11:11:27 INFO mapred.JobClient:  map 100% reduce 17%
>>> 08/06/20 11:11:32 INFO mapred.JobClient: Task Id :
>>> task_200806201106_0001_m_000012_0, Status : FAILED
>>> Too many fetch-failures
>>> 08/06/20 11:11:34 INFO mapred.JobClient:  map 95% reduce 17%
>>> 08/06/20 11:11:34 INFO mapred.JobClient: Task Id :
>>> task_200806201106_0001_m_000019_0, Status : FAILED
>>> Too many fetch-failures
>>> 08/06/20 11:11:39 INFO mapred.JobClient:  map 91% reduce 18%
>>> 08/06/20 11:11:39 INFO mapred.JobClient: Task Id :
>>> task_200806201106_0001_m_000002_0, Status : FAILED
>>> Too many fetch-failures
>>> 08/06/20 11:11:41 INFO mapred.JobClient:  map 95% reduce 18%
>>> 08/06/20 11:11:42 INFO mapred.JobClient:  map 100% reduce 19%
>>> 08/06/20 11:11:42 INFO mapred.JobClient: Task Id :
>>> task_200806201106_0001_m_000006_0, Status : FAILED
>>> Too many fetch-failures
>>> 08/06/20 11:11:44 INFO mapred.JobClient:  map 100% reduce 17%
>>> 08/06/20 11:11:44 INFO mapred.JobClient: Task Id :
>>> task_200806201106_0001_r_000003_1, Status : FAILED
>>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>>> 08/06/20 11:11:51 INFO mapred.JobClient:  map 100% reduce 18%
>>> 08/06/20 11:11:54 INFO mapred.JobClient:  map 100% reduce 19%
>>> 08/06/20 11:11:59 INFO mapred.JobClient:  map 95% reduce 19%
>>> 08/06/20 11:11:59 INFO mapred.JobClient: Task Id :
>>> task_200806201106_0001_m_000010_0, Status : FAILED
>>> Too many fetch-failures
>>> 08/06/20 11:12:02 INFO mapred.JobClient:  map 100% reduce 19%
>>> 08/06/20 11:12:07 INFO mapred.JobClient:  map 100% reduce 20%
>>> 08/06/20 11:12:08 INFO mapred.JobClient:  map 100% reduce 33%
>>> 08/06/20 11:12:09 INFO mapred.JobClient:  map 100% reduce 47%
>>> 08/06/20 11:12:11 INFO mapred.JobClient:  map 100% reduce 60%
>>> 08/06/20 11:12:16 INFO mapred.JobClient:  map 100% reduce 62%
>>> 08/06/20 11:12:24 INFO mapred.JobClient:  map 100% reduce 63%
>>> 08/06/20 11:12:26 INFO mapred.JobClient:  map 100% reduce 64%
>>> 08/06/20 11:12:31 INFO mapred.JobClient:  map 100% reduce 65%
>>> 08/06/20 11:12:31 INFO mapred.JobClient: Task Id :
>>> task_200806201106_0001_m_000019_1, Status : FAILED
>>> Too many fetch-failures
>>> 08/06/20 11:12:36 INFO mapred.JobClient:  map 100% reduce 66%
>>> 08/06/20 11:12:38 INFO mapred.JobClient:  map 100% reduce 67%
>>> 08/06/20 11:12:39 INFO mapred.JobClient:  map 100% reduce 80%
>>>
>>> ===============
>>>
>>>
>>>
>>>       
>>>> Are you seeing this for all the maps and reducers?
>>>>
>>>>         
>>> Yes, this happens on all the maps and reducers. I tried to keep just 2
>>> nodes in the cluster but still the problem exists.
>>>
>>>
>>>
>>>       
>>>> Are the reducers progressing at all?
>>>>
>>>>
>>>>         
>>> The reducers continue to execute up to a certain point, but after that they
>>> just do not proceed at all. They just stop at an average of 16%.
>>>
>>>
>>>       
>>>> Are all the maps that the reducer is failing from a remote machine?
>>>>
>>>>         
>>> Yes.
>>>
>>>
>>>
>>>       
>>>> Are all the failed maps/reducers from the same machine?
>>>>
>>>>         
>>> All the maps and reducers are failing anyway.
>>> Thanks for the help in advance,
>>>
>>> Regards,
>>> Sayali
>>>
>>>
>>>
>>>       
>>     
>
>   


Re: Too many fetch failures AND Shuffle error

Posted by Tarandeep Singh <ta...@gmail.com>.
I am getting this error as well.
As Sayali mentioned in his mail, I updated the /etc/hosts file with the
slave machines' IP addresses, but I am still getting this error.

Amar, which URL were you talking about in your mail -
"There will be a URL associated with a map that the reducers try to fetch
(check the reducer logs for this url)"

Please tell me where I should look for it... I will try to access it
manually to see whether this error is due to a firewall.

Thanks,
Taran
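
Since the /etc/hosts entries keep coming up in this thread, here is a small, hypothetical Python check (the hostnames and addresses are made up) that can be run on each node to confirm that every slave name resolves to the IP you expect; a mismatch, or a node's own hostname resolving only to 127.0.0.1, is a commonly reported cause of these fetch failures.

import socket

# Hypothetical hostname -> expected-IP map; fill in your own slaves.
expected = {
    "slave1": "10.105.41.26",
    "slave2": "10.105.41.27",
}

for host, ip in expected.items():
    try:
        resolved = socket.gethostbyname(host)
        status = "OK" if resolved == ip else "MISMATCH (got %s)" % resolved
    except socket.error as err:
        status = "FAILED (%s)" % err
    print("%-10s expected %-15s %s" % (host, ip, status))

# The node's own hostname should not resolve only to the loopback address.
own = socket.gethostname()
print(own, "resolves to", socket.gethostbyname(own))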

On Thu, Jun 19, 2008 at 11:43 PM, Amar Kamat <am...@yahoo-inc.com> wrote:

> Yeah. With 2 nodes the reducers will go up to 16% because they are
> able to fetch map outputs from the same machine (locally) but fail to copy them
> from the remote machine. A common reason in such cases is *restricted machine
> access* (a firewall etc.). The web server on a machine/node hosts the map outputs,
> which the reducers on the other machine are not able to access. There will
> be a URL associated with a map that the reducers try to fetch (check the
> reducer logs for this url). Just try accessing it manually from the
> reducer's machine/node. Most likely this experiment will also fail. Let us
> know if this is not the case.
> Amar
>
> Sayali Kulkarni wrote:
>
>> Can you post the reducer logs. How many nodes are there in the cluster?
>>>
>>>
>> There are 6 nodes in the cluster - 1 master and 5 slaves
>>  I tried to reduce the number of nodes, and found that the problem is
>> solved only if there is a single node in the cluster. So I can deduce that
>> the problem lies somewhere in the configuration.
>>
>> Configuration file:
>> <?xml version="1.0"?>
>> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>>
>> <!-- Put site-specific property overrides in this file. -->
>>
>> <configuration>
>>
>> <property>
>>  <name>hadoop.tmp.dir</name>
>>  <value>/extra/HADOOP/hadoop-0.16.3/tmp/dir/hadoop-${user.name}</value>
>>  <description>A base for other temporary directories.</description>
>> </property>
>>
>> <property>
>>  <name>fs.default.name</name>
>>  <value>hdfs://10.105.41.25:54310</value>
>>  <description>The name of the default file system.  A URI whose
>>  scheme and authority determine the FileSystem implementation.  The
>>  uri's scheme determines the config property (fs.SCHEME.impl) naming
>>  the FileSystem implementation class.  The uri's authority is used to
>>  determine the host, port, etc. for a filesystem.</description>
>> </property>
>>
>> <property>
>>  <name>mapred.job.tracker</name>
>>  <value>10.105.41.25:54311</value>
>>  <description>The host and port that the MapReduce job tracker runs
>>  at.  If "local", then jobs are run in-process as a single map
>>  and reduce task.
>>  </description>
>> </property>
>>
>> <property>
>>  <name>dfs.replication</name>
>>  <value>2</value>
>>  <description>Default block replication.
>>  The actual number of replications can be specified when the file is
>> created.
>>  The default is used if replication is not specified in create time.
>>  </description>
>> </property>
>>
>>
>> <property>
>>  <name>mapred.child.java.opts</name>
>>  <value>-Xmx1048M</value>
>> </property>
>>
>> <property>
>>        <name>mapred.local.dir</name>
>>        <value>/extra/HADOOP/hadoop-0.16.3/tmp/mapred</value>
>> </property>
>>
>> <property>
>>  <name>mapred.map.tasks</name>
>>  <value>53</value>
>>  <description>The default number of map tasks per job.  Typically set
>>  to a prime several times greater than number of available hosts.
>>  Ignored when mapred.job.tracker is "local".
>>  </description>
>> </property>
>>
>> <property>
>>  <name>mapred.reduce.tasks</name>
>>  <value>7</value>
>>  <description>The default number of reduce tasks per job.  Typically set
>>  to a prime close to the number of available hosts.  Ignored when
>>  mapred.job.tracker is "local".
>>  </description>
>> </property>
>>
>> </configuration>
>>
>>
>> ============
>> This is the output that I get when running the tasks with 2 nodes in the
>> cluster:
>>
>> 08/06/20 11:07:45 INFO mapred.FileInputFormat: Total input paths to
>> process : 1
>> 08/06/20 11:07:45 INFO mapred.JobClient: Running job:
>> job_200806201106_0001
>> 08/06/20 11:07:46 INFO mapred.JobClient:  map 0% reduce 0%
>> 08/06/20 11:07:53 INFO mapred.JobClient:  map 8% reduce 0%
>> 08/06/20 11:07:55 INFO mapred.JobClient:  map 17% reduce 0%
>> 08/06/20 11:07:57 INFO mapred.JobClient:  map 26% reduce 0%
>> 08/06/20 11:08:00 INFO mapred.JobClient:  map 34% reduce 0%
>> 08/06/20 11:08:01 INFO mapred.JobClient:  map 43% reduce 0%
>> 08/06/20 11:08:04 INFO mapred.JobClient:  map 47% reduce 0%
>> 08/06/20 11:08:05 INFO mapred.JobClient:  map 52% reduce 0%
>> 08/06/20 11:08:08 INFO mapred.JobClient:  map 60% reduce 0%
>> 08/06/20 11:08:09 INFO mapred.JobClient:  map 69% reduce 0%
>> 08/06/20 11:08:10 INFO mapred.JobClient:  map 73% reduce 0%
>> 08/06/20 11:08:12 INFO mapred.JobClient:  map 78% reduce 0%
>> 08/06/20 11:08:13 INFO mapred.JobClient:  map 82% reduce 0%
>> 08/06/20 11:08:15 INFO mapred.JobClient:  map 91% reduce 1%
>> 08/06/20 11:08:16 INFO mapred.JobClient:  map 95% reduce 1%
>> 08/06/20 11:08:18 INFO mapred.JobClient:  map 99% reduce 3%
>> 08/06/20 11:08:23 INFO mapred.JobClient:  map 100% reduce 3%
>> 08/06/20 11:08:25 INFO mapred.JobClient:  map 100% reduce 7%
>> 08/06/20 11:08:28 INFO mapred.JobClient:  map 100% reduce 10%
>> 08/06/20 11:08:30 INFO mapred.JobClient:  map 100% reduce 11%
>> 08/06/20 11:08:33 INFO mapred.JobClient:  map 100% reduce 12%
>> 08/06/20 11:08:35 INFO mapred.JobClient:  map 100% reduce 14%
>> 08/06/20 11:08:38 INFO mapred.JobClient:  map 100% reduce 15%
>> 08/06/20 11:09:54 INFO mapred.JobClient:  map 100% reduce 13%
>> 08/06/20 11:09:54 INFO mapred.JobClient: Task Id :
>> task_200806201106_0001_r_000002_0, Status : FAILED
>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>> 08/06/20 11:09:56 INFO mapred.JobClient:  map 100% reduce 9%
>> 08/06/20 11:09:56 INFO mapred.JobClient: Task Id :
>> task_200806201106_0001_r_000003_0, Status : FAILED
>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>> 08/06/20 11:09:56 INFO mapred.JobClient: Task Id :
>> task_200806201106_0001_m_000011_0, Status : FAILED
>> Too many fetch-failures
>> 08/06/20 11:09:57 INFO mapred.JobClient:  map 95% reduce 9%
>> 08/06/20 11:09:59 INFO mapred.JobClient:  map 100% reduce 9%
>> 08/06/20 11:10:04 INFO mapred.JobClient:  map 100% reduce 10%
>> 08/06/20 11:10:07 INFO mapred.JobClient:  map 100% reduce 11%
>> 08/06/20 11:10:09 INFO mapred.JobClient:  map 100% reduce 13%
>> 08/06/20 11:10:12 INFO mapred.JobClient:  map 100% reduce 14%
>> 08/06/20 11:10:14 INFO mapred.JobClient:  map 100% reduce 15%
>> 08/06/20 11:10:17 INFO mapred.JobClient:  map 100% reduce 16%
>> 08/06/20 11:10:24 INFO mapred.JobClient:  map 100% reduce 13%
>> 08/06/20 11:10:24 INFO mapred.JobClient: Task Id :
>> task_200806201106_0001_r_000000_0, Status : FAILED
>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>> 08/06/20 11:10:29 INFO mapred.JobClient:  map 100% reduce 11%
>> 08/06/20 11:10:29 INFO mapred.JobClient: Task Id :
>> task_200806201106_0001_r_000001_0, Status : FAILED
>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>> 08/06/20 11:10:29 INFO mapred.JobClient: Task Id :
>> task_200806201106_0001_m_000003_0, Status : FAILED
>> Too many fetch-failures
>> 08/06/20 11:10:32 INFO mapred.JobClient:  map 100% reduce 12%
>> 08/06/20 11:10:37 INFO mapred.JobClient:  map 100% reduce 13%
>> 08/06/20 11:10:42 INFO mapred.JobClient:  map 100% reduce 14%
>> 08/06/20 11:10:47 INFO mapred.JobClient:  map 100% reduce 16%
>> 08/06/20 11:10:52 INFO mapred.JobClient:  map 95% reduce 16%
>> 08/06/20 11:10:52 INFO mapred.JobClient: Task Id :
>> task_200806201106_0001_m_000020_0, Status : FAILED
>> Too many fetch-failures
>> 08/06/20 11:10:54 INFO mapred.JobClient:  map 100% reduce 16%
>> 08/06/20 11:11:02 INFO mapred.JobClient: Task Id :
>> task_200806201106_0001_m_000017_0, Status : FAILED
>> Too many fetch-failures
>> 08/06/20 11:11:09 INFO mapred.JobClient:  map 100% reduce 17%
>> 08/06/20 11:11:24 INFO mapred.JobClient:  map 95% reduce 17%
>> 08/06/20 11:11:24 INFO mapred.JobClient: Task Id :
>> task_200806201106_0001_m_000007_0, Status : FAILED
>> Too many fetch-failures
>> 08/06/20 11:11:27 INFO mapred.JobClient:  map 100% reduce 17%
>> 08/06/20 11:11:32 INFO mapred.JobClient: Task Id :
>> task_200806201106_0001_m_000012_0, Status : FAILED
>> Too many fetch-failures
>> 08/06/20 11:11:34 INFO mapred.JobClient:  map 95% reduce 17%
>> 08/06/20 11:11:34 INFO mapred.JobClient: Task Id :
>> task_200806201106_0001_m_000019_0, Status : FAILED
>> Too many fetch-failures
>> 08/06/20 11:11:39 INFO mapred.JobClient:  map 91% reduce 18%
>> 08/06/20 11:11:39 INFO mapred.JobClient: Task Id :
>> task_200806201106_0001_m_000002_0, Status : FAILED
>> Too many fetch-failures
>> 08/06/20 11:11:41 INFO mapred.JobClient:  map 95% reduce 18%
>> 08/06/20 11:11:42 INFO mapred.JobClient:  map 100% reduce 19%
>> 08/06/20 11:11:42 INFO mapred.JobClient: Task Id :
>> task_200806201106_0001_m_000006_0, Status : FAILED
>> Too many fetch-failures
>> 08/06/20 11:11:44 INFO mapred.JobClient:  map 100% reduce 17%
>> 08/06/20 11:11:44 INFO mapred.JobClient: Task Id :
>> task_200806201106_0001_r_000003_1, Status : FAILED
>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>> 08/06/20 11:11:51 INFO mapred.JobClient:  map 100% reduce 18%
>> 08/06/20 11:11:54 INFO mapred.JobClient:  map 100% reduce 19%
>> 08/06/20 11:11:59 INFO mapred.JobClient:  map 95% reduce 19%
>> 08/06/20 11:11:59 INFO mapred.JobClient: Task Id :
>> task_200806201106_0001_m_000010_0, Status : FAILED
>> Too many fetch-failures
>> 08/06/20 11:12:02 INFO mapred.JobClient:  map 100% reduce 19%
>> 08/06/20 11:12:07 INFO mapred.JobClient:  map 100% reduce 20%
>> 08/06/20 11:12:08 INFO mapred.JobClient:  map 100% reduce 33%
>> 08/06/20 11:12:09 INFO mapred.JobClient:  map 100% reduce 47%
>> 08/06/20 11:12:11 INFO mapred.JobClient:  map 100% reduce 60%
>> 08/06/20 11:12:16 INFO mapred.JobClient:  map 100% reduce 62%
>> 08/06/20 11:12:24 INFO mapred.JobClient:  map 100% reduce 63%
>> 08/06/20 11:12:26 INFO mapred.JobClient:  map 100% reduce 64%
>> 08/06/20 11:12:31 INFO mapred.JobClient:  map 100% reduce 65%
>> 08/06/20 11:12:31 INFO mapred.JobClient: Task Id :
>> task_200806201106_0001_m_000019_1, Status : FAILED
>> Too many fetch-failures
>> 08/06/20 11:12:36 INFO mapred.JobClient:  map 100% reduce 66%
>> 08/06/20 11:12:38 INFO mapred.JobClient:  map 100% reduce 67%
>> 08/06/20 11:12:39 INFO mapred.JobClient:  map 100% reduce 80%
>>
>> ===============
>>
>>
>>
>>> Are you seeing this for all the maps and reducers?
>>>
>> Yes, this happens on all the maps and reducers. I tried to keep just 2
>> nodes in the cluster but still the problem exists.
>>
>>
>>
>>> Are the reducers progressing at all?
>>>
>>>
>> The reducers continue to execute up to a certain point, but after that they
>> just do not proceed at all. They just stop at an average of 16%.
>>
>>
>>> Are all the maps that the reducer is failing from a remote machine?
>>>
>> Yes.
>>
>>
>>
>>> Are all the failed maps/reducers from the same machine?
>>>
>> All the maps and reducers are failing anyway.
>> Thanks for the help in advance,
>>
>> Regards,
>> Sayali
>>
>>
>>
>
>

Re: Too many fetch failures AND Shuffle error

Posted by Amar Kamat <am...@yahoo-inc.com>.
Yeah. With 2 nodes the reducers will go up to 16% because they 
are able to fetch map outputs from the same machine (locally) but fail to copy 
them from the remote machine. A common reason in such cases is 
*restricted machine access* (a firewall etc.). The web server on a 
machine/node hosts the map outputs, which the reducers on the other machine 
are not able to access. There will be a URL associated with a map that 
the reducers try to fetch (check the reducer logs for this url). Just try 
accessing it manually from the reducer's machine/node. Most likely this 
experiment will also fail. Let us know if this is not the case.
Amar
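
As a rough sketch of that manual experiment, one can first check plain TCP reachability of the remote tasktracker's web server from the reducer's node with a few lines of Python; the host below is a placeholder and the port is tasktracker.http.port (commonly 50060 unless overridden in your conf).

import socket

tthost = "slave1"   # placeholder: remote tasktracker that hosts the map outputs
port = 50060        # tasktracker.http.port from your conf

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.settimeout(5)
try:
    s.connect((tthost, port))
    print("Can reach %s:%d -- the tasktracker web server is accessible" % (tthost, port))
except socket.error as err:
    # A refusal or timeout is consistent with the firewall / restricted
    # access scenario described above.
    print("Cannot reach %s:%d -- %s" % (tthost, port, err))
finally:
    s.close()

If this low-level check fails, the mapOutput URL fetch will certainly fail as well, and the fix lies in the network/firewall configuration rather than in Hadoop.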
Sayali Kulkarni wrote:
>> Can you post the reducer logs. How many nodes are there in the cluster? 
>>     
> There are 6 nodes in the cluster - 1 master and 5 slaves
>  I tried to reduce the number of nodes, and found that the problem is solved only if there is a single node in the cluster. So I can deduce that the problem lies somewhere in the configuration.
>
> Configuration file:
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <!-- Put site-specific property overrides in this file. -->
>
> <configuration>
>
> <property>
>   <name>hadoop.tmp.dir</name>
>   <value>/extra/HADOOP/hadoop-0.16.3/tmp/dir/hadoop-${user.name}</value>
>   <description>A base for other temporary directories.</description>
> </property>
>
> <property>
>   <name>fs.default.name</name>
>   <value>hdfs://10.105.41.25:54310</value>
>   <description>The name of the default file system.  A URI whose
>   scheme and authority determine the FileSystem implementation.  The
>   uri's scheme determines the config property (fs.SCHEME.impl) naming
>   the FileSystem implementation class.  The uri's authority is used to
>   determine the host, port, etc. for a filesystem.</description>
> </property>
>
> <property>
>   <name>mapred.job.tracker</name>
>   <value>10.105.41.25:54311</value>
>   <description>The host and port that the MapReduce job tracker runs
>   at.  If "local", then jobs are run in-process as a single map
>   and reduce task.
>   </description>
> </property>
>
> <property>
>   <name>dfs.replication</name>
>   <value>2</value>
>   <description>Default block replication.
>   The actual number of replications can be specified when the file is created.
>   The default is used if replication is not specified in create time.
>   </description>
> </property>
>
>
> <property>
>   <name>mapred.child.java.opts</name>
>   <value>-Xmx1048M</value>
> </property>
>
> <property>
>         <name>mapred.local.dir</name>
>         <value>/extra/HADOOP/hadoop-0.16.3/tmp/mapred</value>
> </property>
>
> <property>
>   <name>mapred.map.tasks</name>
>   <value>53</value>
>   <description>The default number of map tasks per job.  Typically set
>   to a prime several times greater than number of available hosts.
>   Ignored when mapred.job.tracker is "local".
>   </description>
> </property>
>
> <property>
>   <name>mapred.reduce.tasks</name>
>   <value>7</value>
>   <description>The default number of reduce tasks per job.  Typically set
>   to a prime close to the number of available hosts.  Ignored when
>   mapred.job.tracker is "local".
>   </description>
> </property>
>
> </configuration>
>
>
> ============
> This is the output that I get when running the tasks with 2 nodes in the cluster:
>
> 08/06/20 11:07:45 INFO mapred.FileInputFormat: Total input paths to process : 1
> 08/06/20 11:07:45 INFO mapred.JobClient: Running job: job_200806201106_0001
> 08/06/20 11:07:46 INFO mapred.JobClient:  map 0% reduce 0%
> 08/06/20 11:07:53 INFO mapred.JobClient:  map 8% reduce 0%
> 08/06/20 11:07:55 INFO mapred.JobClient:  map 17% reduce 0%
> 08/06/20 11:07:57 INFO mapred.JobClient:  map 26% reduce 0%
> 08/06/20 11:08:00 INFO mapred.JobClient:  map 34% reduce 0%
> 08/06/20 11:08:01 INFO mapred.JobClient:  map 43% reduce 0%
> 08/06/20 11:08:04 INFO mapred.JobClient:  map 47% reduce 0%
> 08/06/20 11:08:05 INFO mapred.JobClient:  map 52% reduce 0%
> 08/06/20 11:08:08 INFO mapred.JobClient:  map 60% reduce 0%
> 08/06/20 11:08:09 INFO mapred.JobClient:  map 69% reduce 0%
> 08/06/20 11:08:10 INFO mapred.JobClient:  map 73% reduce 0%
> 08/06/20 11:08:12 INFO mapred.JobClient:  map 78% reduce 0%
> 08/06/20 11:08:13 INFO mapred.JobClient:  map 82% reduce 0%
> 08/06/20 11:08:15 INFO mapred.JobClient:  map 91% reduce 1%
> 08/06/20 11:08:16 INFO mapred.JobClient:  map 95% reduce 1%
> 08/06/20 11:08:18 INFO mapred.JobClient:  map 99% reduce 3%
> 08/06/20 11:08:23 INFO mapred.JobClient:  map 100% reduce 3%
> 08/06/20 11:08:25 INFO mapred.JobClient:  map 100% reduce 7%
> 08/06/20 11:08:28 INFO mapred.JobClient:  map 100% reduce 10%
> 08/06/20 11:08:30 INFO mapred.JobClient:  map 100% reduce 11%
> 08/06/20 11:08:33 INFO mapred.JobClient:  map 100% reduce 12%
> 08/06/20 11:08:35 INFO mapred.JobClient:  map 100% reduce 14%
> 08/06/20 11:08:38 INFO mapred.JobClient:  map 100% reduce 15%
> 08/06/20 11:09:54 INFO mapred.JobClient:  map 100% reduce 13%
> 08/06/20 11:09:54 INFO mapred.JobClient: Task Id : task_200806201106_0001_r_000002_0, Status : FAILED
> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
> 08/06/20 11:09:56 INFO mapred.JobClient:  map 100% reduce 9%
> 08/06/20 11:09:56 INFO mapred.JobClient: Task Id : task_200806201106_0001_r_000003_0, Status : FAILED
> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
> 08/06/20 11:09:56 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000011_0, Status : FAILED
> Too many fetch-failures
> 08/06/20 11:09:57 INFO mapred.JobClient:  map 95% reduce 9%
> 08/06/20 11:09:59 INFO mapred.JobClient:  map 100% reduce 9%
> 08/06/20 11:10:04 INFO mapred.JobClient:  map 100% reduce 10%
> 08/06/20 11:10:07 INFO mapred.JobClient:  map 100% reduce 11%
> 08/06/20 11:10:09 INFO mapred.JobClient:  map 100% reduce 13%
> 08/06/20 11:10:12 INFO mapred.JobClient:  map 100% reduce 14%
> 08/06/20 11:10:14 INFO mapred.JobClient:  map 100% reduce 15%
> 08/06/20 11:10:17 INFO mapred.JobClient:  map 100% reduce 16%
> 08/06/20 11:10:24 INFO mapred.JobClient:  map 100% reduce 13%
> 08/06/20 11:10:24 INFO mapred.JobClient: Task Id : task_200806201106_0001_r_000000_0, Status : FAILED
> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
> 08/06/20 11:10:29 INFO mapred.JobClient:  map 100% reduce 11%
> 08/06/20 11:10:29 INFO mapred.JobClient: Task Id : task_200806201106_0001_r_000001_0, Status : FAILED
> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
> 08/06/20 11:10:29 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000003_0, Status : FAILED
> Too many fetch-failures
> 08/06/20 11:10:32 INFO mapred.JobClient:  map 100% reduce 12%
> 08/06/20 11:10:37 INFO mapred.JobClient:  map 100% reduce 13%
> 08/06/20 11:10:42 INFO mapred.JobClient:  map 100% reduce 14%
> 08/06/20 11:10:47 INFO mapred.JobClient:  map 100% reduce 16%
> 08/06/20 11:10:52 INFO mapred.JobClient:  map 95% reduce 16%
> 08/06/20 11:10:52 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000020_0, Status : FAILED
> Too many fetch-failures
> 08/06/20 11:10:54 INFO mapred.JobClient:  map 100% reduce 16%
> 08/06/20 11:11:02 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000017_0, Status : FAILED
> Too many fetch-failures
> 08/06/20 11:11:09 INFO mapred.JobClient:  map 100% reduce 17%
> 08/06/20 11:11:24 INFO mapred.JobClient:  map 95% reduce 17%
> 08/06/20 11:11:24 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000007_0, Status : FAILED
> Too many fetch-failures
> 08/06/20 11:11:27 INFO mapred.JobClient:  map 100% reduce 17%
> 08/06/20 11:11:32 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000012_0, Status : FAILED
> Too many fetch-failures
> 08/06/20 11:11:34 INFO mapred.JobClient:  map 95% reduce 17%
> 08/06/20 11:11:34 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000019_0, Status : FAILED
> Too many fetch-failures
> 08/06/20 11:11:39 INFO mapred.JobClient:  map 91% reduce 18%
> 08/06/20 11:11:39 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000002_0, Status : FAILED
> Too many fetch-failures
> 08/06/20 11:11:41 INFO mapred.JobClient:  map 95% reduce 18%
> 08/06/20 11:11:42 INFO mapred.JobClient:  map 100% reduce 19%
> 08/06/20 11:11:42 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000006_0, Status : FAILED
> Too many fetch-failures
> 08/06/20 11:11:44 INFO mapred.JobClient:  map 100% reduce 17%
> 08/06/20 11:11:44 INFO mapred.JobClient: Task Id : task_200806201106_0001_r_000003_1, Status : FAILED
> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
> 08/06/20 11:11:51 INFO mapred.JobClient:  map 100% reduce 18%
> 08/06/20 11:11:54 INFO mapred.JobClient:  map 100% reduce 19%
> 08/06/20 11:11:59 INFO mapred.JobClient:  map 95% reduce 19%
> 08/06/20 11:11:59 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000010_0, Status : FAILED
> Too many fetch-failures
> 08/06/20 11:12:02 INFO mapred.JobClient:  map 100% reduce 19%
> 08/06/20 11:12:07 INFO mapred.JobClient:  map 100% reduce 20%
> 08/06/20 11:12:08 INFO mapred.JobClient:  map 100% reduce 33%
> 08/06/20 11:12:09 INFO mapred.JobClient:  map 100% reduce 47%
> 08/06/20 11:12:11 INFO mapred.JobClient:  map 100% reduce 60%
> 08/06/20 11:12:16 INFO mapred.JobClient:  map 100% reduce 62%
> 08/06/20 11:12:24 INFO mapred.JobClient:  map 100% reduce 63%
> 08/06/20 11:12:26 INFO mapred.JobClient:  map 100% reduce 64%
> 08/06/20 11:12:31 INFO mapred.JobClient:  map 100% reduce 65%
> 08/06/20 11:12:31 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000019_1, Status : FAILED
> Too many fetch-failures
> 08/06/20 11:12:36 INFO mapred.JobClient:  map 100% reduce 66%
> 08/06/20 11:12:38 INFO mapred.JobClient:  map 100% reduce 67%
> 08/06/20 11:12:39 INFO mapred.JobClient:  map 100% reduce 80%
>
> ===============
>
>   
>> Are you seeing this for all the maps and reducers? 
>>     
> Yes, this happens on all the maps and reducers. I tried to keep just 2 nodes in the cluster but still the problem exists.
>
>   
>> Are the reducers progressing at all?
>>     
> The reducers continue to execute up to a certain point, but after that they just do not proceed at all. They just stop at an average of 16%.
>
>   
>> Are all the maps that the reducer is failing from a remote machine? 
>>     
> Yes.
>
>   
>> Are all the failed maps/reducers from the same machine? 
>>     
> All the maps and reducers are failing anyway.
>
> Thanks for the help in advance,
>
> Regards,
> Sayali
>
>   


Re: Too many fetch failures AND Shuffle error

Posted by Sayali Kulkarni <sa...@yahoo.co.in>.
> Can you post the reducer logs. How many nodes are there in the cluster? 
There are 6 nodes in the cluster - 1 master and 5 slaves
 I tried to reduce the number of nodes, and found that the problem is solved only if there is a single node in the cluster. So I can deduce that the problem lies somewhere in the configuration.

Configuration file:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
  <name>hadoop.tmp.dir</name>
  <value>/extra/HADOOP/hadoop-0.16.3/tmp/dir/hadoop-${user.name}</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://10.105.41.25:54310</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>10.105.41.25:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>

<property>
  <name>dfs.replication</name>
  <value>2</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>


<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1048M</value>
</property>

<property>
        <name>mapred.local.dir</name>
        <value>/extra/HADOOP/hadoop-0.16.3/tmp/mapred</value>
</property>

<property>
  <name>mapred.map.tasks</name>
  <value>53</value>
  <description>The default number of map tasks per job.  Typically set
  to a prime several times greater than number of available hosts.
  Ignored when mapred.job.tracker is "local".
  </description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>7</value>
  <description>The default number of reduce tasks per job.  Typically set
  to a prime close to the number of available hosts.  Ignored when
  mapred.job.tracker is "local".
  </description>
</property>

</configuration>


============
This is the output that I get when running the tasks with 2 nodes in the cluster:

08/06/20 11:07:45 INFO mapred.FileInputFormat: Total input paths to process : 1
08/06/20 11:07:45 INFO mapred.JobClient: Running job: job_200806201106_0001
08/06/20 11:07:46 INFO mapred.JobClient:  map 0% reduce 0%
08/06/20 11:07:53 INFO mapred.JobClient:  map 8% reduce 0%
08/06/20 11:07:55 INFO mapred.JobClient:  map 17% reduce 0%
08/06/20 11:07:57 INFO mapred.JobClient:  map 26% reduce 0%
08/06/20 11:08:00 INFO mapred.JobClient:  map 34% reduce 0%
08/06/20 11:08:01 INFO mapred.JobClient:  map 43% reduce 0%
08/06/20 11:08:04 INFO mapred.JobClient:  map 47% reduce 0%
08/06/20 11:08:05 INFO mapred.JobClient:  map 52% reduce 0%
08/06/20 11:08:08 INFO mapred.JobClient:  map 60% reduce 0%
08/06/20 11:08:09 INFO mapred.JobClient:  map 69% reduce 0%
08/06/20 11:08:10 INFO mapred.JobClient:  map 73% reduce 0%
08/06/20 11:08:12 INFO mapred.JobClient:  map 78% reduce 0%
08/06/20 11:08:13 INFO mapred.JobClient:  map 82% reduce 0%
08/06/20 11:08:15 INFO mapred.JobClient:  map 91% reduce 1%
08/06/20 11:08:16 INFO mapred.JobClient:  map 95% reduce 1%
08/06/20 11:08:18 INFO mapred.JobClient:  map 99% reduce 3%
08/06/20 11:08:23 INFO mapred.JobClient:  map 100% reduce 3%
08/06/20 11:08:25 INFO mapred.JobClient:  map 100% reduce 7%
08/06/20 11:08:28 INFO mapred.JobClient:  map 100% reduce 10%
08/06/20 11:08:30 INFO mapred.JobClient:  map 100% reduce 11%
08/06/20 11:08:33 INFO mapred.JobClient:  map 100% reduce 12%
08/06/20 11:08:35 INFO mapred.JobClient:  map 100% reduce 14%
08/06/20 11:08:38 INFO mapred.JobClient:  map 100% reduce 15%
08/06/20 11:09:54 INFO mapred.JobClient:  map 100% reduce 13%
08/06/20 11:09:54 INFO mapred.JobClient: Task Id : task_200806201106_0001_r_000002_0, Status : FAILED
Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
08/06/20 11:09:56 INFO mapred.JobClient:  map 100% reduce 9%
08/06/20 11:09:56 INFO mapred.JobClient: Task Id : task_200806201106_0001_r_000003_0, Status : FAILED
Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
08/06/20 11:09:56 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000011_0, Status : FAILED
Too many fetch-failures
08/06/20 11:09:57 INFO mapred.JobClient:  map 95% reduce 9%
08/06/20 11:09:59 INFO mapred.JobClient:  map 100% reduce 9%
08/06/20 11:10:04 INFO mapred.JobClient:  map 100% reduce 10%
08/06/20 11:10:07 INFO mapred.JobClient:  map 100% reduce 11%
08/06/20 11:10:09 INFO mapred.JobClient:  map 100% reduce 13%
08/06/20 11:10:12 INFO mapred.JobClient:  map 100% reduce 14%
08/06/20 11:10:14 INFO mapred.JobClient:  map 100% reduce 15%
08/06/20 11:10:17 INFO mapred.JobClient:  map 100% reduce 16%
08/06/20 11:10:24 INFO mapred.JobClient:  map 100% reduce 13%
08/06/20 11:10:24 INFO mapred.JobClient: Task Id : task_200806201106_0001_r_000000_0, Status : FAILED
Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
08/06/20 11:10:29 INFO mapred.JobClient:  map 100% reduce 11%
08/06/20 11:10:29 INFO mapred.JobClient: Task Id : task_200806201106_0001_r_000001_0, Status : FAILED
Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
08/06/20 11:10:29 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000003_0, Status : FAILED
Too many fetch-failures
08/06/20 11:10:32 INFO mapred.JobClient:  map 100% reduce 12%
08/06/20 11:10:37 INFO mapred.JobClient:  map 100% reduce 13%
08/06/20 11:10:42 INFO mapred.JobClient:  map 100% reduce 14%
08/06/20 11:10:47 INFO mapred.JobClient:  map 100% reduce 16%
08/06/20 11:10:52 INFO mapred.JobClient:  map 95% reduce 16%
08/06/20 11:10:52 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000020_0, Status : FAILED
Too many fetch-failures
08/06/20 11:10:54 INFO mapred.JobClient:  map 100% reduce 16%
08/06/20 11:11:02 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000017_0, Status : FAILED
Too many fetch-failures
08/06/20 11:11:09 INFO mapred.JobClient:  map 100% reduce 17%
08/06/20 11:11:24 INFO mapred.JobClient:  map 95% reduce 17%
08/06/20 11:11:24 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000007_0, Status : FAILED
Too many fetch-failures
08/06/20 11:11:27 INFO mapred.JobClient:  map 100% reduce 17%
08/06/20 11:11:32 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000012_0, Status : FAILED
Too many fetch-failures
08/06/20 11:11:34 INFO mapred.JobClient:  map 95% reduce 17%
08/06/20 11:11:34 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000019_0, Status : FAILED
Too many fetch-failures
08/06/20 11:11:39 INFO mapred.JobClient:  map 91% reduce 18%
08/06/20 11:11:39 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000002_0, Status : FAILED
Too many fetch-failures
08/06/20 11:11:41 INFO mapred.JobClient:  map 95% reduce 18%
08/06/20 11:11:42 INFO mapred.JobClient:  map 100% reduce 19%
08/06/20 11:11:42 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000006_0, Status : FAILED
Too many fetch-failures
08/06/20 11:11:44 INFO mapred.JobClient:  map 100% reduce 17%
08/06/20 11:11:44 INFO mapred.JobClient: Task Id : task_200806201106_0001_r_000003_1, Status : FAILED
Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
08/06/20 11:11:51 INFO mapred.JobClient:  map 100% reduce 18%
08/06/20 11:11:54 INFO mapred.JobClient:  map 100% reduce 19%
08/06/20 11:11:59 INFO mapred.JobClient:  map 95% reduce 19%
08/06/20 11:11:59 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000010_0, Status : FAILED
Too many fetch-failures
08/06/20 11:12:02 INFO mapred.JobClient:  map 100% reduce 19%
08/06/20 11:12:07 INFO mapred.JobClient:  map 100% reduce 20%
08/06/20 11:12:08 INFO mapred.JobClient:  map 100% reduce 33%
08/06/20 11:12:09 INFO mapred.JobClient:  map 100% reduce 47%
08/06/20 11:12:11 INFO mapred.JobClient:  map 100% reduce 60%
08/06/20 11:12:16 INFO mapred.JobClient:  map 100% reduce 62%
08/06/20 11:12:24 INFO mapred.JobClient:  map 100% reduce 63%
08/06/20 11:12:26 INFO mapred.JobClient:  map 100% reduce 64%
08/06/20 11:12:31 INFO mapred.JobClient:  map 100% reduce 65%
08/06/20 11:12:31 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000019_1, Status : FAILED
Too many fetch-failures
08/06/20 11:12:36 INFO mapred.JobClient:  map 100% reduce 66%
08/06/20 11:12:38 INFO mapred.JobClient:  map 100% reduce 67%
08/06/20 11:12:39 INFO mapred.JobClient:  map 100% reduce 80%

===============

> Are you seeing this for all the maps and reducers? 
Yes, this happens on all the maps and reducers. I tried to keep just 2 nodes in the cluster but still the problem exists.

> Are the reducers progressing at all?
The reducers continue to execute up to a certain point, but after that they just do not proceed at all. They just stop at an average of 16%.

> Are all the maps that the reducer is failing from a remote machine? 
Yes.

> Are all the failed maps/reducers from the same machine? 
All the maps and reducers are failing anyway.

Thanks for the help in advance,

Regards,
Sayali

       