Posted to common-user@hadoop.apache.org by "Ellis H. Wilson III" <el...@cse.psu.edu> on 2012/06/19 16:27:44 UTC
Error: Too Many Fetch Failures
Hi all,
This is my first email to the list, so feel free to be candid in your
complaints if I'm doing something canonically uncouth in my requests for
assistance.
I'm using Hadoop 0.23 on 50 machines, each connected with gigabit
ethernet and each having solely a single hard disk. I am getting the
following error repeatably for the TeraSort benchmark. TeraGen runs
without error, but TeraSort runs predictably until this error pops up
between 64% and 70% completion. This doesn't occur for every execution
of the benchmark, as about one out of four times that I run the
benchmark it does run to completion (TeraValidate included).
Error at the CLI:
"12/06/10 11:17:50 INFO mapreduce.Job: map 100% reduce 64%
12/06/10 11:20:45 INFO mapreduce.Job: Task Id :
attempt_1339331790635_0002_m_004337_0, Status : FAILED
Container killed by the ApplicationMaster.
Too Many fetch failures.Failing the attempt
12/06/10 11:21:45 WARN mapreduce.Job: Error reading task output Read
timed out
12/06/10 11:23:06 WARN mapreduce.Job: Error reading task output Read
timed out
12/06/10 11:23:07 INFO mapreduce.Job: Task Id :
attempt_1339331790635_0002_m_004613_0, Status : FAILED"
I am still warming up to YARN, so I'm not yet deft at gathering all the
logfiles I need, but from closer inspection of the logs I could find,
and of the machines themselves, it appears this is related to a large
number of sockets being open concurrently, which at some point prevents
further connections being made from the requesting Reduce to the Map
which has the data desired, leading the Reducer to believe there is
some error in getting that data. These errors continue to be spewed
about once every 3 minutes for roughly 45 minutes until at last the job
dies completely.
I have attached my -site.xml files so that a better idea of my
configuration is evident, and any and all suggestions or queries for
more info are welcome. Things I have tried already, per the document I
found at
http://www.slideshare.net/cloudera/hadoop-troubleshooting-101-kate-ting-cloudera:
mapred.reduce.slowstart.completed.maps = 0.80 (seems to help, but it
hurts performance as I'm the only person running on the cluster, and it
doesn't cure the problem -- just increases chance of completion from 1/4
to 1/3 at best)
tasktracker.http.threads = 80 (default is 40 I think, and I've tried
this and even much higher values to no avail)
Best, and Thanks in Advance,
ellis
Re: Error: Too Many Fetch Failures
Posted by "Ellis H. Wilson III" <el...@cse.psu.edu>.
On 06/19/12 23:10, Ellis H. Wilson III wrote:
> On 06/19/12 20:42, Raj Vishwanathan wrote:
>> You probably have a very low somaxconn parameter (the default on
>> CentOS is 128, if I remember correctly). You can check the value
>> under /proc/sys/net/core/somaxconn
>
> Aha! Excellent, it does seem it's at the default, and that particular
> sysctl item had slipped my notice:
> [ellis@pool100 ~]$ cat /proc/sys/net/core/somaxconn
> 128
>
>> Can you also check the value of ulimit -n? It could be low.
>
> I did look for and alter this already, but it is set fairly high from
> what I can tell:
> [ellis@pool100 ~]$ ulimit -n
> 16384
>
> I altered both of these in /etc/sysctl.conf and have forced them to be
> re-read with `sysctl -p` on all nodes. I will report back if this fixes
> the issues tomorrow.
To anyone who runs into this problem in the future, I found that
increasing the somaxconn parameter fixed the fetch failures issue
completely (from 3 tests run so far on largish datasets). This should
be particularly useful for others who are dealing with an extremely high
TaskTracker to DataNode ratio (10:1 in my case).
Thanks again to Raj for this solution, and others for their suggestions.
Best,
ellis
Re: Error: Too Many Fetch Failures
Posted by "Ellis H. Wilson III" <el...@cse.psu.edu>.
On 06/19/12 20:42, Raj Vishwanathan wrote:
> You probably have a very low somaxconn parameter (the default on CentOS is 128, if I remember correctly). You can check the value under /proc/sys/net/core/somaxconn
Aha! Excellent, it does seem it's at the default, and that particular
sysctl item had slipped my notice:
[ellis@pool100 ~]$ cat /proc/sys/net/core/somaxconn
128
> Can you also check the value of ulimit -n? It could be low.
I did look for and alter this already, but it is set fairly high from
what I can tell:
[ellis@pool100 ~]$ ulimit -n
16384
I altered both of these in /etc/sysctl.conf and have forced them to be
re-read with `sysctl -p` on all nodes. I will report back if this fixes
the issues tomorrow.
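Concretely, the change amounts to something like this (1024 is an
illustrative backlog value; note that `ulimit -n` is normally raised
per-user via /etc/security/limits.conf rather than sysctl.conf):

```shell
# 1. Check the running listen-backlog limit (CentOS default is 128):
cat /proc/sys/net/core/somaxconn

# 2. As root, persist a larger backlog on every node and re-read it:
#      echo 'net.core.somaxconn = 1024' >> /etc/sysctl.conf
#      sysctl -p
#    (1024 is illustrative; size it to your reducer fan-in)

# 3. The open-file limit is a separate knob; check it with:
ulimit -n
#    and raise it, if needed, in /etc/security/limits.conf, e.g.:
#      hadoop  -  nofile  16384
```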
Thanks again to all!
ellis
Re: Error: Too Many Fetch Failures
Posted by Raj Vishwanathan <ra...@yahoo.com>.
You probably have a very low somaxconn parameter (the default on CentOS is 128, if I remember correctly). You can check the value under /proc/sys/net/core/somaxconn
Can you also check the value of ulimit -n? It could be low.
Raj
>________________________________
> From: Ellis H. Wilson III <el...@cse.psu.edu>
>To: common-user@hadoop.apache.org
>Sent: Tuesday, June 19, 2012 12:32 PM
>Subject: Re: Error: Too Many Fetch Failures
>
>On 06/19/12 13:38, Vinod Kumar Vavilapalli wrote:
>>
>> Replies/more questions inline.
>>
>>
>>> I'm using Hadoop 0.23 on 50 machines, each connected with gigabit ethernet and each having solely a single hard disk. I am getting the following error repeatably for the TeraSort benchmark. TeraGen runs without error, but TeraSort runs predictably until this error pops up between 64% and 70% completion. This doesn't occur for every execution of the benchmark, as about one out of four times that I run the benchmark it does run to completion (TeraValidate included).
>>
>>
>> How many containers are you running per node?
>
>Per my attached config files, I specify that yarn.nodemanager.resource.memory-mb = 3072, and the default /seems/ to be set at 1024MB for maps and reducers, so I have 3 containers running per node. I have verified that this indeed is the case in the web client. Three of these 1GB "slots" in the cluster appear to be occupied by something else during the execution of TeraSort, so I specify that TeraGen create .5TB using 441 maps (3waves * (50nodes * 3containerslots - 3occupiedslots)), and TeraSort to use 147 reducers. This seems to give me the guarantees I had with Hadoop 1.0 that each node gets an equal number of reducers, and my job doesn't drag on due to straggler reducers.
>
>> Clearly maps are getting killed because of fetch failures. Can you look at the logs of the NodeManager where this particular map task ran? They may contain clues about why reducers are not able to fetch map outputs. It is possible that, because you have only one disk per node, some of these nodes have bad or non-functional disks, thereby causing fetch failures.
>
>I will rerun and report the exact error messages from the NodeManagers. Can you give me more exacting advice on collecting logs of this sort, for as I mentioned I'm new to doing so with the new version of Hadoop? I have been looking in /tmp/logs and hadoop/logs, but perhaps there is somewhere else to look as well?
>
>Last, I am certain this is not related to failing disks, as this exact error occurs at much higher frequencies when I run Hadoop on a NAS box, which is the core of my research at the moment. Nevertheless, I posted to this list instead of Dev as this was on vanilla CentOS-5.5 machines using just the HDDs within each, and therefore should be a highly typical setup. In particular, I see these errors coming from numerous nodes all at once, and the subset of nodes giving the problems are not repeatable from one run to the next, though the resulting error is.
>
>> If that is the case, you can either take these nodes offline or bump up mapreduce.reduce.shuffle.maxfetchfailures to tolerate the failures; the default is 10. There are some other tweaks I can suggest if you can find more details in your logs.
>
>I'd prefer to not bump up maxfetchfailures, and would rather simply fix the issue that is causing the fetch to fail in the beginning. This isn't a large cluster, having only 50 nodes, nor are the links (1gig) or storage capabilities (1 sata drive) great or strange relative to any normal installation. I have to assume here that I've mis-configured something :(.
>
>Best,
>
>ellis
>
>
>
Re: Error: Too Many Fetch Failures
Posted by "Ellis H. Wilson III" <el...@cse.psu.edu>.
On 06/19/12 13:38, Vinod Kumar Vavilapalli wrote:
>
> Replies/more questions inline.
>
>
>> I'm using Hadoop 0.23 on 50 machines, each connected with gigabit ethernet and each having solely a single hard disk. I am getting the following error repeatably for the TeraSort benchmark. TeraGen runs without error, but TeraSort runs predictably until this error pops up between 64% and 70% completion. This doesn't occur for every execution of the benchmark, as about one out of four times that I run the benchmark it does run to completion (TeraValidate included).
>
>
> How many containers are you running per node?
Per my attached config files, I specify
yarn.nodemanager.resource.memory-mb = 3072, and the default /seems/ to
be 1024 MB for maps and reducers, so I have 3 containers running per
node. I have verified in the web client that this is indeed the case.
Three of these 1 GB "slots" in the cluster appear to be occupied by
something else during the execution of TeraSort, so I have TeraGen
create 0.5 TB using 441 maps (3 waves * (50 nodes * 3 container slots
- 3 occupied slots)), and TeraSort use 147 reducers. This seems to
give me the guarantees I had with Hadoop 1.0 that each node gets an
equal number of reducers, and my job doesn't drag on due to straggler
reducers.
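To spell out the slot arithmetic (numbers as above; the 1024 MB
per-container figure is simply what I observe as the default):

```shell
# Sanity-check the task counts used above.
nodes=50
slots_per_node=3   # 3072 MB per node / 1024 MB per container
occupied=3         # slots observed busy during the run
waves=3

usable=$(( nodes * slots_per_node - occupied ))
maps=$(( waves * usable ))

echo "usable container slots: $usable"   # 147
echo "map tasks over 3 waves: $maps"     # 441
echo "reducers (one wave):    $usable"   # 147
```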
> Clearly maps are getting killed because of fetch failures. Can you look at the logs of the NodeManager where this particular map task ran? They may contain clues about why reducers are not able to fetch map outputs. It is possible that, because you have only one disk per node, some of these nodes have bad or non-functional disks, thereby causing fetch failures.
I will rerun and report the exact error messages from the NodeManagers.
Can you give me more precise advice on collecting logs of this sort?
As I mentioned, I'm new to doing so with the new version of Hadoop.
I have been looking in /tmp/logs and hadoop/logs, but perhaps there is
somewhere else to look as well?
Last, I am certain this is not related to failing disks, as this exact
error occurs at much higher frequencies when I run Hadoop on a NAS box,
which is the core of my research at the moment. Nevertheless, I posted
to this list instead of Dev as this was on vanilla CentOS-5.5 machines
using just the HDDs within each, and therefore should be a highly
typical setup. In particular, I see these errors coming from numerous
nodes all at once, and the subset of nodes giving the problems are not
repeatable from one run to the next, though the resulting error is.
> If that is the case, you can either take these nodes offline or bump up mapreduce.reduce.shuffle.maxfetchfailures to tolerate the failures; the default is 10. There are some other tweaks I can suggest if you can find more details in your logs.
I'd prefer not to bump up maxfetchfailures, and would rather fix
whatever is causing the fetches to fail in the first place. This isn't
a large cluster, at only 50 nodes, nor are the links (1 gig) or the
storage (1 SATA drive per node) unusual relative to any normal
installation. I have to assume here that I've mis-configured
something :(.
Best,
ellis
Re: Error: Too Many Fetch Failures
Posted by Vinod Kumar Vavilapalli <vi...@hortonworks.com>.
Replies/more questions inline.
> I'm using Hadoop 0.23 on 50 machines, each connected with gigabit ethernet and each having solely a single hard disk. I am getting the following error repeatably for the TeraSort benchmark. TeraGen runs without error, but TeraSort runs predictably until this error pops up between 64% and 70% completion. This doesn't occur for every execution of the benchmark, as about one out of four times that I run the benchmark it does run to completion (TeraValidate included).
How many containers are you running per node?
> Error at the CLI:
> "12/06/10 11:17:50 INFO mapreduce.Job: map 100% reduce 64%
> 12/06/10 11:20:45 INFO mapreduce.Job: Task Id : attempt_1339331790635_0002_m_004337_0, Status : FAILED
> Container killed by the ApplicationMaster.
>
> Too Many fetch failures.Failing the attempt
Clearly maps are getting killed because of fetch failures. Can you look at the logs of the NodeManager where this particular map task ran? They may contain clues about why reducers are not able to fetch map outputs. It is possible that, because you have only one disk per node, some of these nodes have bad or non-functional disks, thereby causing fetch failures.
If that is the case, you can either take these nodes offline or bump up mapreduce.reduce.shuffle.maxfetchfailures to tolerate the failures; the default is 10. There are some other tweaks I can suggest if you can find more details in your logs.
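For reference, that property would go in mapred-site.xml roughly as
follows (the value 30 is purely illustrative):

```xml
<!-- Tolerate more fetch failures per map before declaring its
     output lost and re-running the map (default is 10). -->
<property>
  <name>mapreduce.reduce.shuffle.maxfetchfailures</name>
  <value>30</value>
</property>
```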
HTH,
+Vinod
Re: Error: Too Many Fetch Failures
Posted by "Ellis H. Wilson III" <el...@cse.psu.edu>.
On 06/19/12 14:11, Minh Duc Nguyen wrote:
> Take a look at slide 25:
> http://www.slideshare.net/cloudera/hadoop-troubleshooting-101-kate-ting-cloudera
>
> It describes a similar error so hopefully this will help you.
I appreciate your prompt response, Minh, but as you will notice at the
end of my original email, I mentioned that I had previously seen this
slide and tried two of those solutions, to no avail. I should also note
that I added /etc/hosts entries to each of my nodes so that, if it were
a DNS issue, that would handle it. The only other proposed solution
suggested upgrading Jetty, but I wasn't sure (sorry for the naiveté)
how one could tell which version of Jetty is in use. Any ideas? Or is
this no longer an issue with Hadoop 2.0?
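For what it's worth, one way I can think of to check is to look for the
bundled Jetty jar, since the version is embedded in the jar filename
(the default path below is only an assumption about the install layout):

```shell
# List the Jetty jars shipped with Hadoop; the version is in the
# filename (e.g. jetty-6.1.26.jar). Adjust HADOOP_HOME for your layout.
HADOOP_HOME="${HADOOP_HOME:-/usr/lib/hadoop}"
if [ -d "$HADOOP_HOME" ]; then
  find "$HADOOP_HOME" -name 'jetty*.jar'
else
  echo "no such directory: $HADOOP_HOME (set HADOOP_HOME first)"
fi
```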
Best,
ellis
> On Tue, Jun 19, 2012 at 10:27 AM, Ellis H. Wilson III<el...@cse.psu.edu> wrote:
>> Hi all,
>>
>> This is my first email to the list, so feel free to be candid in your
>> complaints if I'm doing something canonically uncouth in my requests for
>> assistance.
>>
>> I'm using Hadoop 0.23 on 50 machines, each connected with gigabit ethernet
>> and each having solely a single hard disk. I am getting the following error
>> repeatably for the TeraSort benchmark. TeraGen runs without error, but
>> TeraSort runs predictably until this error pops up between 64% and 70%
>> completion. This doesn't occur for every execution of the benchmark, as
>> about one out of four times that I run the benchmark it does run to
>> completion (TeraValidate included).
>>
>> Error at the CLI:
>> "12/06/10 11:17:50 INFO mapreduce.Job: map 100% reduce 64%
>> 12/06/10 11:20:45 INFO mapreduce.Job: Task Id :
>> attempt_1339331790635_0002_m_004337_0, Status : FAILED
>> Container killed by the ApplicationMaster.
>>
>> Too Many fetch failures.Failing the attempt
>> 12/06/10 11:21:45 WARN mapreduce.Job: Error reading task output Read timed
>> out
>> 12/06/10 11:23:06 WARN mapreduce.Job: Error reading task output Read timed
>> out
>> 12/06/10 11:23:07 INFO mapreduce.Job: Task Id :
>> attempt_1339331790635_0002_m_004613_0, Status : FAILED"
>>
>> I am still warming up to Yarn, so am not deft yet at getting all the
>> logfiles I need, but under more careful inspection of the logs I could find
>> and the machines themselves it seems like this is related to many numbers of
>> sockets being up concurrently, which at some point prevents further
>> connections being made from the requesting Reduce to the Map which has the
>> data desired, leading the Reducer to believe there is some error in getting
>> that data. These errors continue to be spewed once about every 3 minutes
>> for about 45 minutes until at last the job dies completely.
>>
>> I have attached my -site.xml files so that a better idea of my configuration
>> is evident, and any and all suggestions or queries for more info are
>> welcome. Things I have tried already, per the document I found at
>> http://www.slideshare.net/cloudera/hadoop-troubleshooting-101-kate-ting-cloudera:
>>
>> mapred.reduce.slowstart.completed.maps = 0.80 (seems to help, but it hurts
>> performance as I'm the only person running on the cluster, and it doesn't
>> cure the problem -- just increases chance of completion from 1/4 to 1/3 at
>> best)
>>
>> tasktracker.http.threads = 80 (default is 40 I think, and I've tried this
>> and even much higher values to no avail)
>>
>> Best, and Thanks in Advance,
>>
>> ellis
>
Re: Error: Too Many Fetch Failures
Posted by Minh Duc Nguyen <md...@gmail.com>.
Take a look at slide 25:
http://www.slideshare.net/cloudera/hadoop-troubleshooting-101-kate-ting-cloudera
It describes a similar error so hopefully this will help you.
~ Minh
On Tue, Jun 19, 2012 at 10:27 AM, Ellis H. Wilson III <el...@cse.psu.edu> wrote:
> Hi all,
>
> This is my first email to the list, so feel free to be candid in your
> complaints if I'm doing something canonically uncouth in my requests for
> assistance.
>
> I'm using Hadoop 0.23 on 50 machines, each connected with gigabit ethernet
> and each having solely a single hard disk. I am getting the following error
> repeatably for the TeraSort benchmark. TeraGen runs without error, but
> TeraSort runs predictably until this error pops up between 64% and 70%
> completion. This doesn't occur for every execution of the benchmark, as
> about one out of four times that I run the benchmark it does run to
> completion (TeraValidate included).
>
> Error at the CLI:
> "12/06/10 11:17:50 INFO mapreduce.Job: map 100% reduce 64%
> 12/06/10 11:20:45 INFO mapreduce.Job: Task Id :
> attempt_1339331790635_0002_m_004337_0, Status : FAILED
> Container killed by the ApplicationMaster.
>
> Too Many fetch failures.Failing the attempt
> 12/06/10 11:21:45 WARN mapreduce.Job: Error reading task output Read timed
> out
> 12/06/10 11:23:06 WARN mapreduce.Job: Error reading task output Read timed
> out
> 12/06/10 11:23:07 INFO mapreduce.Job: Task Id :
> attempt_1339331790635_0002_m_004613_0, Status : FAILED"
>
> I am still warming up to Yarn, so am not deft yet at getting all the
> logfiles I need, but under more careful inspection of the logs I could find
> and the machines themselves it seems like this is related to many numbers of
> sockets being up concurrently, which at some point prevents further
> connections being made from the requesting Reduce to the Map which has the
> data desired, leading the Reducer to believe there is some error in getting
> that data. These errors continue to be spewed once about every 3 minutes
> for about 45 minutes until at last the job dies completely.
>
> I have attached my -site.xml files so that a better idea of my configuration
> is evident, and any and all suggestions or queries for more info are
> welcome. Things I have tried already, per the document I found at
> http://www.slideshare.net/cloudera/hadoop-troubleshooting-101-kate-ting-cloudera:
>
> mapred.reduce.slowstart.completed.maps = 0.80 (seems to help, but it hurts
> performance as I'm the only person running on the cluster, and it doesn't
> cure the problem -- just increases chance of completion from 1/4 to 1/3 at
> best)
>
> tasktracker.http.threads = 80 (default is 40 I think, and I've tried this
> and even much higher values to no avail)
>
> Best, and Thanks in Advance,
>
> ellis