Posted to common-user@hadoop.apache.org by David Rosenstrauch <da...@darose.net> on 2011/04/01 00:04:49 UTC

Re: What does "Too many fetch-failures" mean? How do I debug it?

On 03/31/2011 05:13 PM, W.P. McNeill wrote:
> I'm running a big job on my cluster and a handful of attempts are failing
> with a "Too many fetch-failures" error message. They're all on the same
> node, but that node doesn't appear to be down. Subsequent attempts succeed,
> so this looks like a transient stress issue rather than a problem with my
> code. I'm guessing it's something like HDFS not being able to keep up, but
> I'm not sure, and Googling only turns up people just as confused as I am.
>
> What does this error mean and how do I dig into it more?
>
> Thanks.

We've seen that happen in a number of situations, and it's a bit tricky 
to debug.

In the general sense it means that a reduce task wasn't able to fetch 
map output from the node that ran the map task - i.e., a network 
problem prevented the one machine from communicating with the other and 
fetching the data. (Despite the guess above, it's a shuffle-phase 
problem, not HDFS.) The reasons why this can happen, though, are 
numerous. We've seen it in at least two situations: 1) the machine 
serving the map output was having a huge load spike and so didn't 
respond, and 2) we accidentally gave several nodes the same name, so 
Hadoop wasn't able to correctly contact the "real" node for that name.
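For the duplicate-name case, a quick sanity check of the cluster's host 
list can rule it out. Here's a minimal sketch (not from the thread - the 
conf/slaves file name is an assumption about your setup):

```python
# Hypothetical sketch: detect duplicate node names in a cluster host
# list (e.g. the lines of Hadoop's conf/slaves file). Duplicate names
# can leave Hadoop contacting the wrong node, which can surface as
# fetch failures.
from collections import Counter

def find_duplicates(hostnames):
    """Return the names that appear more than once, sorted."""
    counts = Counter(h.strip() for h in hostnames if h.strip())
    return sorted(name for name, n in counts.items() if n > 1)

# Example usage against the slaves file:
#   with open("conf/slaves") as f:
#       print(find_duplicates(f))
```

If that prints anything, fix the names before chasing network issues.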

Your specific issue may be different, though, so you'll need to debug 
the network error yourself.

HTH,

DR