Posted to common-user@hadoop.apache.org by Karl Anderson <kr...@monkey.org> on 2009/01/07 05:07:38 UTC
Re: Unusual Failure of jobs
On 22-Dec-08, at 10:24 AM, Nathan Marz wrote:
> I have been experiencing some unusual behavior from Hadoop recently.
> When trying to run a job, some of the tasks fail with:
>
> java.io.IOException: Task process exit with nonzero status of 1.
> at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
> at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
>
>
> Not all the tasks fail, but enough tasks fail such that the job
> fails. Unfortunately, there are no further logs for these tasks.
> Trying to retrieve the logs produces:
>
> HTTP ERROR: 410
>
> Failed to retrieve stdout log for task:
> attempt_200811101232_0218_m_000001_0
>
> RequestURI=/tasklog
>
>
> It seems like the tasktracker isn't able to even start the tasks on
> those machines. Has anyone seen anything like this before?
I see this on jobs that also get the "too many open files" task
errors, or on subsequent jobs. I've always assumed that it's another
manifestation of the same problem. Once I start getting these errors,
I keep getting them until I shut down the cluster, although I don't
always get enough to cause a job to fail. I haven't bothered
restarting individual boxes or services.
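Since these failures seem tied to "too many open files" errors, a common first check (not something from this thread; the limit value below is an arbitrary assumption, and the username is illustrative) is the per-process file-descriptor limit on each node:

```shell
# Show the current per-process open-file (file descriptor) soft limit
# for this shell; child processes such as Hadoop daemons inherit it.
ulimit -n

# Raising the soft limit for the shell that launches the daemons might
# look like this; 65536 is an illustrative value, not a Hadoop
# recommendation, and it is commented out because the hard limit on a
# given box may be lower.
# ulimit -n 65536

# A permanent change usually goes in /etc/security/limits.conf, e.g.
# (user name and values are assumptions):
#   hadoop  soft  nofile  65536
#   hadoop  hard  nofile  65536
```

After raising the limit you would normally restart the tasktracker so it picks up the new value.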
I haven't been able to reproduce it consistently, but it seems to
happen when I have many small input files; a job that worked with one
large input file broke after I split the input into many small files.
I'm using Streaming.
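If many small inputs are the trigger, one workaround (my suggestion, not something tested in this thread; all file and directory names below are made up) is to concatenate the small files into a few larger ones before submitting the job, so Streaming creates fewer splits and opens fewer files per task:

```shell
# Stand-in for "many small input files" (names are illustrative).
mkdir -p small_inputs combined
for i in 0 1 2 3; do echo "record $i" > "small_inputs/part-$i.txt"; done

# Concatenate them into one larger file; a real job would then upload
# combined/ to HDFS instead of the directory of small files.
cat small_inputs/part-*.txt > combined/input-00.txt

wc -l combined/input-00.txt
```

For line-oriented Streaming input this is safe because record boundaries are newlines, so concatenation doesn't corrupt records.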
Karl Anderson
kra@monkey.org
http://monkey.org/~kra