Posted to user@nutch.apache.org by Shawn Gervais <pr...@project10.net> on 2006/04/07 09:22:59 UTC

Large fetch fails with "Task process exit with nonzero status"

Hello,

I am trying to perform a large fetch (1 million pages), and observing 
some reduce tasks dying with the following message:

Timed out.java.io.IOException: Task process exit with nonzero status. at 
org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:273) at 
org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:145)

A little bit about my environment:

- I am running a test cluster of 16 machines, dual 3GHz Xeons with 2GB 
of RAM each, running JRE 1.5.0_06
- Running Nutch 0.8-dev, built from trunk this afternoon. Hadoop 0.1.0 
taken from the nightly build.

All fetch tasks (32 of 32) complete successfully, as do most reduce 
tasks. However, one or two reduce tasks will fail with the above 
message. Upon failure, they are rescheduled to another tracker as 
expected.

The rescheduled reduce task will run up to the same point at which the 
previous one died, then sit around for ~10 minutes and die with the 
same message. The jobtracker will reschedule the reduce task a few times 
before giving up -- the entire job is aborted.

I was able to perform a successful fetch of 250,000 pages in my initial 
tests. I then tried to scale it up to 1M pages and I'm now stuck :/

Can anyone provide some clues as to where I might start on debugging 
this issue?

Regards,
-Shawn

Re: Large fetch fails with "Task process exit with nonzero status"

Posted by Ken Krugler <kk...@transpac.com>.
Hi Shawn,

>I am trying to perform a large fetch (1 million pages), and 
>observing some reduce tasks dying with the following message:
>
>Timed out.java.io.IOException: Task process exit with nonzero 
>status. at 
>org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:273) at 
>org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:145)

In the past, the two things that triggered this type of error when we 
went to bigger jobs were:

1. Running out of file descriptors.

2. IPC timeouts with big splits.

So try bumping the file descriptor limit (on all servers) to, say, 16K, 
and increasing the IPC timeout value in your config file.
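
For concreteness, a rough sketch of both changes is below. The shell 
commands are generic Linux; the ipc.client.timeout property name and 
its placement in conf/hadoop-site.xml are my assumptions based on 
Hadoop configs of this era, so check your hadoop-default.xml before 
relying on them.

   # Check the current open-file limit and how many descriptors the
   # tasktracker child process actually has open (PID is illustrative):
   ulimit -n
   ls /proc/<tasktracker-pid>/fd | wc -l

   # Raise the limit to 16K on every node, either in the shell that
   # launches the Hadoop daemons or persistently via
   # /etc/security/limits.conf:
   ulimit -n 16384
   #   *  soft  nofile  16384
   #   *  hard  nofile  16384

   # In conf/hadoop-site.xml, lengthen the IPC client timeout (value is
   # in milliseconds); verify the property name against your
   # hadoop-default.xml:
   <property>
     <name>ipc.client.timeout</name>
     <value>600000</value>  <!-- e.g. 10 minutes -->
   </property>

Restart the tasktrackers after either change so the new limit and 
timeout actually take effect.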

-- Ken


>A little bit about my environment:
>
>- I am running a test cluster of 16 machines, dual 3GHz Xeons with 
>2GB of RAM each, running JRE 1.5.0_06
>- Running Nutch 0.8-dev, built from trunk this afternoon. Hadoop 
>0.1.0 taken from the nightly build.
>
>All fetch tasks (32 of 32) complete successfully, as do most reduce 
>tasks. However, one or two reduce tasks will fail with the above 
>message. Upon failure, they are rescheduled to another tracker as 
>expected.
>
>The rescheduled reduce task will run up to the same point at which 
>the previous one died, then sit around for ~10 minutes and die with 
>the same message. The jobtracker will reschedule the reduce task a 
>few times before giving up -- the entire job is aborted.
>
>I was able to perform a successful fetch of 250,000 pages in my 
>initial tests. I then tried to scale it up to 1M pages and I'm now 
>stuck :/
>
>Can anyone provide some clues as to where I might start on debugging 
>this issue?
>
>Regards,
>-Shawn


-- 
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"