Posted to user@nutch.apache.org by Shawn Gervais <pr...@project10.net> on 2006/04/12 23:46:49 UTC

How best to debug failed fetch-reduce task

Greetings list,

I am trying to debug why my fetch process is dying on the reduce side - 
I see a single reduce task out of 16 dying with the following message:

Timed out.
java.io.IOException: Task process exit with nonzero status.
        at org.apache.hadoop.mapred.TaskRunner.runChild

Which is caused by:

060412 083015 task_r_8dpshs 0.8685376% reduce > reduce
060412 083016 task_r_8dpshs 0.8685678% reduce > reduce
060412 084023 Task task_r_8dpshs timed out.  Killing.

I have unsuccessfully attempted to determine the cause of this timeout. 
It seems to occur only on larger fetches -- after commenting out 
'-.*(/.+?)/.*?\1/.*?\1/' in the regex-urlfilter.txt file (per some 
suggestions on the list), I performed a successful fetch of 1M pages; 
prior to that, 1M-page fetches were unstable. I then launched a fetch 
of 10M pages, about 1/5th of my target amount, and ran into the same 
problem again. JDK 1.4 versus 1.5 seems to make no difference.
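
For reference, that rule uses backreferences, which I gather can 
backtrack very slowly on long URLs. A throwaway test class along these 
lines (my own sketch, not part of Nutch) lets me check a suspect URL 
against the pattern in isolation:

    // RegexCheck.java -- usage: java RegexCheck <url>
    import java.util.regex.Pattern;

    public class RegexCheck {
      public static void main(String[] args) {
        // Pattern body of the filter rule; the leading '-' in
        // regex-urlfilter.txt means "reject" and is not regex syntax.
        Pattern p = Pattern.compile(".*(/.+?)/.*?\\1/.*?\\1/");
        // If this call never returns for some URL, the rule itself --
        // not the fetcher -- is what is burning the CPU.
        System.out.println(p.matcher(args[0]).find());
      }
    }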

When the reduce side of the fetch fails like this, it seems to render 
the entire segment unusable. I cannot re-run the fetch on the failed 
segment, nor can I updatedb using the failed segment. So in the end it 
seems I am left with useless data, and ~6 hours wasted.

When I have been at the terminal to observe the timed-out process 
before it is reaped, I have seen that it continues to use 100% of a 
single processor. An strace of the Java process did not produce any 
usable leads. When the reduce task is reassigned, whether to the same 
machine or another, it dies at around the same completion percentage.

Is there an option I can enable somewhere that will allow for more 
verbose output to be written to the logs? Any other suggestions on 
debugging this issue? It seems to me that it might be possible to take a 
snapshot of the task while it is running (i.e. data and the task job 
jar), so that I can debug it in isolation without re-running an entire 
fetch process. I am unsure of how this might be done, though.

Regards,
-Shawn

Re: How best to debug failed fetch-reduce task

Posted by Doug Cutting <cu...@apache.org>.
Shawn Gervais wrote:
> When I have been at the terminal to observe the timed-out process 
> before it is reaped, I have seen that it continues to use 100% of a 
> single processor. An strace of the Java process did not produce any 
> usable leads. When the reduce task is reassigned, whether to the same 
> machine or another, it dies at around the same completion percentage.

Did you try 'kill -QUIT' on the process?  That should print a stack 
trace for every thread.
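
On the tasktracker node that is running the task, something like the 
following should do it (the grep pattern is just whatever identifies 
the child JVM's command line):

    ps axwww | grep task_r_8dpshs    # locate the child JVM for the task
    kill -QUIT <pid>                 # <pid> is the process id found above

The dump should end up in the task's stdout log under the 
tasktracker's log directory.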

> Is there an option I can enable somewhere that will allow for more 
> verbose output to be written to the logs? Any other suggestions on 
> debugging this issue?

You could add some print statements to FetcherOutputFormat.java, in 
the RecordWriter.write() method, printing each key (URL) written. That 
might let you figure out what page is hanging things.
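
Something along these lines, inside the RecordWriter returned by 
getRecordWriter() -- I am sketching from memory, so check it against 
the actual method signature in your tree:

    public void write(WritableComparable key, Writable value)
        throws IOException {
      // Debug aid: print every key (the page URL) before it is
      // written.  The last URL printed before the task stalls is
      // the page to investigate.
      System.out.println("FetcherOutputFormat.write: " + key);
      // ... the existing write logic stays unchanged below ...
    }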

> It seems to me that it might be possible to take a 
> snapshot of the task while it is running (i.e. data and the task job 
> jar), so that I can debug it in isolation without re-running an entire 
> fetch process. I am unsure of how this might be done, though.

Once you know the page (assuming the hang is deterministic), you 
should be able to run a fetch of just that page to test things.
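
With the mapred-style commands that would be roughly the following 
(the URL and the db/segment paths are placeholders):

    mkdir urls
    echo 'http://the.suspect.url/page' > urls/seed.txt
    bin/nutch inject testdb urls
    bin/nutch generate testdb testsegments
    bin/nutch fetch testsegments/2006*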

Doug