Posted to user@nutch.apache.org by Matei Zaharia <ma...@eecs.berkeley.edu> on 2007/11/18 05:07:49 UTC

Reduce job in invertlinks and index tasks often fails

Hi,

I am trying to index about 2 million pages that I've crawled using Nutch.
When I run the bin/nutch invertlinks and index commands, my reduce tasks
often fail with the following message:

Task task_200711171111_0003_r_000000_1 failed to report status for 600 seconds. Killing!

(The reported time ranges from 600 to 605 seconds or so.) This happens
while the tasks are copying their input data. Is there a way around this
timeout?

I've also noticed that Nutch always uses only one reducer for these
tasks, regardless of the size of the DB. Is this by design, or is there a
way to configure the number of reducers and make the jobs finish faster?
The jobs take about 2 hours, most of which is spent running the sole reducer.

Thanks,
Matei