Posted to user@nutch.apache.org by Matei Zaharia <ma...@eecs.berkeley.edu> on 2007/11/18 05:07:49 UTC
Reduce job in invertlinks and index tasks often fails
Hi,
I am trying to index about 2 million pages I've crawled using Nutch.
When I run the bin/nutch invertlinks and index commands, I often get
my reduce tasks failing with the following message:
Task task_200711171111_0003_r_000000_1 failed to report status for 600
seconds. Killing!
(The reported time ranges from 600 to 605 seconds or so.) This happens
while the tasks are copying input data. Is there a way around this timeout?
I've also noticed that Nutch always uses only one reducer for these
tasks, regardless of the size of the DB. Is this by design, or is there a way
to configure the number of reducers and make the jobs finish faster? The jobs take
about 2 hours, most of which is spent running the sole reducer.
Thanks,
Matei