Posted to user@nutch.apache.org by mo...@richmondinformatics.com on 2006/03/21 15:31:58 UTC

Recover an aborted fetch process

Hello Team,

*** Using nutch-0.8-dev of 2006-03-18, I have successfully completed the
generate, fetch, updatedb, invertlinks and index cycles for a few segments
growing to about a million pages each, across a cluster of five machines.

The last fetch reported completing 100% of the map and reduce tasks, and then
failed with "Exception in thread "main" java.io.IOException: Job failed!", as
shown below:

060320 215719 Fetcher: starting
060320 215719 Fetcher: segment: /user/root/crawlG/segments/20060320210831
060320 215719 parsing file:/home/nutch/nutch-2006-03-18/conf/hadoop-default.xml
060320 215719 parsing file:/home/nutch/nutch-2006-03-18/conf/nutch-default.xml
060320 215719 parsing
jar:file:/home/nutch/nutch-2006-03-18/lib/hadoop-0.1-dev.jar!/mapred-default.xml
060320 215719 parsing file:/home/nutch/nutch-2006-03-18/conf/nutch-site.xml
060320 215719 parsing file:/home/nutch/nutch-2006-03-18/conf/hadoop-site.xml
060320 215719 Client connection to 193.203.240.118:50020: starting
060320 215719 Client connection to 193.203.240.118:50000: starting
060320 215719 parsing file:/home/nutch/nutch-2006-03-18/conf/hadoop-default.xml
060320 215719 parsing file:/home/nutch/nutch-2006-03-18/conf/hadoop-site.xml
060320 215721 Running job: job_ps2g67
060320 215722  map 0%  reduce 0%
060320 215851  map 1%  reduce 0%
...
... snipped for brevity ....
...
060321 063748  map 100%  reduce 98%
060321 064938  map 100%  reduce 100%
Exception in thread "main" java.io.IOException: Job failed!
	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:310)
	at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:366)
	at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:400)
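For what it's worth, I could grep the task logs on each node for the
underlying error, along these lines (the logs/ location is an assumption
based on the default install layout):

# grep -i exception /home/nutch/nutch-2006-03-18/logs/*.log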

*** I have tried, and failed, to follow the procedure in the FAQ for
recovering "an aborted fetch process", although I'm not sure that the fetch
was actually aborted.

*** However, I have tested creating a fresh "crawl directory" and successfully
completed a few cycles there, using the same commands as in the cycles
described above.  The commands follow:

# bin/nutch generate crawlG/crawldb crawlG/segments -topN 1000
# bin/nutch fetch crawlG/segments/20060320084332 -threads 150
# bin/nutch updatedb crawlG/crawldb crawlG/segments/20060320084332
# bin/nutch invertlinks crawlG/linkdb crawlG/segments/20060320084332
# bin/nutch index crawlG/indexes crawlG/crawldb crawlG/linkdb crawlG/segments/20060320084332
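
(For context, the crawldb was originally seeded with an inject step along
these lines; the seed directory name here is illustrative:)

# bin/nutch inject crawlG/crawldb urls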

"bin/nutch readdb crawlT/crawldb -stats" prduced the following statistics after
the last successful cycle:

060320 193128 Statistics for CrawlDb: crawlG/crawldb
060320 193128 TOTAL urls:       19722161
060320 193128 avg score:        1.051
060320 193128 max score:        5365.96
060320 193128 min score:        1.0
060320 193128 retry 0:  19685263
060320 193128 retry 1:  36311
060320 193128 retry 2:  510
060320 193128 retry 3:  77
060320 193128 status 1 (DB_unfetched):  17851813
060320 193128 status 2 (DB_fetched):    1808788
060320 193128 status 3 (DB_gone):       61560
060320 193128 CrawlDb statistics: done
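
If it would help diagnosis, individual crawldb records can be dumped as well,
e.g. (the output directory name is illustrative):

# bin/nutch readdb crawlG/crawldb -dump crawlG/crawldb_dump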

*** A "dfs -du" of my segments directory follows:

# bin/hadoop dfs -du /user/root/crawlG/segments
060321 132948 parsing file:/home/nutch/nutch-2006-03-18/conf/hadoop-default.xml
060321 132948 parsing file:/home/nutch/nutch-2006-03-18/conf/hadoop-site.xml
060321 132948 No FS indicated, using default:nutch1.houxou.com:50000
060321 132948 Client connection to 193.203.240.118:50000: starting
Found 16 items
/user/root/crawlG/segments/20060319183327       71296
/user/root/crawlG/segments/20060319190004       391516
/user/root/crawlG/segments/20060319190946       143496
/user/root/crawlG/segments/20060319190955       143703
/user/root/crawlG/segments/20060319192435       5825
/user/root/crawlG/segments/20060319192546       344023
/user/root/crawlG/segments/20060319194405       1463553
/user/root/crawlG/segments/20060319194555       1461686
/user/root/crawlG/segments/20060319200925       8210262
/user/root/crawlG/segments/20060319201136       8178156
/user/root/crawlG/segments/20060319204103       342444
/user/root/crawlG/segments/20060319204639       49113161
/user/root/crawlG/segments/20060319210546       220718005
/user/root/crawlG/segments/20060319215723       7098084697
/user/root/crawlG/segments/20060320084332       6320550466
/user/root/crawlG/segments/20060320210831       4976683531
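
I could also list the contents of the errant segment to see which parts were
actually written, e.g.:

# bin/hadoop dfs -ls /user/root/crawlG/segments/20060320210831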

I'd be grateful for any ideas, including whether there'd be any mileage in
deleting the errant segment and, reluctantly, starting the cycle again.
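
If deleting does turn out to be the way forward, I assume something like the
following would do it (assuming this build's "dfs -rm" removes directories),
followed by a fresh generate:

# bin/hadoop dfs -rm /user/root/crawlG/segments/20060320210831
# bin/nutch generate crawlG/crawldb crawlG/segments -topN 1000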

Many thanks,

Monu Ogbe