Posted to user@nutch.apache.org by mo...@richmondinformatics.com on 2006/03/21 15:31:58 UTC
Recover an aborted fetch process
Hello Team,
*** Using nutch-0.8-dev of 2006-03-18, I have successfully completed the
generate, fetch, updatedb, invertlinks and index cycles for a few segments,
growing to about a million pages each, across a cluster of five machines.
The last fetch reported 100% completion of both the map and reduce tasks,
and then produced the "Exception in thread "main" java.io.IOException: Job
failed!" shown below:
060320 215719 Fetcher: starting
060320 215719 Fetcher: segment: /user/root/crawlG/segments/20060320210831
060320 215719 parsing file:/home/nutch/nutch-2006-03-18/conf/hadoop-default.xml
060320 215719 parsing file:/home/nutch/nutch-2006-03-18/conf/nutch-default.xml
060320 215719 parsing jar:file:/home/nutch/nutch-2006-03-18/lib/hadoop-0.1-dev.jar!/mapred-default.xml
060320 215719 parsing file:/home/nutch/nutch-2006-03-18/conf/nutch-site.xml
060320 215719 parsing file:/home/nutch/nutch-2006-03-18/conf/hadoop-site.xml
060320 215719 Client connection to 193.203.240.118:50020: starting
060320 215719 Client connection to 193.203.240.118:50000: starting
060320 215719 parsing file:/home/nutch/nutch-2006-03-18/conf/hadoop-default.xml
060320 215719 parsing file:/home/nutch/nutch-2006-03-18/conf/hadoop-site.xml
060320 215721 Running job: job_ps2g67
060320 215722 map 0% reduce 0%
060320 215851 map 1% reduce 0%
...
... snipped for brevity ....
...
060321 063748 map 100% reduce 98%
060321 064938 map 100% reduce 100%
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:310)
        at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:366)
        at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:400)
*** I have tried, and failed, to implement the "recover an aborted fetch
process" procedure from the FAQ, although I'm not sure that the fetch was
actually aborted.
*** However, I have tested creating a fresh crawl directory, and successfully
completed a few cycles there using the same commands as above. The commands
were:
# bin/nutch generate crawlG/crawldb crawlG/segments -topN 1000
# bin/nutch fetch crawlG/segments/20060320084332 -threads 150
# bin/nutch updatedb crawlG/crawldb crawlG/segments/20060320084332
# bin/nutch invertlinks crawlG/linkdb crawlG/segments/20060320084332
# bin/nutch index crawlG/indexes crawlG/crawldb crawlG/linkdb crawlG/segments/20060320084332
"bin/nutch readdb crawlT/crawldb -stats" prduced the following statistics after
the last successful cycle:
060320 193128 Statistics for CrawlDb: crawlG/crawldb
060320 193128 TOTAL urls: 19722161
060320 193128 avg score: 1.051
060320 193128 max score: 5365.96
060320 193128 min score: 1.0
060320 193128 retry 0: 19685263
060320 193128 retry 1: 36311
060320 193128 retry 2: 510
060320 193128 retry 3: 77
060320 193128 status 1 (DB_unfetched): 17851813
060320 193128 status 2 (DB_fetched): 1808788
060320 193128 status 3 (DB_gone): 61560
060320 193128 CrawlDb statistics: done
*** A "dfs -du" of my segments directory follows:
# bin/hadoop dfs -du /user/root/crawlG/segments
060321 132948 parsing file:/home/nutch/nutch-2006-03-18/conf/hadoop-default.xml
060321 132948 parsing file:/home/nutch/nutch-2006-03-18/conf/hadoop-site.xml
060321 132948 No FS indicated, using default:nutch1.houxou.com:50000
060321 132948 Client connection to 193.203.240.118:50000: starting
Found 16 items
/user/root/crawlG/segments/20060319183327 71296
/user/root/crawlG/segments/20060319190004 391516
/user/root/crawlG/segments/20060319190946 143496
/user/root/crawlG/segments/20060319190955 143703
/user/root/crawlG/segments/20060319192435 5825
/user/root/crawlG/segments/20060319192546 344023
/user/root/crawlG/segments/20060319194405 1463553
/user/root/crawlG/segments/20060319194555 1461686
/user/root/crawlG/segments/20060319200925 8210262
/user/root/crawlG/segments/20060319201136 8178156
/user/root/crawlG/segments/20060319204103 342444
/user/root/crawlG/segments/20060319204639 49113161
/user/root/crawlG/segments/20060319210546 220718005
/user/root/crawlG/segments/20060319215723 7098084697
/user/root/crawlG/segments/20060320084332 6320550466
/user/root/crawlG/segments/20060320210831 4976683531
I'd be grateful for any ideas, including whether there would be any mileage
in deleting the errant segment and, reluctantly, starting the cycle again...?
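In case it helps frame the question, the delete-and-regenerate path I'm
contemplating would look roughly like the following. This is only a sketch,
and I haven't run it yet; in particular I'm assuming "dfs -rm" is the right
way to remove a segment directory in this Hadoop build, and that because
updatedb was never run against the failed segment, its URLs are still marked
unfetched in the crawldb and would be re-selected by generate:

# Remove the segment whose fetch job failed (name from the listing above):
# bin/hadoop dfs -rm /user/root/crawlG/segments/20060320210831
# Generate a fresh segment; the unfetched URLs should be eligible again:
# bin/nutch generate crawlG/crawldb crawlG/segments -topN 1000
# Then fetch, updatedb, invertlinks and index the new segment as before.

Of course, if there is a way to salvage the fetched data already in
segments/20060320210831 instead, that would be much preferable.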
Many thanks,
Monu Ogbe