Posted to user@nutch.apache.org by Dan Plubell <dp...@swbell.net> on 2008/05/14 18:22:54 UTC

Recover Nutch Crawl

I'm using the org.apache.nutch.crawl.Crawl class (Nutch 0.9) on a single machine.  The fetcher completed OK, but the LinkDb.invert step failed because the machine ran out of disk space.
Can I run the LinkDb.invert step and the remaining steps manually?
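For reference, here is my sketch of what I believe the manual equivalents of the remaining Crawl steps look like with the bin/nutch script, based on the crawl/ layout shown in the log below (the crawl/indexes and crawl/index paths are just my guess at the defaults the Crawl class uses, and the exact options may need adjusting):

 # invert links from the already-fetched segments into the linkdb
 bin/nutch invertlinks crawl/linkdb -dir crawl/segments

 # index all segments against the crawldb and linkdb
 # (shell glob should work since this is a single machine on the local filesystem)
 bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*

 # remove duplicate documents from the indexes
 bin/nutch dedup crawl/indexes

 # merge the part indexes into a single index
 bin/nutch merge crawl/index crawl/indexes

If that is the right sequence, I'm hoping I can resume from invertlinks rather than re-fetching.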
In the \mapred\local directory there are several \map_* directories.  Do the invert, index, dedup, and merge steps need these directories?  I need to free up some disk space and I'm wondering if they can be deleted.
Here are the entries from the log file...
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20080504231054]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: crawl/segments/20080421114841
LinkDb: adding segment: crawl/segments/20080421115732
LinkDb: adding segment: crawl/segments/20080421130158
LinkDb: adding segment: crawl/segments/20080421144524
LinkDb: adding segment: crawl/segments/20080421214809
LinkDb: adding segment: crawl/segments/20080422042411
LinkDb: adding segment: crawl/segments/20080422114958
LinkDb: adding segment: crawl/segments/20080424063149
LinkDb: adding segment: crawl/segments/20080430101435
LinkDb: adding segment: crawl/segments/20080504231054
Exception in thread "main" java.io.IOException: Job failed!
 at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
 at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:232)
 at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:209)
 at org.apache.nutch.crawl.Crawl.main(Crawl.java:136)
Thanks,
Dan