Posted to user@nutch.apache.org by Haward <sm...@gmail.com> on 2006/10/26 17:59:26 UTC
Problem in executing Nutch Tutorial
Hi,
I am a newbie. Please assist!
I am using cygwin (windows xp) and Nutch 0.8.1.
In crawl-urlfilter.txt, I modified:
#+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
+^http://([a-z0-9]*\.)*cnn.com/
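The pattern can be sanity-checked outside Nutch with plain java.util.regex (which Nutch's URL filter is built on). Note that the dot in `cnn.com` is left unescaped, so the regex is slightly looser than intended; `cnn\.com` would match only the literal domain. A minimal check, assuming find()-style matching:

```java
import java.util.regex.Pattern;

// Quick sanity check of the crawl-urlfilter pattern outside Nutch.
// The unescaped dot in "cnn.com" matches any character; "cnn\\.com"
// would be stricter, matching only a literal dot.
public class FilterCheck {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("^http://([a-z0-9]*\\.)*cnn.com/");
        System.out.println(p.matcher("http://www.cnn.com/").find());  // true
        System.out.println(p.matcher("http://example.org/").find());  // false
    }
}
```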
$ mkdir urls
$ echo 'http://www.cnn.com' > urls/seeds.txt
$ nutch crawl urls -dir db -depth 1 -threads 1 -topN 10
I got the following error:
myname@localhost /cygdrive/d/corpus/data
$ nutch crawl urls -dir db -depth 1 -threads 1 -topN 10
crawl started in: db
rootUrlDir = urls
threads = 1
depth = 1
topN = 10
Injector: starting
Injector: crawlDb: db/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: starting
Generator: segment: db/segments/20061026061130
Generator: Selecting best-scoring urls due for fetch.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: db/segments/20061026061130
Fetcher: threads: 1
fetching http://www.cnn.com/
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: db/crawldb
CrawlDb update: segment: db/segments/20061026061130
CrawlDb update: Merging segment data into db.
CrawlDb update: done
LinkDb: starting
LinkDb: linkdb: db/linkdb
LinkDb: adding segment: db/segments/20061026061130
LinkDb: done
Indexer: starting
Indexer: linkdb: db/linkdb
Indexer: adding segment: db/segments/20061026061130
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:357)
at org.apache.nutch.indexer.Indexer.index(Indexer.java:296)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:121)
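The one-line "Job failed!" from runJob hides the underlying exception. In Nutch 0.8.x the detailed stack trace normally lands in logs/hadoop.log under the Nutch install directory (assuming the default log4j setup has not been changed); inspecting its tail usually reveals the real indexer error:

```shell
# Show the tail of Nutch's log, where the indexer's real exception is recorded.
# Falls back to a message if the log does not exist at this path.
tail -n 50 logs/hadoop.log 2>/dev/null || echo "logs/hadoop.log not found"
```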
Help!!!
Regards,
Haward