Posted to user@nutch.apache.org by Omar <or...@yahoo.com> on 2006/09/27 20:15:22 UTC

Problems with Nutch 0.8.1

Hello,

I'm new to Nutch. I've downloaded the latest version (0.8.1) and I'm running
it on Windows XP. I followed the instructions in the tutorial
(http://lucene.apache.org/nutch/tutorial8.html), but I'm having problems
crawling a small intranet site. Here are my steps:

$ bin/nutch crawl testa -dir test4 -depth 3 -topN 50 >& crawl.log

-- Output of crawl looks fine 

$ more crawl.log
crawl started in: test4
rootUrlDir = testa
threads = 10
depth = 3
topN = 50
Injector: starting
Injector: crawlDb: test4/crawldb
Injector: urlDir: testa
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: starting
Generator: segment: test4/segments/20060927105913
Generator: Selecting best-scoring urls due for fetch.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: test4/segments/20060927105913
Fetcher: threads: 10
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: test4/crawldb
CrawlDb update: segment: test4/segments/20060927105913

-- Checking the output 

$ bin/nutch  readdb test4 -url 
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2
        at org.apache.nutch.crawl.CrawlDbReader.main(CrawlDbReader.java:445)

$ bin/nutch  readdb test4 -stats
CrawlDb statistics start: test4
Exception in thread "main" java.io.IOException: Input directory c:/ir/nutch-0.8.1/test4/current in local is invalid.
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
        at org.apache.nutch.crawl.CrawlDbReader.processStatJob(CrawlDbReader.java:259)
        at org.apache.nutch.crawl.CrawlDbReader.main(CrawlDbReader.java:440)

-- also tried checking the integrity of the crawl ...

$ bin/nutch org.apache.nutch.searcher.NutchBean apache
Total hits: 0

What is wrong? Thanks for any help.

--Omar
-- 
View this message in context: http://www.nabble.com/Problems-with-Nutch-0.8.1-tf2346446.html#a6532322
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Problems with Nutch 0.8.1

Posted by Omar <or...@yahoo.com>.

I just want to add that I tried the same thing on Linux, with the same results.
Furthermore, the log file contains an error:

/ir/nutch-0.8.1/logs> tail hadoop.log
2006-09-27 15:24:47,747 INFO  indexer.Indexer - Optimizing index.
2006-09-27 15:24:48,490 INFO  indexer.Indexer - Indexer: done
2006-09-27 15:24:48,496 INFO  indexer.DeleteDuplicates - Dedup: starting
2006-09-27 15:24:48,522 INFO  indexer.DeleteDuplicates - Dedup: adding indexes in: test4/indexes
2006-09-27 15:24:51,251 INFO  indexer.DeleteDuplicates - Dedup: done
2006-09-27 15:24:51,255 INFO  indexer.IndexMerger - Adding test4/indexes/part-00000
2006-09-27 15:24:51,259 INFO  crawl.Crawl - crawl finished: test4
2006-09-27 15:26:16,914 INFO  crawl.CrawlDbReader - CrawlDb statistics start: -test4
2006-09-27 15:26:17,681 ERROR mapred.JobClient - Input directory /home/omar/ir/nutch-0.8.1/-test4/current in local is invalid.

I read a post saying that others have had similar issues and that a patch is
available. Where can I download it? Thanks.

Re: Problems with Nutch 0.8.1 - Fixed

Posted by Omar <or...@yahoo.com>.

Fixed. I had a typo in the config file. My bad; everything works now.