Posted to user@nutch.apache.org by Tomi N/A <he...@gmail.com> on 2007/04/12 09:33:29 UTC

crawl problem with nutch 0.9

I wanted to do a test vertical crawl (db.ignore.external.links=true)
of several dozen sites using "nutch crawl urlDir -threads 10 -depth 6
-topN 32768 -dir /var/nutch/testindex"...
FWIW, I ran the crawl on an Athlon 1900 with 1.5GB RAM and the crawl
directory size is about 2.4GB. Maximum memory usage was about
1.6-1.7GB (went into swap).

This is what I found at the end of hadoop.log when the process finished:

2007-04-12 04:11:24,903 INFO  indexer.Indexer - Indexer: done
2007-04-12 04:11:25,138 INFO  indexer.DeleteDuplicates - Dedup: starting
2007-04-12 04:11:26,178 INFO  indexer.DeleteDuplicates - Dedup: adding indexes in: /var/nutch/testindex/indexes
2007-04-12 04:12:59,636 INFO  indexer.DeleteDuplicates - Dedup: done
2007-04-12 04:12:59,637 INFO  indexer.IndexMerger - merging indexes to: /var/nutch/testindex/index
2007-04-12 04:12:59,684 INFO  indexer.IndexMerger - Adding /var/nutch/testindex/indexes/part-00000
2007-04-12 04:16:09,532 INFO  indexer.IndexMerger - done merging
2007-04-12 04:16:09,728 INFO  crawl.Crawl - crawl finished: /var/nutch/testindex


Looks to me like everything was in perfect order, but I got the
following error when querying the index through the nutch web ui:
"HTTP Status 404 - /var/nutch/testindex/index/segments (No such file
or directory)"

This is what I saw in the /var/nutch/testindex/index directory:
$ ls
_0.fdt  _0.fnm  _0.nrm  _0.tii  segments_2
_0.fdx  _0.frq  _0.prx  _0.tis  segments.gen

Obviously, there is no segments file.
Any ideas why that is?

TIA,
t.n.a.

Re: crawl problem with nutch 0.9

Posted by Tomi N/A <he...@gmail.com>.

Sorry, my fault (no surprise :)): I didn't know the nutch web UI
depends on the crawler version: I was using the 0.8.1 UI with a 0.9
index.
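
For anyone who hits the same 404, here is a rough sketch of a check that
tells the two index layouts apart by file name alone. It assumes (based
only on the error message and the ls output above) that the older UI
looks for a file literally named "segments", while the 0.9 indexer
writes a generation-numbered "segments_N" file next to "segments.gen".
The function name index_layout is made up for illustration and is not
part of nutch or lucene.

```python
import os
import re
import tempfile


def index_layout(index_dir):
    """Guess the index naming scheme from the file names in index_dir.

    Returns "old" if a file literally named "segments" exists,
    "new" if a generation-numbered "segments_N" file exists,
    and "unknown" otherwise. This only inspects file names; it does
    not open or validate the index itself.
    """
    names = set(os.listdir(index_dir))
    if "segments" in names:
        return "old"
    # Assumption: the suffix after "segments_" is alphanumeric (e.g. "2").
    if any(re.fullmatch(r"segments_[0-9a-z]+", name) for name in names):
        return "new"
    return "unknown"


# Recreate the directory listing from the post in a scratch directory.
demo_dir = tempfile.mkdtemp()
for name in ["_0.fdt", "_0.fdx", "_0.fnm", "_0.frq", "_0.nrm",
             "_0.prx", "_0.tii", "_0.tis", "segments_2", "segments.gen"]:
    open(os.path.join(demo_dir, name), "w").close()

print(index_layout(demo_dir))  # prints "new"
```

If this reports "new" for the index directory while the web UI complains
about a missing "segments" file, a UI/crawler version mismatch like mine
is a likely suspect.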

t.n.a.