Posted to user@nutch.apache.org by Tomi N/A <he...@gmail.com> on 2007/04/12 09:33:29 UTC
crawl problem with nutch 0.9
I wanted to do a test vertical crawl (db.ignore.external.links=true)
of several dozen sites using "nutch crawl urlDir -threads 10 -depth 6
-topN 32768 -dir /var/nutch/testindex"...
FWIW, I ran the crawl on an Athlon 1900 with 1.5GB RAM and the crawl
directory size is about 2.4GB. Maximum memory usage was about
1.6-1.7GB (went into swap).
This is what I found at the end of hadoop.log when the process finished:
2007-04-12 04:11:24,903 INFO indexer.Indexer - Indexer: done
2007-04-12 04:11:25,138 INFO indexer.DeleteDuplicates - Dedup: starting
2007-04-12 04:11:26,178 INFO indexer.DeleteDuplicates - Dedup: adding indexes in: /var/nutch/testindex/indexes
2007-04-12 04:12:59,636 INFO indexer.DeleteDuplicates - Dedup: done
2007-04-12 04:12:59,637 INFO indexer.IndexMerger - merging indexes to: /var/nutch/testindex/index
2007-04-12 04:12:59,684 INFO indexer.IndexMerger - Adding /var/nutch/testindex/indexes/part-00000
2007-04-12 04:16:09,532 INFO indexer.IndexMerger - done merging
2007-04-12 04:16:09,728 INFO crawl.Crawl - crawl finished: /var/nutch/testindex
Looks to me like everything was in perfect order, but I got the
following error when querying the index through the nutch web ui:
"HTTP Status 404 - /var/nutch/testindex/index/segments (No such file
or directory)"
This is what I saw in the /var/nutch/testindex/index directory:
$ ls
_0.fdt _0.fnm _0.nrm _0.tii segments_2
_0.fdx _0.frq _0.prx _0.tis segments.gen
Obviously, there is no segments file.
Any ideas why that is?
TIA,
t.n.a.
Re: crawl problem with nutch 0.9
Posted by Tomi N/A <he...@gmail.com>.
Sorry, my fault (no surprise :)): I didn't know the nutch web UI is
dependent on its crawler version: I was using the 0.8.1 UI with a 0.9
index.
t.n.a.
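
For anyone hitting the same 404, here is a rough sketch of a check that tells the two index layouts apart before pointing a web UI at the index. It assumes the difference comes down to the name of the Lucene commit file: a single "segments" file in the older layout the 0.8.1 UI was looking for, versus "segments_N" plus "segments.gen" as in the listing above. The function name and the labels it returns are made up for illustration and are not part of Nutch or Lucene.

```python
import os
import re
import tempfile


def index_commit_style(index_dir):
    """Classify an index directory by the name of its commit file.

    Returns "generational" if a segments_N file is present,
    "legacy" if only a plain "segments" file is present,
    and "unknown" otherwise. The labels are illustrative only.
    """
    names = set(os.listdir(index_dir))
    # segments_N: the suffix is assumed to be a lowercase alphanumeric
    # generation number, e.g. segments_2. "segments.gen" does not match.
    if any(re.fullmatch(r"segments_[0-9a-z]+", name) for name in names):
        return "generational"
    if "segments" in names:
        return "legacy"
    return "unknown"


def make_demo_dir(file_names):
    """Create a temporary directory containing empty files with the given names."""
    demo_dir = tempfile.mkdtemp()
    for name in file_names:
        open(os.path.join(demo_dir, name), "w").close()
    return demo_dir


if __name__ == "__main__":
    # File names taken from the directory listing in the original post.
    listing = [
        "_0.fdt", "_0.fnm", "_0.nrm", "_0.tii", "segments_2",
        "_0.fdx", "_0.frq", "_0.prx", "_0.tis", "segments.gen",
    ]
    print(index_commit_style(make_demo_dir(listing)))
```

If the check reports the generational layout and the web UI still asks for a plain "segments" file, the UI and the crawler are most likely from different releases, which is what happened here.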