Posted to user@nutch.apache.org by hzhong <he...@gmail.com> on 2007/08/20 20:52:06 UTC
nutch links repository
Hello,
I have gotten nutch to crawl and index webpages.
How can I see all the webpages nutch crawled? In other words, I want to
know which urls nutch has crawled.
Are all the urls ever crawled stored in crawlDB?
Thank you very much
Re: nutch links repository
Posted by John Mendenhall <jo...@surfutopia.net>.
> How can I see all the webpages nutch crawled? In other words, I want to
> know which urls nutch has crawled.
>
> Are all the urls ever crawled stored in crawlDB?
Run /usr/local/nutch/bin/nutch readdb with the -dump
option; it will dump all the URLs into a new
directory, which you can peruse at your leisure.
Usage: CrawlDbReader <crawldb> (-stats | -dump <out_dir> | -topN <nnnn> <out_dir> [<min>] | -url <url>)
  <crawldb>                       directory name where crawldb is located
  -stats                          print overall statistics to System.out
  -dump <out_dir>                 dump the whole db to a text file in <out_dir>
  -url <url>                      print information on <url> to System.out
  -topN <nnnn> <out_dir> [<min>]  dump top <nnnn> urls sorted by score to <out_dir>
    [<min>]                       skip records with scores below this value.
                                  This can significantly improve performance.
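For example, a dump-then-filter pipeline might look like the sketch below. The nutch invocation is commented out since it needs a live install; the sample dump file mimics the text format readdb produces (URL, a tab, then the CrawlDatum fields), so the grep/cut step can be tried on its own. Paths and the exact field layout are assumptions; check your own dump output.

```shell
# On a real install you would run something like:
#   /usr/local/nutch/bin/nutch readdb crawl/crawldb -dump crawldb_dump
#
# Sample data standing in for the dump (format assumed: each record
# starts with the URL, a tab, then "Version: ..." and status lines):
mkdir -p crawldb_dump
printf 'http://example.com/\tVersion: 7\nStatus: 2 (db_fetched)\n\nhttp://example.org/page\tVersion: 7\nStatus: 1 (db_unfetched)\n' \
  > crawldb_dump/part-00000

# Keep only the lines that start a record (they begin with the URL),
# then cut the tab-separated CrawlDatum part, leaving one URL per line:
grep '^http' crawldb_dump/part-00000 | cut -f1 > urls.txt
cat urls.txt
```

This lists every URL in the crawldb, fetched or not; if you only want pages that were actually fetched, you would additionally filter on the Status line of each record.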
Or, you can write your own class that outputs
whatever you want from the database...
JohnM
--
john mendenhall
john@surfutopia.net
surf utopia
internet services