Posted to user@nutch.apache.org by hzhong <he...@gmail.com> on 2007/08/20 20:52:06 UTC
nutch links repository
Hello,
I have gotten nutch to crawl and index webpages.
How can I see all the webpages nutch crawled? In other words, I want to
know which urls nutch has crawled.
Are all the urls ever crawled stored in crawlDB?
Thank you very much
Re: nutch links repository
Posted by John Mendenhall <jo...@surfutopia.net>.
> How can I see all the webpages nutch crawled? In other words, I want to
> know which urls nutch has crawled.
>
> Are all the urls ever crawled stored in crawlDB?
Run /usr/local/nutch/bin/nutch readdb with the -dump
option; it will dump all the URLs into a new
directory, which you can peruse at your leisure.
Usage: CrawlDbReader <crawldb> (-stats | -dump <out_dir> | -topN <nnnn> <out_dir> [<min>] | -url <url>)
  <crawldb>                       directory name where crawldb is located
  -stats                          print overall statistics to System.out
  -dump <out_dir>                 dump the whole db to a text file in <out_dir>
  -url <url>                      print information on <url> to System.out
  -topN <nnnn> <out_dir> [<min>]  dump top <nnnn> urls sorted by score to <out_dir>
    [<min>]                       skip records with scores below this value.
                                  This can significantly improve performance.
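For example, a dump-then-filter pipeline might look like the sketch below. The nutch invocation is commented out since it needs a live install; the sample dump file mimics the text format readdb produces (URL, a tab, then the CrawlDatum fields), so the grep/cut step can be tried on its own. Paths and the exact field layout are assumptions; check your own dump output.

```shell
# On a real install you would run something like:
#   /usr/local/nutch/bin/nutch readdb crawl/crawldb -dump crawldb_dump
#
# Sample data standing in for the dump (format assumed: each record
# starts with the URL, a tab, then "Version: ..." and status lines):
mkdir -p crawldb_dump
printf 'http://example.com/\tVersion: 7\nStatus: 2 (db_fetched)\n\nhttp://example.org/page\tVersion: 7\nStatus: 1 (db_unfetched)\n' \
  > crawldb_dump/part-00000

# Keep only the lines that start a record (they begin with the URL),
# then cut the tab-separated CrawlDatum part, leaving one URL per line:
grep '^http' crawldb_dump/part-00000 | cut -f1 > urls.txt
cat urls.txt
```

This lists every URL in the crawldb, fetched or not; if you only want pages that were actually fetched, you would additionally filter on the Status line of each record.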
Or, you can write your own class that outputs
whatever you want from the database...
JohnM
--
john mendenhall
john@surfutopia.net
surf utopia
internet services