Posted to user@nutch.apache.org by te...@gmail.com on 2006/07/25 14:28:18 UTC

Links

Is there any way to find out which web pages on a specific domain have
been crawled by Nutch?
In other words, is there any way to get the list of URLs that were
downloaded and processed by Nutch?


Re: Links

Posted by Thomas Delnoij <di...@gmail.com>.
There's the 'nutch readdb' command:

tdelnoij@montblanc:~> nutch readdb
Usage: CrawlDbReader <crawldb> (-stats | -dump <out_dir> | -topN <nnnn> <out_dir> [<min>] | -url <url>)
        <crawldb>       directory name where crawldb is located
        -stats  print overall statistics to System.out
        -dump <out_dir> dump the whole db to a text file in <out_dir>
        -url <url>      print information on <url> to System.out
        -topN <nnnn> <out_dir> [<min>]  dump top <nnnn> urls sorted by score to <out_dir>
                [<min>] skip records with scores below this value.
                        This can significantly improve performance.
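For the original question (URLs from one specific domain), a common approach is to dump the whole crawldb with -dump and then filter the resulting text file. A rough sketch, assuming the crawldb lives at crawl/crawldb and that the target domain is example.com (both are illustrative names, not from this thread):

```shell
# Dump the crawldb to plain text; each record starts with its URL
# (crawl/crawldb and crawldb-dump are example paths)
bin/nutch readdb crawl/crawldb -dump crawldb-dump

# Keep only records whose URL is on the domain of interest
grep 'example.com' crawldb-dump/part-00000
```

The dump may be split across several part-* files depending on the Hadoop configuration, so grepping crawldb-dump/part-* is safer on larger installations.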

Is this what you're looking for?

Rgrds, Thomas

On 7/25/06, termopro@gmail.com <te...@gmail.com> wrote:
> Is there any way to find out what web pages on a specific domain have
> been crawled by Nutch ?
> In other words is there any way to get the list of urls that were
> downloaded and processed by Nutch ?
>
>