Posted to user@nutch.apache.org by Paul Tomblin <pt...@xcski.com> on 2009/08/07 03:14:11 UTC

Print out a list of every URL fetched?

If I want to print out a list of every URL as it's fetched, or better
yet write that list to a file, is there a good plugin interface to
implement?  I'm guessing URLFilter isn't the best choice, because it
might see URLs that don't actually get fetched, as well as ones that
return 304, 4xx or 5xx response codes.  Ideally, it should only print
the ones that are being re-indexed.

-- 
http://www.linkedin.com/in/paultomblin

Re: Print out a list of every URL fetched?

Posted by Paul Tomblin <pt...@xcski.com>.
Not quite what I want - that will show me every URL that's ever been
crawled, not just the ones fetched in this run, and it isn't real-time.
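
One way to get the real-time part - assuming the fetcher logs each URL
at INFO level, as Nutch 1.x typically does (the exact message text may
vary by version) - would be to watch the Hadoop log while the crawl runs:

 tail -f $NUTCH_HOME/logs/hadoop.log | grep fetching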


On Fri, Aug 7, 2009 at 3:23 AM, Sebastian Nagel <se...@exorbyte.com> wrote:
> Hi Paul,
>
> you can use
>
>  $NUTCH_HOME/bin/nutch readdb my_crawl/crawldb/ -dump dump_crawldb/ -format csv
>
> then in dump_crawldb you'll find a CSV file with all URLs in your crawlDb.
> One column indicates the status. Select only those records with "db_fetched"
> and you'll have your list.
>
> Sebastian
>



-- 
http://www.linkedin.com/in/paultomblin

Re: Print out a list of every URL fetched?

Posted by Sebastian Nagel <se...@exorbyte.com>.
Hi Paul,

you can use

 $NUTCH_HOME/bin/nutch readdb my_crawl/crawldb/ -dump dump_crawldb/ -format csv

then in dump_crawldb you'll find a CSV file with all URLs in your crawlDb.
One column indicates the status. Select only those records with "db_fetched"
and you'll have your list.
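
For example, assuming the dump lands in Hadoop-style part-* files (a
rough sketch - the exact CSV column layout varies between Nutch versions):

 # keep only the records whose status column says db_fetched
 cat dump_crawldb/part-* | grep db_fetched > fetched_urls.csv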

Sebastian