Posted to user@nutch.apache.org by Manoj Bist <ma...@gmail.com> on 2008/01/16 08:55:22 UTC

Need pointers on accessing crawled data, plugins, etc.

Hi,

I would really appreciate it if someone could provide pointers on how to do the
following (via plugins or otherwise). I have already gone through the Plugin
Central page on the Nutch wiki.

1.)  Is it possible to control the 'policy' that decides how soon a URL is
re-fetched? For example, if a document does not change frequently, I would like
to fetch it less frequently. The configuration sketch below shows the kind of
control I have in mind.

2.) Is there an iterator over all the fetched URLs? I want to do operations
like the following (a rough sketch follows after this list):

   2.1    for each url in crawled urls {
              get the content of the url and process it
          }

   2.2    Access the content of a URL directly, i.e. given a URL 'u',
          retrieve the content of 'u'.
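
To illustrate 2.1 and 2.2, here is a rough, untested sketch of what I would
like to be able to write. It assumes the per-segment 'content' data can be read
with Hadoop's SequenceFile/MapFile readers and holds
org.apache.nutch.protocol.Content values keyed by URL; the segment path, the
example URL, and the processContent() helper are placeholders of mine, not
names from a real setup:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

public class SegmentContentAccess {

  public static void main(String[] args) throws IOException {
    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);

    // Placeholder segment directory.
    String segment = "crawl/segments/20080116123456";

    // 2.1: iterate over every fetched URL in the segment and process it.
    Path data = new Path(segment, "content/part-00000/data");
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
    Text url = new Text();
    Content content = new Content();
    try {
      while (reader.next(url, content)) {
        processContent(url.toString(), content.getContent());
      }
    } finally {
      reader.close();
    }

    // 2.2: direct (random) access to the content of one given URL 'u'.
    MapFile.Reader lookup =
        new MapFile.Reader(fs, segment + "/content/part-00000", conf);
    try {
      Text u = new Text("http://example.com/");
      Content c = new Content();
      if (lookup.get(u, c) != null) {
        processContent(u.toString(), c.getContent());
      }
    } finally {
      lookup.close();
    }
  }

  // Placeholder for whatever per-document processing is needed.
  private static void processContent(String url, byte[] raw) {
    System.out.println(url + " : " + raw.length + " bytes");
  }
}

If the SegmentReader tool (bin/nutch readseg), assuming it exists in my
version, already covers part of this dumping, a pointer to it would be fine
too; I have not tried it yet.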


Thanks,

Manoj.
-- 
Tired of reading blogs? Listen to  your favorite blogs at
http://www.blogbard.com   !!!!