Posted to user@nutch.apache.org by Manoj Bist <ma...@gmail.com> on 2008/01/16 08:55:22 UTC
Need pointers regarding accessing crawled data/plugin etc.
Hi,
I would really appreciate it if someone could provide pointers to doing the
following (via plugins or otherwise). I have already gone through the Plugin
Central page on the Nutch wiki.
1.) Is it possible to control the 'policy' that decides how soon a URL is
re-fetched? For example, if a document does not change frequently, I would
like to fetch it less frequently.
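For context, the only knob I have found so far is the global re-fetch interval in conf/nutch-site.xml. The property name and units below are an assumption based on the 0.9-era nutch-default.xml (db.default.fetch.interval, measured in days); per-URL adaptive scheduling would presumably need custom code on top of this:

```xml
<!-- sketch for conf/nutch-site.xml; property name assumed from Nutch 0.9 -->
<property>
  <name>db.default.fetch.interval</name>
  <!-- number of days before a page is considered due for re-fetching -->
  <value>30</value>
</property>
```

This is a single crawl-wide default, which is exactly why I am asking whether a finer-grained, per-URL policy is possible.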
2.) Is there an iterator over all the fetched URLs? I want to do operations
like the following:
2.1) for each url in crawled urls {
get the content of the url and process it.
}
2.2) Access the content of a URL directly, i.e. given a URL 'u', access
the content of 'u'.
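As a sketch of what I am after, the segment-reading tool shipped with Nutch seems close. The command lines below are illustrative only (the exact flags may differ across versions, and the segment path is a made-up example); they are not runnable without a Nutch installation and crawl data:

```
# 2.1: dump every fetched record (url, crawl data, content) from a segment
bin/nutch readseg -dump crawl/segments/20080116000000 dump_dir

# 2.2: retrieve the stored record for one specific url 'u'
bin/nutch readseg -get crawl/segments/20080116000000 http://www.example.com/
```

What I would like to know is whether there is a supported Java API for doing the same thing programmatically from a plugin.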
Thanks,
Manoj.