Posted to dev@nutch.apache.org by "Rodrigo Reyes C." <ro...@avity.com> on 2009/03/23 20:11:27 UTC

How do I prioritise URLs to be fetched?

Hi all

I am relatively new to Nutch and I am trying to understand how it crawls
websites and, more specifically, how it creates and prioritises its fetch
list. I have a few questions I would like to ask:

   1. What are Nutch's sources of URLs to crawl? I think they are both the
   WebDB and the segments, but I am not sure.
   2. How does Nutch prioritise crawling? By content expiration date only?
   3. Is there some way to affect the order in which Nutch fetches URLs? I've
   been reading the Generator class but haven't found an extension point for
   this.

Thanks in advance...

Rodrigo