Posted to commits@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2007/10/25 00:54:57 UTC

[Nutch Wiki] Update of "FAQ" by robotgenius

The following page has been changed by robotgenius:
http://wiki.apache.org/nutch/FAQ

------------------------------------------------------------------------------
  
  Please have a look at PrefixURLFilter.
  Adding a few regular expressions to urlfilter.regex.file might work, but a list of thousands of regular expressions would slow your system down excessively.
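  A minimal sketch of the prefix approach, assuming the urlfilter-prefix plugin is enabled in plugin.includes and reads the default conf/prefix-urlfilter.txt (names and entries below are illustrative; adjust them to your setup). The file lists one allowed URL prefix per line, and URLs matching none of them are rejected:
  {{{
# conf/prefix-urlfilter.txt (illustrative entries)
# A URL must start with one of these prefixes to pass the filter.
http://www.example.org/
http://docs.example.org/manuals/
  }}}
  Prefix matching is a cheap per-URL string lookup, so even a long list of prefixes stays fast where thousands of regular expressions would not.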
+ 
+ Alternatively, you can set db.ignore.external.links to "true" and inject seeds from the domains you wish to crawl (these seeds must link, directly or indirectly, to all pages you want fetched).  The crawl will then stay within those domains and never follow external links off-site.  Unfortunately there is no way to record the external links it encounters for later processing, although a very small patch to the generator code can log them to hadoop.log.
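+ A sketch of that setup, assuming the standard conf/nutch-site.xml override (the crawldb path and seed directory below are illustrative):
+ {{{
+ <!-- conf/nutch-site.xml -->
+ <property>
+   <name>db.ignore.external.links</name>
+   <value>true</value>
+ </property>
+ }}}
+ Seeds are then injected as usual, e.g. {{{bin/nutch inject crawl/crawldb urls}}}, where the urls directory holds one seed URL per line for each domain you want crawled.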
  
  ==== How can I recover an aborted fetch process? ====