Posted to commits@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2005/07/18 19:29:15 UTC

[Nutch Wiki] Update of "DissectingTheNutchCrawler" by ErikHatcher

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The following page has been changed by ErikHatcher:
http://wiki.apache.org/nutch/DissectingTheNutchCrawler

The comment on the change is:
spelling correction

------------------------------------------------------------------------------
   1. URLFilter interface. By default, the class {{{net.nutch.net.RegexURLFilter}}} is used, which reads regular expression patterns from regex-urlfilter.txt. So, you can: 
     *  Edit that file to tune its behavior
     *  Or, write a new class that implements {{{net.nutch.net.URLFilter}}}, and change nutch-site.xml to use it. 
-  1. Protocol interface. To add support for a new protocol, write or add a plugin to the "plugins" directory. To change protocol behavior, modify the approprite plugin. 
+  1. Protocol interface. To add support for a new protocol, write or add a plugin to the "plugins" directory. To change protocol behavior, modify the appropriate plugin. 
   1. Parser interface. As for Protocol, you should add/create a plugin for any new content-types. Otherwise, you will need to replace the appropriate plugin if you want to modify its behavior. 
   1. If you need to make other changes, refer to our discussion of '''Fetcher''' and '''FetchListTool'''. Consider subclassing these classes, overriding the appropriate method, then calling your class from the "nutch" script using the full class path.
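
As an illustration of the second URLFilter option above (writing a new class instead of editing regex-urlfilter.txt), here is a minimal sketch. It makes several assumptions: the {{{URLFilter}}} interface declared below is a hypothetical stand-in for {{{net.nutch.net.URLFilter}}} so that the example compiles on its own, and its contract (return the URL to accept it, null to reject it) is assumed rather than taken from this page. {{{SuffixURLFilter}}} and its suffix list are invented for the example. Check the real interface in your Nutch version before adapting it.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for net.nutch.net.URLFilter so this sketch is self-contained.
// Assumed contract: return the URL to accept it, or null to reject it.
interface URLFilter {
    String filter(String url);
}

// Illustrative filter (not part of Nutch): rejects URLs that end in a blocked suffix.
class SuffixURLFilter implements URLFilter {
    private final List<String> blockedSuffixes;

    SuffixURLFilter(List<String> blockedSuffixes) {
        this.blockedSuffixes = blockedSuffixes;
    }

    @Override
    public String filter(String url) {
        if (url == null) {
            return null; // nothing to fetch
        }
        // Compare case-insensitively so ".GIF" and ".gif" are treated alike.
        String lower = url.toLowerCase(Locale.ROOT);
        for (String suffix : blockedSuffixes) {
            if (lower.endsWith(suffix)) {
                return null; // rejected
            }
        }
        return url; // accepted unchanged
    }
}

public class SuffixURLFilterDemo {
    public static void main(String[] args) {
        URLFilter filter = new SuffixURLFilter(Arrays.asList(".gif", ".jpg"));
        System.out.println(filter.filter("http://example.org/index.html"));
        System.out.println(filter.filter("http://example.org/logo.GIF"));
    }
}
```

A real replacement would implement the Nutch interface directly and be registered through nutch-site.xml as described in the list above. The suffix check only shows the accept/reject shape and is not a substitute for the regular-expression filter.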