Posted to user@nutch.apache.org by Ned Rockson <nr...@stanford.edu> on 2007/09/13 23:13:34 UTC

Question about filters

I'm unclear on what I need to set in nutch-site.xml to make sure
the correct URL filters are applied.  I want to apply the regex,
prefix and suffix filters, so I have this in my nutch-site.xml:

<property>
  <name>urlfilter.order</name>
  <value>org.apache.nutch.urlfilter.prefix.PrefixURLFilter
org.apache.nutch.urlfilter.suffix.SuffixURLFilter
org.apache.nutch.urlfilter.regex.RegexURLFilter</value>
  <description>The order by which url filters are applied.
  If empty, all available url filters (as dictated by properties
  plugin-includes and plugin-excludes above) are loaded and applied in system
  defined order. If not empty, only named filters are loaded and applied
  in given order. For example, if this property has value:
  org.apache.nutch.urlfilter.regex.RegexURLFilter org.apache.nutch.urlfilter.prefix.PrefixURLFilter
  then RegexURLFilter is applied first, and PrefixURLFilter second.
  Since all filters are AND'ed, filter ordering does not have impact
  on end result, but it may have performance implication, depending
  on relative expensiveness of filters.
  </description>
</property>
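
To check my reading of the "AND'ed" sentence in that description, here is a minimal Java sketch. It assumes each filter returns the URL when it accepts it and null when it rejects it. The class FilterChainSketch, the two lambda filters and the example.org URLs are hypothetical stand-ins I made up for illustration, not the actual Nutch classes.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical stand-in for a URL filter chain; NOT the actual Nutch classes.
class FilterChainSketch {
    // Assumed contract: a filter returns the URL to accept it, or null to reject it.
    static final UnaryOperator<String> PREFIX =
        u -> u.startsWith("http://example.org/") ? u : null;
    static final UnaryOperator<String> SUFFIX =
        u -> u.endsWith(".gif") ? null : u;

    // Applies the filters in the given order and stops at the first rejection.
    static String applyAll(List<UnaryOperator<String>> filters, String url) {
        for (UnaryOperator<String> filter : filters) {
            if (url == null) {
                break; // an earlier filter rejected the URL, skip the rest
            }
            url = filter.apply(url);
        }
        return url;
    }

    public static void main(String[] args) {
        String page = "http://example.org/index.html";
        String image = "http://example.org/logo.gif";

        // The page passes both filters in either order.
        System.out.println(applyAll(Arrays.asList(PREFIX, SUFFIX), page));
        System.out.println(applyAll(Arrays.asList(SUFFIX, PREFIX), page));

        // The image is rejected in either order; only the amount of work differs.
        System.out.println(applyAll(Arrays.asList(PREFIX, SUFFIX), image));
        System.out.println(applyAll(Arrays.asList(SUFFIX, PREFIX), image));
    }
}
```

If that reading is right, the order in urlfilter.order only decides how soon a rejected URL drops out of the chain, not whether it is rejected.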

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-(regex|prefix|suffix)|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need to at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>
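
As a sanity check on the plugin.includes value, here is a small Java sketch of how I understand the description: the expression is tested against each plugin directory name. It assumes the whole name has to match, which is my guess rather than something I have confirmed in the Nutch source. The class PluginIncludesSketch and its helper are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper; assumes the whole directory name must match the expression.
class PluginIncludesSketch {
    // The plugin.includes value from my nutch-site.xml.
    static final String INCLUDES =
        "protocol-http|urlfilter-(regex|prefix|suffix)|parse-(text|html|js)"
        + "|index-basic|query-(basic|site|url)|summary-basic|scoring-opic"
        + "|urlnormalizer-(pass|regex|basic)";

    // Returns true when the plugin directory name fully matches the expression.
    static boolean included(String includesRegex, String pluginDir) {
        return Pattern.compile(includesRegex).matcher(pluginDir).matches();
    }

    public static void main(String[] args) {
        String[] dirs = {
            "urlfilter-regex", "urlfilter-prefix", "urlfilter-suffix",
            "urlfilter-automaton", "protocol-httpclient"
        };
        for (String dir : dirs) {
            System.out.println(dir + " -> " + included(INCLUDES, dir));
        }
    }
}
```

Under that assumption the three urlfilter plugins I want are included, while urlfilter-automaton and protocol-httpclient are not.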

This configuration seems strange, though, because there are also
crawl-urlfilter.txt and automaton-urlfilter.txt, so how is the file to
use chosen at runtime?  Also, why do I have to give the whole path (the
fully qualified class name) in urlfilter.order but only the plugin name
in plugin.includes?

Thanks in advance,
Ned