Posted to user@nutch.apache.org by Ned Rockson <nr...@stanford.edu> on 2007/09/13 23:13:34 UTC
Question about filters
I'm very unclear on what I need to set in nutch-site.xml to make sure
the correct filters are applied. Essentially, I want to apply regex,
prefix and suffix filters, so I have this in my nutch-site.xml:
<property>
<name>urlfilter.order</name>
<value>org.apache.nutch.urlfilter.prefix.PrefixURLFilter
org.apache.nutch.urlfilter.suffix.SuffixURLFilter
org.apache.nutch.urlfilter.regex.RegexURLFilter</value>
<description>The order by which url filters are applied.
If empty, all available url filters (as dictated by properties
plugin-includes and plugin-excludes above) are loaded and applied in system
defined order. If not empty, only named filters are loaded and applied
in given order. For example, if this property has value:
org.apache.nutch.urlfilter.regex.RegexURLFilter
org.apache.nutch.urlfilter.prefix.PrefixURLFilter
then RegexURLFilter is applied first, and PrefixURLFilter second.
Since all filters are AND'ed, filter ordering does not have impact
on end result, but it may have performance implication, depending
on relative expensiveness of filters.
</description>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-(regex|prefix|suffix)|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need to at least include the nutch-extensionpoints plugin. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins. In order to use HTTPS please enable
protocol-httpclient, but be aware of possible intermittent problems with the
underlying commons-httpclient library.
</description>
</property>
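The urlfilter.order description says all filters are AND'ed, so a URL survives only if every listed filter accepts it, and the order changes the cost but not the outcome. A minimal sketch of that behaviour in plain Java follows. It is not Nutch's code: the UrlFilter interface, the three sample filters and the example.org URLs are hypothetical stand-ins, assuming only the convention that a filter returns the URL to accept it and null to reject it.

```java
import java.util.Arrays;
import java.util.List;

public class FilterChainSketch {
    // Hypothetical stand-in for a URL filter:
    // return the URL to accept it, null to reject it.
    interface UrlFilter {
        String filter(String url);
    }

    // Illustrative rules only; real rules would come from the filter config files.
    static final UrlFilter PREFIX = u -> u.startsWith("http://example.org/") ? u : null;
    static final UrlFilter SUFFIX = u -> u.endsWith(".gif") ? null : u;
    static final UrlFilter REGEX = u -> u.matches(".*[?*!@=].*") ? null : u;

    // Apply the filters in the given order; stop at the first rejection.
    static String applyAll(List<UrlFilter> filters, String url) {
        for (UrlFilter f : filters) {
            if (url == null) {
                return null;
            }
            url = f.filter(url);
        }
        return url;
    }

    public static void main(String[] args) {
        List<UrlFilter> forward = Arrays.asList(PREFIX, SUFFIX, REGEX);
        List<UrlFilter> reversed = Arrays.asList(REGEX, SUFFIX, PREFIX);
        String[] urls = {
            "http://example.org/page.html",
            "http://example.org/logo.gif",
            "http://example.org/search?q=1",
            "http://other.example/page.html"
        };
        for (String url : urls) {
            // Both orders agree on accept/reject; only the work done differs.
            System.out.println(url + " -> " + applyAll(forward, url)
                + " / " + applyAll(reversed, url));
        }
    }
}
```

In this sketch a cheap filter placed first rejects most URLs before a more expensive one runs, which is the performance point the description makes.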
This seems strange, though, because there are also crawl-urlfilter.txt and
automaton-urlfilter.txt, so how is the filter file chosen at runtime? Also,
why do I have to give the fully qualified class name in urlfilter.order but
only the plugin directory name in plugin.includes?
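Going by its description, plugin.includes is a regular expression tested against plugin directory names, which would explain why it takes short names rather than class names. A minimal sketch of that reading, assuming the expression must match the whole directory name (an assumption, not confirmed by the description); the class and method names are hypothetical and this is not Nutch's code:

```java
import java.util.regex.Pattern;

public class PluginIncludesSketch {
    // The plugin.includes value from the nutch-site.xml above.
    static final Pattern INCLUDES = Pattern.compile(
        "protocol-http|urlfilter-(regex|prefix|suffix)|parse-(text|html|js)"
        + "|index-basic|query-(basic|site|url)|summary-basic|scoring-opic"
        + "|urlnormalizer-(pass|regex|basic)");

    // Assumption: a plugin is included only if the whole directory name matches.
    static boolean isIncluded(String pluginDir) {
        return INCLUDES.matcher(pluginDir).matches();
    }

    public static void main(String[] args) {
        String[] dirs = {
            "urlfilter-regex", "urlfilter-prefix", "urlfilter-suffix",
            "urlfilter-automaton", "protocol-httpclient"
        };
        for (String dir : dirs) {
            System.out.println(dir + " included: " + isIncluded(dir));
        }
    }
}
```

Under that assumption, urlfilter-automaton is not matched by the expression above, so automaton-urlfilter.txt would only matter if that plugin were added to plugin.includes.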
Thanks in advance,
Ned