Posted to user@nutch.apache.org by Eyeris Rodriguez Rueda <er...@uci.cu> on 2015/06/08 15:08:34 UTC
problem with plugin.includes and indexingfilter.order properties
Hello, community.
I am using Nutch 1.9 in local mode and Solr 4.10.
I have implemented a custom indexing filter (I call it index-required) that rejects documents lacking required fields such as title, but I have found that the indexing filters are not executed in the order I want.
I need my custom indexing filter to be executed last.
I have read that if the indexingfilter.order property is empty, the order is defined by the plugin.includes property, but for some reason this is NOT happening (maybe a bug?).
I have tried using indexingfilter.order, but this has another inconvenience: I need to list every indexing filter class, otherwise it is ignored, no matter whether its plugin is present in plugin.includes. For example, language-identifier has its own indexing filter for the language field, and it must be added to indexingfilter.order or the language field is empty in Solr.
Any suggestions will be appreciated.
These are both properties:
<property>
<name>plugin.includes</name>
<value>protocol-(ftp|http|httpclient)|urlfilter-regex|parse-(html|tika|metatags)|mimetype-filter|index-(basic|anchor|more|metadata|required)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|microformats-customtag|language-identifier|links-extractor|zip-language-identifier</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins. In order to use HTTPS please enable
protocol-httpclient, but be aware of possible intermittent problems with the
underlying commons-httpclient library.
</description>
</property>
<property>
<name>indexingfilter.order</name>
<value></value>
<description>The order by which index filters are applied.
If empty, all available index filters (as dictated by properties
plugin-includes and plugin-excludes above) are loaded and applied in system
defined order. If not empty, only named filters are loaded and applied
in given order. For example, if this property has value:
org.apache.nutch.indexer.basic.BasicIndexingFilter org.apache.nutch.indexer.more.MoreIndexingFilter
then BasicIndexingFilter is applied first, and MoreIndexingFilter second.
Filter ordering might have impact on result if one filter depends on output of
another filter.
</description>
</property>
Re: problem with plugin.includes and indexingfilter.order properties
Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi,
> I have read that if the indexingfilter.order property is empty, the order is defined by
> the plugin.includes property, but for some reason this is NOT happening (maybe a bug?).
The property plugin.includes is just a regular expression to filter all installed
plugins against. It cannot define any order. Plugins are loaded in "system defined order",
that means the ordering does not change from run to run, but you shouldn't
expect any specific order.
> I have tried using indexingfilter.order, but this has another inconvenience
> because I need to list every class that indexes
That's right: if the order property is used, you need to add all plugins of the
given type that are used. It is not possible to define a partial ordering.
Sebastian
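A minimal sketch of what this means for the configuration above, as it might appear in nutch-site.xml. The first five class names are assumed to be the ones registered by the stock index-basic, index-anchor, index-more, index-metadata and language-identifier plugins; they should be checked against each plugin's plugin.xml. org.example.indexer.required.RequiredIndexingFilter is a hypothetical name standing in for the custom index-required filter. Any indexing filters provided by the other custom plugins (microformats-customtag, links-extractor, zip-language-identifier) would have to be listed as well, since filters that are not named are not loaded.

```xml
<property>
<name>indexingfilter.order</name>
<!-- Whitespace-separated, fully qualified class names, applied left to right.
     The last entry is a hypothetical class name for the custom filter. -->
<value>org.apache.nutch.indexer.basic.BasicIndexingFilter org.apache.nutch.indexer.anchor.AnchorIndexingFilter org.apache.nutch.indexer.more.MoreIndexingFilter org.apache.nutch.indexer.metadata.MetadataIndexer org.apache.nutch.analysis.lang.LanguageIndexingFilter org.example.indexer.required.RequiredIndexingFilter</value>
</property>
```

With a value like this, the custom filter runs after every other listed filter, so it sees the fields they have already added to the document.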