Posted to user@nutch.apache.org by Eyeris Rodriguez Rueda <er...@uci.cu> on 2015/06/08 15:08:34 UTC

problem with plugin.includes and indexingfilter.order properties

Hello all,
I am using Nutch 1.9 in local mode with Solr 4.10.
I have implemented a custom indexing filter (I call it index-required) that rejects documents lacking required fields such as title, but I have found that the indexing filters are not executed in the order I want.
I need my custom indexing filter to run last.
I have read that if the indexingfilter.order property is empty, the order is defined by the plugin.includes property, but for some reason this is NOT happening (maybe a bug?).
I have also tried setting indexingfilter.order, but that has another drawback: I have to list every indexing filter class there, otherwise it is ignored, no matter whether its plugin is present in plugin.includes. For example, the language identifier has its own indexing filter for the language field, and it must be added to indexingfilter.order or the language field is empty in Solr.

Any suggestions would be appreciated.

These are both properties:


<property>
  <name>plugin.includes</name>
  <value>protocol-(ftp|http|httpclient)|urlfilter-regex|parse-(html|tika|metatags)|mimetype-filter|index-(basic|anchor|more|metadata|required)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|microformats-customtag|language-identifier|links-extractor|zip-language-identifier</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable 
  protocol-httpclient, but be aware of possible intermittent problems with the 
  underlying commons-httpclient library.
  </description>
</property>


<property>
  <name>indexingfilter.order</name>
  <value></value>
  <description>The order by which index filters are applied.
  If empty, all available index filters (as dictated by properties
  plugin-includes and plugin-excludes above) are loaded and applied in system
  defined order. If not empty, only named filters are loaded and applied
  in given order. For example, if this property has value:
  org.apache.nutch.indexer.basic.BasicIndexingFilter org.apache.nutch.indexer.more.MoreIndexingFilter
  then BasicIndexingFilter is applied first, and MoreIndexingFilter second.
  
  Filter ordering might have impact on result if one filter depends on output of
  another filter.
  </description>
</property>


Re: problem with plugin.includes and indexingfilter.order properties

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi,


> I have read that if the indexingfilter.order property is empty, the order is defined by the
> plugin.includes property, but for some reason this is NOT happening (maybe a bug?).

The property plugin.includes is just a regular expression that all installed
plugins are matched against. It cannot define any order. Plugins are loaded in
"system defined order", which means the ordering does not change from run to run,
but you shouldn't expect any specific order.

> I have also tried setting indexingfilter.order, but that has another drawback:
> I have to list every indexing filter class

That's right: if the order property is used, you need to list all plugins of the
given type that are in use. It is not possible to define a partial ordering.
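
As a rough sketch of what that means for the setup above: to run the custom
filter last, indexingfilter.order would list every indexing filter in use, with
the custom one at the end. The class name given for index-required below is
hypothetical (the real one does not appear in this thread). The other class
names are what the stock index-basic, index-anchor, index-more, index-metadata
and language-identifier plugins should register, but please verify them against
each plugin's plugin.xml. Any other custom plugin that provides an indexing
filter would have to be added as well.

```xml
<property>
  <name>indexingfilter.order</name>
  <!-- Space-separated class names, applied in the given order.
       com.example.indexer.required.RequiredIndexingFilter is a hypothetical
       placeholder for the class implemented by the index-required plugin;
       replace it with the actual class. Stock class names are assumed from
       the Nutch 1.x plugins and should be checked in each plugin.xml. -->
  <value>org.apache.nutch.indexer.basic.BasicIndexingFilter org.apache.nutch.indexer.anchor.AnchorIndexingFilter org.apache.nutch.indexer.more.MoreIndexingFilter org.apache.nutch.indexer.metadata.MetadataIndexer org.apache.nutch.analysis.lang.LanguageIndexingFilter com.example.indexer.required.RequiredIndexingFilter</value>
</property>
```

With a value like this, a filter left out of the list is not applied at all,
even if its plugin matches plugin.includes.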

Sebastian
