Posted to user@nutch.apache.org by "Avni, Itamar" <It...@verint.com> on 2009/12/23 10:12:29 UTC

How to make an IndexingFilter plugin work on the same MIME types as an HtmlParseFilter?

Hi all,

We have a plugin that implements HtmlParseFilter. As such, it is held by HtmlParser, one of Nutch's built-in Parser plugins, via an HtmlParseFilters object. (Any object could hold an HtmlParseFilters object, but only HtmlParser does.)
Our HtmlParseFilter plugin's filter method is called through HtmlParser's getParse method.

HtmlParser is configured, in parse-plugins.xml, to work on the following MIME types:

* application/xhtml+xml
* text/html
* text/sgml
* text/xml

So our HtmlParseFilter plugin is applied only to URLs whose content has one of these MIME types.


Our plugin also implements the IndexingFilter interface. As such it is held by IndexerMapReduce (via an IndexingFilters object).
Our IndexingFilter plugin's filter method is called through IndexerMapReduce's reduce method. We use it to add a field of our own to the NutchDocument.


The problem is that our plugin, as an IndexingFilter, runs on every document, regardless of MIME type.
We want it to be applied only to content that it already processed as an HtmlParseFilter.
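
To make the intent concrete, here is a minimal sketch of the kind of check we have in mind. It is plain Java and does not use any Nutch API: the class name MimeGate and its accepts method are hypothetical names made up for illustration, and the MIME list is simply copied from the HtmlParser mapping above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, NOT part of Nutch: decides whether a content type
// is one of the MIME types that parse-plugins.xml maps to HtmlParser.
final class MimeGate {

    // Copied by hand from the list above; an assumption that must be kept
    // in sync with parse-plugins.xml.
    private static final Set<String> HTML_PARSER_TYPES =
        Collections.unmodifiableSet(new HashSet<String>(Arrays.asList(
            "application/xhtml+xml",
            "text/html",
            "text/sgml",
            "text/xml")));

    private MimeGate() {
    }

    // Returns true if the base MIME type (parameters such as
    // "; charset=UTF-8" stripped, case ignored) is in the set.
    static boolean accepts(String contentType) {
        if (contentType == null) {
            return false;
        }
        int semicolon = contentType.indexOf(';');
        String base = (semicolon >= 0)
            ? contentType.substring(0, semicolon)
            : contentType;
        return HTML_PARSER_TYPES.contains(base.trim().toLowerCase(Locale.ROOT));
    }
}
```

The idea would be for the IndexingFilter's filter method to return the document untouched whenever such a check fails. How to obtain the content type reliably at that point, or whether there is a cleaner mechanism than duplicating the MIME list, is the part we are unsure about.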

Any suggestions?

Thanks

Itamar Avni

