Posted to user@nutch.apache.org by Marcus Böhm <wi...@gmx.de> on 2011/01/03 18:40:48 UTC

How to write a plugin to ignore certain parts of an HTML page?

Hello everybody,

I am currently working on a requirement to index a website with the
help of Nutch. It should be possible to exclude certain parts of a
page by marking them in the HTML code (additional markup or a custom
attribute). Now I am wondering where I should start with my
implementation. I started reading the wiki and found the following
possible starting points for writing a custom plugin:

    * Parser
    * HTMLParseFilter
    * IndexingFilter

All of these interfaces sound as if they could work for me. The
documentation of the filter method of the IndexingFilter interface
mentions that it can manipulate a document that should be parsed
(sounds good to me). On the other hand, the Parser interface sounded
reasonable at first, too.
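
To make the requirement concrete, here is a minimal sketch of the
marker idea. It uses only the JDK DOM API and no Nutch classes, so it
says nothing about which extension point is the right one. The
attribute name data-noindex and the class NoIndexStripper are
hypothetical names chosen for illustration, and the sketch assumes the
plugin gets hold of the page as an org.w3c.dom tree.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class NoIndexStripper {

    // Hypothetical marker attribute; any agreed-upon name would do.
    static final String MARKER = "data-noindex";

    /** Removes every element below root that carries the marker attribute. */
    static void strip(Node root) {
        Node child = root.getFirstChild();
        while (child != null) {
            // Remember the sibling first, because child may be removed.
            Node next = child.getNextSibling();
            if (child instanceof Element && ((Element) child).hasAttribute(MARKER)) {
                root.removeChild(child);
            } else {
                strip(child);
            }
            child = next;
        }
    }

    /** Parses well-formed XHTML, strips marked parts, returns the remaining text. */
    static String visibleText(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        strip(doc);
        Element top = doc.getDocumentElement();
        // The root element itself may have been marked and removed.
        return top == null ? "" : top.getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><p>keep me</p>"
                + "<div data-noindex=\"true\"><p>skip me</p></div>"
                + "</body></html>";
        System.out.println(visibleText(page)); // prints "keep me"
    }
}
```

The example input is well-formed XHTML only so that the JDK parser can
read it; real pages would need a lenient HTML parser in front of the
strip step.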

So please tell me whether I am heading in the right direction and which
interface/extension point I should choose.

Thanks for your help in advance!

With kind regards,
Marcus