Posted to user@nutch.apache.org by Kamil Wnuk <ka...@gmail.com> on 2005/09/01 23:03:56 UTC

extension point for omitting page content?

I am looking to create a plugin that removes all text found between
certain comment tags from the content received by a parser, before any
information (such as meta tags and links) is extracted. Since
implementations of the HTMLParseFilter class are applied only after this
information is extracted, it would be useful in the future to have a
similar parse filter extension point that runs before meta tag and link
extraction.
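
To make the idea concrete, here is a minimal sketch of just the omission
step, written as plain Java with no Nutch classes involved. The marker
comments (<!-- noindex --> and <!-- /noindex -->) and the class name
CommentBlockStripper are placeholders I made up for illustration; the
real tags would be whatever the crawled pages use, and how this would be
wired into the parser is exactly the open question.

```java
import java.util.regex.Pattern;

final class CommentBlockStripper {
    // Hypothetical marker comments (an assumption for this sketch).
    // DOTALL lets '.' span line breaks; '.*?' is reluctant, so each
    // opening marker pairs with the nearest closing marker.
    private static final Pattern OMIT = Pattern.compile(
        "<!--\\s*noindex\\s*-->.*?<!--\\s*/noindex\\s*-->",
        Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    /** Returns the HTML with every marked block, markers included, removed. */
    static String strip(String html) {
        return OMIT.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String page = "<p>keep</p>"
            + "<!-- noindex --><a href=\"/skip\">skip</a><!-- /noindex -->"
            + "<p>also keep</p>";
        System.out.println(strip(page)); // prints <p>keep</p><p>also keep</p>
    }
}
```

Because the stripped string no longer contains the marked block, any
links or meta tags inside it would never reach the extraction step, which
is the behavior I am after. An opening marker with no closing marker is
left untouched by this pattern.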

For now, is there a better way to do this than replacing the current
parse-html plugin with one that does the content omission noted above?

Thank you,
Kamil