Posted to user@nutch.apache.org by Gavin Engel <ga...@engel.com> on 2010/06/08 03:37:21 UTC

Blacklisting/whitelisting html elements by name/id/class?

Hi,

What is the best way to give Nutch a whitelist (or blacklist) of HTML
elements, selected by class, name, or id, to include (or exclude) before
the data is inserted into Lucene?

I ask because we want to index pages from sites while leaving out much of
each page, such as the header, menu, and footer.

Thanks for considering,
-Gavin