Posted to dev@nutch.apache.org by Khang Ich <kh...@gmail.com> on 2010/10/09 03:33:14 UTC

HTMLTag on Nutch Parser

Hi,


I have a simple problem: applying regular expressions while parsing HTML
in the Nutch parser.

For example, while crawling and parsing tons of web pages, I'd like to write
a Nutch plugin that matches some specific pattern, annotates it, and stores
it. As far as I know, Nutch passes an HTMLMetaTags argument to the method
HtmlParseFilter.filter().

My question is: can we also get at other HTML tags, such as span and so on? If
so, which packages/classes should I look into?
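
To make the question concrete, here is a minimal sketch of the kind of
matching I have in mind. It uses only the JDK and assumes the parsed page is
available as a standard org.w3c.dom node tree; whether and how Nutch exposes
such a tree to a plugin is exactly what I am asking. The class name
SpanPatternMatcher, the helper methods, and the sample pattern are all made up
for illustration and are not part of Nutch.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical helper, not a Nutch class: walks a DOM tree and keeps the
// text of every element with a given tag name that matches a regex.
public class SpanPatternMatcher {

    // Depth-first walk over the tree rooted at 'node'.
    static void collect(Node node, String tagName, Pattern pattern, List<String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && tagName.equalsIgnoreCase(node.getNodeName())) {
            String text = node.getTextContent().trim();
            if (pattern.matcher(text).find()) {
                out.add(text);
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, tagName, pattern, out);
        }
    }

    // Parses a well-formed XHTML string with the JDK parser (a stand-in for
    // whatever DOM the crawler would provide) and runs the walk on it.
    static List<String> findMatches(String xhtml, String tagName, String regex) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            List<String> out = new ArrayList<>();
            collect(doc.getDocumentElement(), tagName, Pattern.compile(regex), out);
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><span>price: 42 USD</span>"
                + "<span>no match</span></body></html>";
        System.out.println(findMatches(page, "span", "\\d+ USD"));
    }
}
```

In a real plugin the matched text would presumably be stored with the parse
result rather than printed, but I do not know which classes to use for that.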


Thanks

-- Khang