Posted to dev@nutch.apache.org by Khang Ich <kh...@gmail.com> on 2010/10/09 03:33:14 UTC
HTMLTag on Nutch Parser
Hi,
I have a simple question about applying regular expressions while parsing HTML
in the Nutch parser.
For example, while crawling and parsing a ton of web pages, I'd like to write
a Nutch plugin that matches specific patterns, annotates them, and stores
them. As far as I know, Nutch passes an HTMLMetaTag argument to the method
HtmlParseFilter.filter().
My question is: can we also access other HTML tags, such as span? If so,
which packages/classes should I look into?
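To illustrate the kind of matching I mean, here is a minimal sketch that uses only the JDK (org.w3c.dom and java.util.regex), not Nutch's own API. It assumes the page is available as a well-formed DOM tree; the class name SpanPatternSketch, the method names, and the sample pattern are all hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch: collect regex matches from the text of <span> elements.
public class SpanPatternSketch {

    // Walk the tree depth-first. When a <span> is found, match against its
    // full text content and stop descending, so text inside nested spans is
    // matched once (as part of the outer span), not twice.
    static void collect(Node node, Pattern pattern, List<String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "span".equalsIgnoreCase(node.getNodeName())) {
            Matcher m = pattern.matcher(node.getTextContent());
            while (m.find()) {
                out.add(m.group());
            }
            return;
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), pattern, out);
        }
    }

    // Parse a well-formed (X)HTML string and return all matches found in spans.
    // Assumption: the input is well-formed XML; real-world HTML usually is not.
    static List<String> findMatches(String xhtml, String regex) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            List<String> out = new ArrayList<>();
            collect(doc.getDocumentElement(), Pattern.compile(regex), out);
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><p>Call <span class=\"tel\">555-1234</span>"
                + " or <span>n/a</span></p></body></html>";
        // Sample pattern: three digits, a hyphen, four digits.
        System.out.println(findMatches(page, "\\d{3}-\\d{4}"));
    }
}
```

In a real plugin the DOM would come from whatever the parser hands to the filter rather than from a string, and the matches would be stored as annotations instead of printed.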
Thanks
-- Khang