Posted to dev@nutch.apache.org by Sagar Vibhute <sa...@gmail.com> on 2007/10/16 08:57:28 UTC

Selective/Configurable HTML Parsing?

Hi,

I need some help understanding how the HTML parser works in Nutch. I
have to write a plugin that, while crawling, helps me identify certain
pre-specified words/phrases in the page text.

e.g. I might want to index pages in a specific way if the name Jimi
Hendrix occurs on them.

In such a case, how do I write an extension that lets me check for the
occurrence of a certain word on the page? Meaning, where do I start? I have
read the HTML parser code in the Nutch source files and understood it to
an extent. Is there a text library/dictionary that Nutch uses while it
parses the page content? I read the documentation on the Neko parser, but
am still not able to understand it completely.

- Sagar

Re: Selective/Configurable HTML Parsing?

Posted by Andrzej Bialecki <ab...@getopt.org>.
Sagar Vibhute wrote:
> Hi,
> 
> I need some help understanding how the HTML parser works in Nutch. I
> have to write a plugin that, while crawling, helps me identify certain
> pre-specified words/phrases in the page text.
> 
> e.g. I might want to index pages in a specific way if the name Jimi
> Hendrix occurs on them.
> 
> In such a case, how do I write an extension that lets me check for the
> occurrence of a certain word on the page? Meaning, where do I start? I have
> read the HTML parser code in the Nutch source files and understood it to
> an extent. Is there a text library/dictionary that Nutch uses while it
> parses the page content? I read the documentation on the Neko parser, but
> am still not able to understand it completely.

You should take a look at the HtmlParseFilter interface - this is what
you need to implement as a plugin. The plugin receives the parsed HTML
document, and you can either traverse the document's DOM tree or
analyze the plain text extracted from the document.
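
As a rough illustration of the matching step only, here is a minimal,
standalone sketch that uses nothing but the JDK's org.w3c.dom classes. It
is not Nutch code: the class name PhraseMatcher and its methods are made
up for this example, and the wiring into HtmlParseFilter is left out
because the exact signature of its filter method differs between Nutch
versions. The assumption is that, inside your filter implementation, you
would pass either the DOM fragment or the extracted plain text to
something like this.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper, not part of Nutch: finds pre-specified phrases
// in plain text or in the text nodes of a DOM tree.
public class PhraseMatcher {
    private final List<String> phrases = new ArrayList<>();

    public PhraseMatcher(List<String> phrases) {
        for (String p : phrases) {
            this.phrases.add(normalize(p));
        }
    }

    // Lower-case and collapse runs of whitespace so that "Jimi\n Hendrix"
    // and "jimi hendrix" compare equal.
    static String normalize(String s) {
        return s.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ").trim();
    }

    // Depth-first walk that appends the value of every text node.
    static void collectText(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            out.append(node.getNodeValue()).append(' ');
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collectText(c, out);
        }
    }

    // Returns the (normalized) phrases that occur in the given text.
    // Plain substring matching: it does not respect word boundaries.
    public List<String> matchText(String text) {
        String haystack = normalize(text);
        List<String> found = new ArrayList<>();
        for (String p : phrases) {
            if (!p.isEmpty() && haystack.contains(p)) {
                found.add(p);
            }
        }
        return found;
    }

    // Same check, but on the text gathered from a DOM subtree.
    public List<String> matchDom(Node root) {
        StringBuilder sb = new StringBuilder();
        collectText(root, sb);
        return matchText(sb.toString());
    }

    public static void main(String[] args) throws Exception {
        // Build a tiny fragment by hand; in a plugin the parser supplies it.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        DocumentFragment frag = doc.createDocumentFragment();
        Element p = doc.createElement("p");
        p.appendChild(doc.createTextNode("Jimi\n  Hendrix played here."));
        frag.appendChild(p);

        PhraseMatcher m = new PhraseMatcher(
                Arrays.asList("Jimi Hendrix", "Janis Joplin"));
        System.out.println(m.matchDom(frag));
    }
}
```

Because a space is appended after every text node, a phrase that spans
inline markup (one word in bold, the next not) still matches, but a single
word split by a tag will not. What you do with a match, for example
recording it in the parse metadata so an indexing plugin can use it, is up
to the plugin and is not shown here.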

See also the documentation on the Wiki about how to write Nutch plugins.


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com