Posted to user@nutch.apache.org by Brian Whitman <br...@variogr.am> on 2008/01/17 19:47:02 UTC

largest text block from parse tree?

We do a lot of post-processing on the text that Nutch outputs: extracting
"aboutness," running machine learning and NLP on it, etc.

One problem we always have is that Nutch's full-text output includes text
from every part of the page. For a long essay or a blog post, for example,
you get the text of the post but also all the ads, navigation text,
sidebar material, etc.

Has anyone dealt with this problem? Is there some heuristic I can
apply somewhere in Nutch's parser to either mark or keep only the
largest HTML block of text before it writes the "content" Lucene field?


Re: largest text block from parse tree?

Posted by Andrzej Bialecki <ab...@getopt.org>.

I have implemented this (sorry, closed source).

The overall idea is to divide all possible HTML tags into two groups:
"structural" tags, which define the page structure at large
(e.g. div, table, iframe, form), and "text formatting" tags
(e.g. b, i, span, font).

Then you strip all formatting tags and coalesce their text content into 
larger blocks, until you end up with blocks of plain text divided only 
by structural tags.
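
The coalescing step can be sketched roughly as follows. This is a standalone
Python illustration using the standard-library html.parser, not Nutch code and
not the closed-source implementation mentioned above. The names STRUCTURAL,
SKIPPED, BlockCollector and split_into_blocks are hypothetical, and the tag set
is an assumption: only div, table, iframe and form come from the description,
while tr and td are added for illustration.

```python
from html.parser import HTMLParser

# Assumed "structural" tags. Only div, table, iframe and form are named in
# the description above; tr and td are added here for illustration.
STRUCTURAL = {"div", "table", "tr", "td", "iframe", "form"}
# Assumption: script and style content is not page text and is skipped.
SKIPPED = {"script", "style"}


class BlockCollector(HTMLParser):
    """Collects plain-text blocks separated only by structural tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []      # finished blocks, in document order
        self._buf = []        # text pieces of the block being built
        self._skip_depth = 0  # > 0 while inside script/style

    def _flush(self):
        # Join the buffered pieces and normalise whitespace.
        text = " ".join("".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self._skip_depth += 1
        elif tag in STRUCTURAL:
            self._flush()  # a structural tag ends the current block
        # Any other tag is treated as formatting and simply ignored,
        # so its text merges into the surrounding block.

    def handle_endtag(self, tag):
        if tag in SKIPPED:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in STRUCTURAL:
            self._flush()

    def handle_data(self, data):
        if not self._skip_depth:
            self._buf.append(data)


def split_into_blocks(html):
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    collector._flush()  # pick up any trailing text
    return collector.blocks
```

For example, split_into_blocks('<div><b>Hello</b> <i>world</i></div><div>nav</div>')
yields two blocks, "Hello world" and "nav": the b and i tags disappear and
only the div boundaries remain.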

You record the position of each block in the DOM tree. Then you sort
the blocks by size (measured in characters or in tokens; I prefer the
latter), cut off a percentage of the smallest ones (or those that fall
under a fixed threshold), and use the recorded positions to restore the
text of the remaining blocks in its original order.
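
A minimal Python sketch of the size-based cut, under some assumptions: blocks
arrive as a list of strings in document order, the list index stands in for
the DOM position, and a token is a whitespace-separated word. The function
name keep_largest_blocks and the parameters drop_fraction and min_tokens are
hypothetical, chosen only for this illustration.

```python
def keep_largest_blocks(blocks, drop_fraction=0.5, min_tokens=0):
    """Drop the smallest blocks and return the rest in original order.

    blocks        -- text blocks in document order (the index stands in
                     for the block's position in the DOM tree)
    drop_fraction -- share of the smallest blocks to cut off (assumed knob)
    min_tokens    -- alternative fixed threshold; blocks with fewer
                     tokens are dropped as well (assumed knob)
    """
    # Size in tokens, using whitespace splitting as a crude tokenizer.
    sized = [(len(text.split()), pos) for pos, text in enumerate(blocks)]

    # Sort smallest first; ties are broken by position.
    by_size = sorted(sized)

    # Cut off the requested share of the smallest blocks.
    n_drop = int(len(by_size) * drop_fraction)
    survivors = {pos for size, pos in by_size[n_drop:] if size >= min_tokens}

    # Restore the remaining text in its original document order.
    return "\n".join(blocks[pos] for pos in sorted(survivors))
```

With four blocks of 3, 10, 1 and 8 tokens and drop_fraction=0.5, the two
smallest are cut and the 10-token and 8-token blocks come back in the order
they appeared on the page. Sensible values for either knob depend on the
pages being crawled.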

You can implement it as an HtmlParseFilter.

-- 
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com