Posted to common-user@hadoop.apache.org by Cam Bazz <ca...@gmail.com> on 2008/06/10 13:27:05 UTC

text extraction from html based on uniqueness metric

Hello, (I have posted this in solr as well)

I am indexing newspaper articles as an exercise in Solr. When dealing with
newspaper articles in the past, I always tried to find the div or table that
contains the actual news, using NekoHTML to traverse the DOM tree and pull
the text out of that node. When dealing with many newspapers, it is a hassle
to write custom code for each one to extract the relevant information. There
is usually a lot of garbage in the HTML, from categories to ads, and
furthermore the layouts change, so static, hard-coded extraction is
problematic.

I have been wondering whether I could measure the frequency or uniqueness of
each node and use that to find the news automatically, but I have not come
up with an implementation.
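
One way to make the uniqueness idea concrete, as a rough sketch only and not
something taken from Solr, Lucene or Hadoop: fetch several pages from the
same newspaper, split each page into text blocks, and count on how many
pages each block appears. Blocks repeated across pages (menus, category
lists, footers) are probably boilerplate, and blocks seen on only one page
are article candidates. The names BlockCollector, text_blocks and
unique_blocks, the list of block-level tags, and the 0.5 threshold are all
assumptions made for illustration. It uses Python's standard html.parser
rather than NekoHTML.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumed set of tags that start or end a text block; tune per site.
BLOCK_TAGS = {"div", "td", "p", "li", "h1", "h2", "h3"}
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Collects page text, cutting a new block at every block-level tag."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buf = []
        self._skip = 0  # depth inside <script>/<style>

    def _flush(self):
        # Normalise whitespace and store the block if it is not empty.
        text = " ".join("".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if not self._skip:
            self._buf.append(data)

    def close(self):
        super().close()
        self._flush()


def text_blocks(html):
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    return parser.blocks


def unique_blocks(pages, max_share=0.5):
    """For each page, keep blocks found on at most max_share of all pages."""
    per_page = [text_blocks(p) for p in pages]
    # Document frequency: number of pages on which a block occurs.
    doc_freq = Counter(b for blocks in per_page for b in set(blocks))
    limit = max_share * len(pages)
    return [[b for b in blocks if doc_freq[b] <= limit] for blocks in per_page]
```

Exact string matching is the crudest possible uniqueness measure. A block
that differs only by a date or a counter would still look unique, so a real
version would probably need fuzzier comparison.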

Has anyone done, contemplated, or used something similar? Maybe there is
already a way, using Lucene or even Hadoop.

Otis from the Solr mailing list suggested a NovelAnalyzer from the Lucene
development code. I think Hadoop people might have an idea about this...

Best,

Re: text extraction from html based on uniqueness metric

Posted by Edward Capriolo <ed...@gmail.com>.
I have never tried this method myself. The concept comes from a research
paper I ran into. The goal was to detect the language of a piece of text by
looking at several factors: average word length, average sentence length,
average number of vowels per word, etc. The author used these to score an
article, and it worked well in determining the language of the text.
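
As a minimal sketch of the kind of features described above (not the
paper's actual method, which is not given here), the three averages could
be computed like this. The function name text_features, the
sentence-splitting rule, and the use of plain Latin vowels are assumptions.
How the numbers are weighted into a single score is left open.

```python
import re

# Assumption: only unaccented Latin vowels are counted.
VOWELS = set("aeiouAEIOU")


def text_features(text):
    """Return simple per-text averages, or None if the text has no words."""
    # Naive sentence split on runs of ., ! or ?
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Words are runs of letters (digits and underscores are excluded).
    words = re.findall(r"[^\W\d_]+", text)
    if not words or not sentences:
        return None
    vowel_count = sum(ch in VOWELS for w in words for ch in w)
    return {
        "avg_word_len": sum(len(w) for w in words) / len(words),
        "avg_sentence_len": len(words) / len(sentences),
        "avg_vowels_per_word": vowel_count / len(words),
    }
```

Applied to each candidate block of a page, features like these could feed
a score that separates running prose from menus and link lists, which tend
to have very short "sentences".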

This is a fairly basic kind of program that you might see in Artificial
Intelligence: you create a score and use it to decide which block of text
is the one you are looking for. The answer is not going to be perfect, and
I cannot imagine many out-of-the-box solutions will do exactly what you
need. (Just a guess.)

The one plus about this is that you can take HTML right out of the
equation. I believe the Java HTML tag parsers have some quick 'toText'
method that will dump the text of a web page.
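
To illustrate what such a text dump does, here is a small sketch using
Python's standard html.parser. It is not the Java method mentioned above,
whose exact name I am not sure of. TextDumper and to_text are made-up names
for this example.

```python
from html.parser import HTMLParser


class TextDumper(HTMLParser):
    """Keeps all character data except the contents of script and style."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def to_text(html):
    parser = TextDumper()
    parser.feed(html)
    parser.close()
    # Join the pieces and collapse all runs of whitespace to single spaces.
    return " ".join(" ".join(parser.parts).split())
```

A dump like this keeps the menus and ads along with the article, so it only
removes the markup. Picking out the article text is still up to the
scoring.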

Also, you would think most online newspapers carry a NewsML XML version or
RSS version of their paper.