Posted to user@mahout.apache.org by Marc Hofer <ma...@marc-hofer.de> on 2009/11/28 20:57:49 UTC

TU Berlin Winter of Code Project - II. Layer: Preprocessing

Hello everybody,

having already presented the draft of our architecture, I would now like 
to discuss the second layer in more detail. As mentioned before, we have 
chosen UIMA for this layer. The main aggregate currently consists of the 
Whitespace Tokenizer Annotator, the Snowball Annotator (stemming) and a 
list-based StopwordFilter. Before running this aggregate in a 
map-only Hadoop job, we want to strip all HTML tags and forward only 
the preprocessed text to the aggregate. The reason for this is that it 
is difficult to modify the document during processing in UIMA, and it is 
impractical to work on documents containing HTML tags all the time.
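
To make the intended flow a bit more concrete, here is a rough sketch of 
what the map side could look like (illustrative only: the class name 
HtmlStripMapper is made up, and the regex-based tag removal is a naive 
stand-in for the real HTML filter):

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Sketch of the map-only preprocessing step; names and the regex are
    // placeholders. The UIMA aggregate (whitespace tokenizer, Snowball
    // stemmer, stopword filter) then runs on the emitted text.
    public class HtmlStripMapper extends Mapper<LongWritable, Text, Text, Text> {

      @Override
      protected void map(LongWritable offset, Text rawDocument, Context context)
          throws IOException, InterruptedException {
        // Drop all tags and collapse whitespace -- good enough for a sketch,
        // but not a substitute for a proper HTML filter.
        String plain = rawDocument.toString()
            .replaceAll("<[^>]*>", " ")
            .replaceAll("\\s+", " ")
            .trim();
        context.write(new Text(offset.toString()), new Text(plain));
      }
    }

The driver would simply call job.setNumReduceTasks(0) so the job stays 
map-only.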

Furthermore, we are planning to add the Tagger Annotator, which 
implements a Hidden Markov Model part-of-speech tagger. Here we aren't 
sure which tokens, based on their part-of-speech tags, should be 
discarded and which should be kept for feature extraction. One option 
could be to use only nouns and verbs at the very beginning.
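
As an illustration of that first cut (assuming Penn Treebank style tags, 
where noun tags start with NN and verb tags with VB; the helper class below 
is purely hypothetical):

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical helper: given tokens and their part-of-speech tags
    // (parallel lists, Penn Treebank style), keep only nouns (NN*) and
    // verbs (VB*) as candidate features.
    public final class PosFilter {
      public static List<String> keepNounsAndVerbs(List<String> tokens, List<String> tags) {
        List<String> kept = new ArrayList<String>();
        for (int i = 0; i < tokens.size(); i++) {
          String tag = tags.get(i);
          if (tag.startsWith("NN") || tag.startsWith("VB")) {
            kept.add(tokens.get(i));
          }
        }
        return kept;
      }
    }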

We are very interested in your comments and remarks and it would be nice 
to hear from you.

Cheers,
Marc

Re: TU Berlin Winter of Code Project - II. Layer: Preprocessing

Posted by Ken Krugler <kk...@transpac.com>.
There are two separate issues here - HTML parsing (sometimes called  
cleanup) vs. getting rid of boilerplate content, which is also often  
called HTML cleanup.

TagSoup & NekoHTML are examples of the former - code that "fixes up"  
HTML documents so you can apply standard XML parsing techniques.

The articles originally referenced below, as well as my prior note  
about nCleaner, are talking about the latter - trying to get rid of  
headers, footers, ads, etc.
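
As a bare-bones sketch of the former, using NekoHTML's DOMParser (a real 
extractor would also skip script/style content and normalize whitespace):

    import java.io.StringReader;

    import org.cyberneko.html.parsers.DOMParser;   // NekoHTML
    import org.w3c.dom.Document;
    import org.xml.sax.InputSource;

    // Parse possibly-broken HTML into a DOM and pull out its text content.
    // Deliberately naive: script/style text comes along for the ride.
    public final class NekoTextExtractor {
      public static String extractText(String html) throws Exception {
        DOMParser parser = new DOMParser();
        parser.parse(new InputSource(new StringReader(html)));
        Document doc = parser.getDocument();
        return doc.getDocumentElement().getTextContent();
      }
    }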

-- Ken

On Nov 28, 2009, at 12:30pm, Marc Hofer wrote:

> Hi Drew,
>
> currently we are using an HTML filter module from the University of
> Duisburg-Essen, which can be found here: http://www.is.informatik.uni-duisburg.de/projects/java-unidu/filter.html
>
> Another idea was to try Jericho or NekoHTML.
> http://www.java2s.com/Product/Java/Development/HTML-Parser.htm
>
> Thanks for your advice; we will test it and let you know whether it
> works well.
>
> Marc
>
> Drew Farris schrieb:
>> Hi Marc,
>> How are you planning on cleaning up the HTML documents?
>> Perhaps something like this would be useful: I came across an
>> interesting approach a few days ago, and it would be interesting to hear
>> more from someone who has tried something like this:
>> http://ai-depot.com/articles/the-easy-way-to-extract-useful-text-from-arbitrary-html/
>> Described further, with Java implementations here:
>> http://sujitpal.blogspot.com/2009/11/extracting-useful-text-from-html.html
>> Drew
>> On Sat, Nov 28, 2009 at 2:57 PM, Marc Hofer <ma...@marc-hofer.de>  
>> wrote:
>>> ...
>

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g





Re: TU Berlin Winter of Code Project - II. Layer: Preprocessing

Posted by Marc Hofer <ma...@marc-hofer.de>.
Hi Drew,

currently we are using an HTML filter module from the University of 
Duisburg-Essen, which can be found here: 
http://www.is.informatik.uni-duisburg.de/projects/java-unidu/filter.html

Another idea was to try Jericho or NekoHTML.
http://www.java2s.com/Product/Java/Development/HTML-Parser.htm

Thanks for your advice; we will test it and let you know whether it 
works well.

Marc

Drew Farris schrieb:
> Hi Marc,
> 
> How are you planning on cleaning up the HTML documents?
> 
> Perhaps something like this would be useful: I came across an
> interesting approach a few days ago, and it would be interesting to hear
> more from someone who has tried something like this:
> http://ai-depot.com/articles/the-easy-way-to-extract-useful-text-from-arbitrary-html/
> 
> Described further, with Java implementations here:
> http://sujitpal.blogspot.com/2009/11/extracting-useful-text-from-html.html
> 
> Drew
> 
> On Sat, Nov 28, 2009 at 2:57 PM, Marc Hofer <ma...@marc-hofer.de> wrote:
>> ...
> 
> 


Re: TU Berlin Winter of Code Project - II. Layer: Preprocessing

Posted by Drew Farris <dr...@gmail.com>.
Hi Marc,

How are you planning on cleaning up the HTML documents?

Perhaps something like this would be useful: I came across an
interesting approach a few days ago, and it would be interesting to hear
more from someone who has tried something like this:
http://ai-depot.com/articles/the-easy-way-to-extract-useful-text-from-arbitrary-html/

Described further, with Java implementations here:
http://sujitpal.blogspot.com/2009/11/extracting-useful-text-from-html.html
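
Very roughly, the idea is to score each chunk of the raw page by how much
plain text it carries relative to the markup around it and to keep only the
text-dense chunks. A toy sketch of that idea (not either article's algorithm
verbatim, and the threshold is arbitrary):

    import java.util.ArrayList;
    import java.util.List;

    // Toy "text-to-tag ratio" filter: keep lines of the raw HTML whose
    // visible text is large relative to the surrounding markup.
    // Scoring and threshold are illustrative only.
    public final class TextDensityFilter {
      public static List<String> denseLines(String html, double minRatio) {
        List<String> kept = new ArrayList<String>();
        for (String line : html.split("\n")) {
          String text = line.replaceAll("<[^>]*>", " ").trim();
          if (text.isEmpty()) {
            continue;                     // pure markup or blank line
          }
          int markupChars = line.length() - text.length();
          double ratio = (double) text.length() / (text.length() + markupChars);
          if (ratio >= minRatio) {
            kept.add(text);               // likely content rather than boilerplate
          }
        }
        return kept;
      }
    }

The hope being that navigation and ad blocks score low while article
paragraphs survive.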

Drew

On Sat, Nov 28, 2009 at 2:57 PM, Marc Hofer <ma...@marc-hofer.de> wrote:
> ...

Re: TU Berlin Winter of Code Project - II. Layer: Preprocessing

Posted by Marc Hofer <ma...@marc-hofer.de>.
> Hi guys,
>   
Hi Julien,
> Why not use Behemoth to deploy your UIMA application on Hadoop? (
> http://code.google.com/p/behemoth-pebble/)
>   
Behemoth uses HDFS for input and output. So far we have integrated Heritrix 
in combination with the HBase writer 
( http://code.google.com/p/hbase-writer/ ) and we focus our whole 
architecture on HBase. It would be nice if Behemoth supported HBase in 
the future.
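
For example, on our side a preprocessing job could read the crawled pages
directly out of HBase with something like the sketch below (the table, column
family and qualifier names are placeholders, not the ones hbase-writer
actually uses):

    import java.io.IOException;

    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.Text;

    // Sketch: pull raw page content out of HBase in a map-only job.
    // "content"/"raw" are placeholder family/qualifier names.
    public class PageMapper extends TableMapper<Text, Text> {

      @Override
      protected void map(ImmutableBytesWritable row, Result columns, Context context)
          throws IOException, InterruptedException {
        byte[] raw = columns.getValue(Bytes.toBytes("content"), Bytes.toBytes("raw"));
        if (raw != null) {
          // row key (e.g. the URL) -> raw HTML, to be stripped and handed
          // to the UIMA aggregate afterwards
          context.write(new Text(Bytes.toString(row.get())),
                        new Text(Bytes.toString(raw)));
        }
      }
    }

The job itself would be wired up with TableMapReduceUtil.initTableMapperJob(...).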

> Behemoth is meant to do exactly what you described and already has an
> adapter for Nutch & WARC archives. It can take a UIMA PEAR, deploy it on a
> Hadoop cluster, and extract some of the UIMA-generated annotations + store
> them in a neutral format which could then be used to generate vectors for
> Mahout. The purpose of Behemoth is to facilitate the deployment of NLP
> components for large scale processing and act as a bridge between common
> inputs (e.g. Nutch, WARC) and other projects (Mahout, Tika) etc...
>   
As far as facilitating the deployment of NLP components is concerned, you 
are perfectly right.
> If we had a mechanism for generating Mahout vectors from Behemoth
> annotations we would be able to use other NLP frameworks such as GATE as
> well. Doing something like this is on the roadmap for Behemoth anyway but it
> sounds like what you are planning to do would be a perfect match.
>
> Any thoughts on this?
>
> Julien
>
>   

Marc

Re: TU Berlin Winter of Code Project - II. Layer: Preprocessing

Posted by Julien Nioche <li...@gmail.com>.
Hi Ted,

Behemoth is at a very early stage so there is plenty of room for
improvement. That sounds like something I'd like to investigate a bit more
closely, and it could be a minor thing to fix (i.e. quicker than implementing
something from scratch). Would you mind sending me that guy's email address
so that I could get in touch with him?

J.

2009/11/30 Ted Dunning <te...@gmail.com>

> A friend of mine just evaluated Behemoth and found that it imposed very
> large memory overheads relative to just using Gate directly.  He was unable
> to find a way to make use of Behemoth and has opted to implement his own
> distributed Gate architecture.
>
> On Mon, Nov 30, 2009 at 3:23 AM, Julien Nioche <
> lists.digitalpebble@gmail.com> wrote:
>
> > Why not use Behemoth to deploy your UIMA application on Hadoop? (
> > http://code.google.com/p/behemoth-pebble/)
> >
> > ...
> >
> > Any thoughts on this?
> >
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>



-- 
DigitalPebble Ltd
http://www.digitalpebble.com

Re: TU Berlin Winter of Code Project - II. Layer: Preprocessing

Posted by Julien Nioche <li...@gmail.com>.
Just to clarify: I had an exchange with Ted's friend Mark Davis, who has not
used Behemoth yet but is planning to do so. There seems to have been a bit
of misunderstanding between Ted and Mark. Anyway: there is no known memory
issue with Behemoth (yet) and no reason not to give it a try ;-)
On the plus side, I've been thinking about what could have caused such a
memory issue and have improved the design of the serialization in Behemoth
today.

Julien

2009/11/30 Ted Dunning <te...@gmail.com>

> A friend of mine just evaluated Behemoth and found that it imposed very
> large memory overheads relative to just using Gate directly.  He was unable
> to find a way to make use of Behemoth and has opted to implement his own
> distributed Gate architecture.
>
> On Mon, Nov 30, 2009 at 3:23 AM, Julien Nioche <
> lists.digitalpebble@gmail.com> wrote:
>
> > Why not use Behemoth to deploy your UIMA application on Hadoop? (
> > http://code.google.com/p/behemoth-pebble/)
> >
> > ...
> >
> > Any thoughts on this?
> >
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>



-- 
DigitalPebble Ltd
http://www.digitalpebble.com

Re: TU Berlin Winter of Code Project - II. Layer: Preprocessing

Posted by Ted Dunning <te...@gmail.com>.
A friend of mine just evaluated Behemoth and found that it imposed very
large memory overheads relative to just using Gate directly.  He was unable
to find a way to make use of Behemoth and has opted to implement his own
distributed Gate architecture.

On Mon, Nov 30, 2009 at 3:23 AM, Julien Nioche <
lists.digitalpebble@gmail.com> wrote:

> Why not use Behemoth to deploy your UIMA application on Hadoop? (
> http://code.google.com/p/behemoth-pebble/)
>
> ...
>
> Any thoughts on this?
>



-- 
Ted Dunning, CTO
DeepDyve

Re: TU Berlin Winter of Code Project - II. Layer: Preprocessing

Posted by Julien Nioche <li...@gmail.com>.
Hi guys,

Why not use Behemoth to deploy your UIMA application on Hadoop? (
http://code.google.com/p/behemoth-pebble/)

Behemoth is meant to do exactly what you described and already has an
adapter for Nutch & WARC archives. It can take a UIMA PEAR, deploy it on a
Hadoop cluster, and extract some of the UIMA-generated annotations + store
them in a neutral format which could then be used to generate vectors for
Mahout. The purpose of Behemoth is to facilitate the deployment of NLP
components for large scale processing and act as a bridge between common
inputs (e.g. Nutch, WARC) and other projects (Mahout, Tika) etc...

If we had a mechanism for generating Mahout vectors from Behemoth
annotations we would be able to use other NLP frameworks such as GATE as
well. Doing something like this is on the roadmap for Behemoth anyway but it
sounds like what you are planning to do would be a perfect match.
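
Just to sketch what that step could look like (illustrative only: the
dictionary handling is deliberately simplistic, the class name is made up,
and you would use whichever sparse vector class the Mahout version at hand
provides):

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;

    // Toy sketch: turn the token annotations extracted from one document
    // (by UIMA, GATE, ...) into a sparse term-frequency vector for Mahout.
    public final class AnnotationVectorizer {
      private final Map<String, Integer> dictionary = new HashMap<String, Integer>();
      private final int cardinality;

      public AnnotationVectorizer(int cardinality) {
        this.cardinality = cardinality;
      }

      public Vector vectorize(List<String> tokens) {
        Vector v = new RandomAccessSparseVector(cardinality);
        for (String token : tokens) {
          Integer index = dictionary.get(token);
          if (index == null) {
            index = dictionary.size();
            dictionary.put(token, index);
          }
          if (index < cardinality) {          // ignore overflow in this toy version
            v.set(index, v.get(index) + 1);   // simple term frequency
          }
        }
        return v;
      }
    }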

Any thoughts on this?

Julien

-- 
DigitalPebble Ltd
http://www.digitalpebble.com

2009/11/28 Marc Hofer <ma...@marc-hofer.de>

> ...