Posted to dev@mahout.apache.org by Benson Margulies <bi...@gmail.com> on 2010/11/03 22:51:57 UTC

Lib to pull text from web pages?

I've just spent a few days turning
http://lab.arc90.com/experiments/readability/ into Java. I'll push it
to github at some point unless Mahout is ready to branch out in this
direction.

Re: Lib to pull text from web pages?

Posted by Grant Ingersoll <gs...@apache.org>.

Sounds useful to me.

On Nov 11, 2010, at 3:37 PM, Isabel Drost wrote:

> On 09.11.2010 Grant Ingersoll wrote:
>> Hi Benson, I looked at the site and it seemed interesting, but didn't dig
>> any deeper.  Can you give a paragraph on what it does?
> 
> If I understood Benson correctly when we were talking at ApacheCon, the library
> is meant to provide a means to remove clutter (including navigational content
> etc.) from web pages. The original intention was to run it as a clean-up step
> before displaying the web page in browsers. However, given that any text
> processing usually needs to clean up web pages and extract the relevant
> content as a very first step, the idea was to abuse the methods for automated
> text extraction as well.
> 
> As a side note - a project with similar goals was mentioned on the Lucene 
> mailing lists a while ago: http://code.google.com/p/boilerpipe/
> 
> Cheers,
> Isabel

Re: Lib to pull text from web pages?

Posted by Benson Margulies <bi...@gmail.com>.

Tika has boilerpipe, which is not bad for web pages in general. I have
a port of readability, which is better than boilerpipe for news
articles in particular. It seems to me that I should investigate if
Tika has room for both.

On Thu, Nov 11, 2010 at 4:04 PM, Ted Dunning <te...@gmail.com> wrote:
> I believe that this is included in Tika now (according to Ken Krugler)
>
> On Thu, Nov 11, 2010 at 12:37 PM, Isabel Drost <is...@apache.org> wrote:
>
>> ...
>>
>> As a side note - a project with similar goals was mentioned on the Lucene
>> mailing lists a while ago: http://code.google.com/p/boilerpipe/
>>
>> Cheers,
>> Isabel
>>
>

Re: Lib to pull text from web pages?

Posted by Ted Dunning <te...@gmail.com>.

I believe that this is included in Tika now (according to Ken Krugler)

On Thu, Nov 11, 2010 at 12:37 PM, Isabel Drost <is...@apache.org> wrote:

> ...
>
> As a side note - a project with similar goals was mentioned on the Lucene
> mailing lists a while ago: http://code.google.com/p/boilerpipe/
>
> Cheers,
> Isabel
>

Re: Lib to pull text from web pages?

Posted by Isabel Drost <is...@apache.org>.

On 09.11.2010 Grant Ingersoll wrote:
> Hi Benson, I looked at the site and it seemed interesting, but didn't dig
> any deeper.  Can you give a paragraph on what it does?

If I understood Benson correctly when we were talking at ApacheCon, the library
is meant to provide a means to remove clutter (including navigational content
etc.) from web pages. The original intention was to run it as a clean-up step
before displaying the web page in browsers. However, given that any text
processing usually needs to clean up web pages and extract the relevant
content as a very first step, the idea was to abuse the methods for automated
text extraction as well.

As a side note - a project with similar goals was mentioned on the Lucene 
mailing lists a while ago: http://code.google.com/p/boilerpipe/

Cheers,
Isabel

Re: Lib to pull text from web pages?

Posted by Grant Ingersoll <gs...@apache.org>.

Hi Benson, I looked at the site and it seemed interesting, but didn't dig any deeper.  Can you give a paragraph on what it does?

On Nov 3, 2010, at 5:51 PM, Benson Margulies wrote:

> I've just spent a few days turning
> http://lab.arc90.com/experiments/readability/ into Java. I'll push it
> to github at some point unless Mahout is ready to branch out in this
> direction.

--------------------------
Grant Ingersoll
http://www.lucidimagination.com