Posted to dev@mahout.apache.org by Benson Margulies <bi...@gmail.com> on 2010/11/03 22:51:57 UTC
Lib to pull text from web pages?
I've just spent a few days turning
http://lab.arc90.com/experiments/readability/ into Java. I'll push it
to github at some point unless Mahout is ready to branch out in this
direction.
Re: Lib to pull text from web pages?
Posted by Grant Ingersoll <gs...@apache.org>.
Sounds useful to me.
On Nov 11, 2010, at 3:37 PM, Isabel Drost wrote:
> On 09.11.2010 Grant Ingersoll wrote:
>> Hi Benson, I looked at the site and it seemed interesting, but didn't dig
>> any deeper. Can you give a paragraph on what it does?
>
> If I understood Benson correctly when we were talking at ApacheCon, the library
> is meant to provide a means to remove clutter (including navigational content
> etc.) from web pages. The original intention was to run it as a clean-up step
> before displaying the web page in browsers. However, given that any text
> processing usually needs cleaning up web pages and extracting the relevant
> content as a very first step, the idea was to abuse the methods for automated
> text extraction as well.
>
> As a side note - a project with similar goals was mentioned on the Lucene
> mailing lists a while ago: http://code.google.com/p/boilerpipe/
>
> Cheers,
> Isabel
Re: Lib to pull text from web pages?
Posted by Benson Margulies <bi...@gmail.com>.
Tika has boilerpipe, which is not bad for web pages in general. I have
a port of readability, which is better than boilerpipe for news
articles in particular. It seems to me that I should investigate if
Tika has room for both.
On Thu, Nov 11, 2010 at 4:04 PM, Ted Dunning <te...@gmail.com> wrote:
> I believe that this is included in Tika now (according to Ken Krugler)
>
> On Thu, Nov 11, 2010 at 12:37 PM, Isabel Drost <is...@apache.org> wrote:
>
>> ...
>>
>> As a side note - a project with similar goals was mentioned on the Lucene
>> mailing lists a while ago: http://code.google.com/p/boilerpipe/
>>
>> Cheers,
>> Isabel
>>
>
Re: Lib to pull text from web pages?
Posted by Ted Dunning <te...@gmail.com>.
I believe that this is included in Tika now (according to Ken Krugler)
On Thu, Nov 11, 2010 at 12:37 PM, Isabel Drost <is...@apache.org> wrote:
> ...
>
> As a side note - a project with similar goals was mentioned on the Lucene
> mailing lists a while ago: http://code.google.com/p/boilerpipe/
>
> Cheers,
> Isabel
>
Re: Lib to pull text from web pages?
Posted by Isabel Drost <is...@apache.org>.
On 09.11.2010 Grant Ingersoll wrote:
> Hi Benson, I looked at the site and it seemed interesting, but didn't dig
> any deeper. Can you give a paragraph on what it does?
If I understood Benson correctly when we were talking at ApacheCon, the library
is meant to provide a means to remove clutter (including navigational content
etc.) from web pages. The original intention was to run it as a clean-up step
before displaying the web page in browsers. However, given that any text
processing usually needs cleaning up web pages and extracting the relevant
content as a very first step, the idea was to abuse the methods for automated
text extraction as well.
As a side note - a project with similar goals was mentioned on the Lucene
mailing lists a while ago: http://code.google.com/p/boilerpipe/
Cheers,
Isabel
Re: Lib to pull text from web pages?
Posted by Grant Ingersoll <gs...@apache.org>.
Hi Benson, I looked at the site and it seemed interesting, but didn't dig any deeper. Can you give a paragraph on what it does?
On Nov 3, 2010, at 5:51 PM, Benson Margulies wrote:
> I've just spent a few days turning
> http://lab.arc90.com/experiments/readability/ into Java. I'll push it
> to github at some point unless Mahout is ready to branch out in this
> direction.
--------------------------
Grant Ingersoll
http://www.lucidimagination.com