Posted to dev@tika.apache.org by Benson Margulies <bi...@gmail.com> on 2010/11/04 14:02:10 UTC

Boilerpipe is nice, but what about readability?

I just coded a Java port of the arclabs 'readability' javascript code,
which has a very strong reputation as a device for grabbing the useful
content from newsy web pages.

I could contribute it to Tika, if (a) you wanted it, and (b) there was
some reasonable way to decide or configure which one to use.

Re: Boilerpipe is nice, but what about readability?

Posted by Ken Krugler <kk...@transpac.com>.
Hi Otis - thanks for the nudge.

Hi Benson - yes, something like this would be useful.

My personal preference for how to integrate things like this into Tika
is to create a ContentHandler. Then it's trivial to use for extracting
body content, and you can use the TeeContentHandler to run it in
parallel with other handlers.

See BoilerpipeContentHandler in Tika for one example of this approach,
though that code got a bit messy when I changed it to support
including markup.
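
For illustration, here is a minimal sketch of the handler-plus-tee idea,
written against the JDK's SAX classes only so it runs without Tika on the
classpath. Everything named here is an assumption made for the example:
ParagraphTextHandler is a hypothetical stand-in for a readability-style
handler (it just keeps text inside <p> elements, which is not the
readability algorithm), and SimpleTee is a cut-down stand-in for Tika's
TeeContentHandler that forwards only element and character events.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical "main content" handler: keeps only text inside <p> elements.
// A real readability port would score candidate blocks instead.
class ParagraphTextHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int paragraphDepth = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("p".equals(qName)) paragraphDepth++;
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("p".equals(qName)) {
            paragraphDepth--;
            text.append('\n'); // separate paragraphs with a newline
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (paragraphDepth > 0) text.append(ch, start, length);
    }

    String getText() { return text.toString().trim(); }
}

// Collects every character event, i.e. the unfiltered body text.
class AllTextHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    String getText() { return text.toString(); }
}

// Stand-in for a tee: forwards each SAX event to all wrapped handlers.
// Only the three event types used in this sketch are forwarded.
class SimpleTee extends DefaultHandler {
    private final ContentHandler[] handlers;

    SimpleTee(ContentHandler... handlers) { this.handlers = handlers; }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        for (ContentHandler h : handlers) h.startElement(uri, localName, qName, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        for (ContentHandler h : handlers) h.endElement(uri, localName, qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        for (ContentHandler h : handlers) h.characters(ch, start, length);
    }
}

class TeeSketch {
    // Parses well-formed XHTML once and returns {filtered text, full text}.
    static String[] extract(String xhtml) {
        ParagraphTextHandler filtered = new ParagraphTextHandler();
        AllTextHandler full = new AllTextHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), new SimpleTee(filtered, full));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return new String[] { filtered.getText(), full.getText() };
    }

    public static void main(String[] args) {
        String[] out = extract(
            "<html><body><div>Menu</div><p>Hello <b>world</b></p></body></html>");
        System.out.println("filtered: " + out[0]);
        System.out.println("full:     " + out[1]);
    }
}
```

With Tika itself, the same shape would presumably pass the new handler and
an ordinary body-text handler to TeeContentHandler, so a single parse feeds
both and the caller can pick whichever result it wants.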

-- Ken

On Jan 2, 2011, at 10:55am, Otis Gospodnetic wrote:

> Somehow this nice offer didn't seem to attract any responses -
> http://search-lucene.com/m/ZTMKyJXNR92
>
> +1 for this patch.
>
> Otis
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
>
>
> ----- Original Message ----
>> From: Benson Margulies <bi...@gmail.com>
>> To: dev@tika.apache.org
>> Sent: Thu, November 4, 2010 9:02:10 AM
>> Subject: Boilerpipe is nice, but what about readability?
>>
>> I just coded a Java port of the arclabs 'readability' javascript code,
>> which has a very strong reputation as a device for grabbing the useful
>> content from newsy web pages.
>>
>> I could contribute it to Tika, if (a) you wanted it, and (b) there was
>> some reasonable way to decide or configure which one to use.
>>

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g

Re: Boilerpipe is nice, but what about readability?

Posted by Otis Gospodnetic <og...@yahoo.com>.
Somehow this nice offer didn't seem to attract any responses - 
http://search-lucene.com/m/ZTMKyJXNR92

+1 for this patch.

Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



----- Original Message ----
> From: Benson Margulies <bi...@gmail.com>
> To: dev@tika.apache.org
> Sent: Thu, November 4, 2010 9:02:10 AM
> Subject: Boilerpipe is nice, but what about readability?
> 
> I just coded a Java port of the arclabs 'readability' javascript code,
> which has a very strong reputation as a device for grabbing the useful
> content from newsy web pages.
> 
> I could contribute it to Tika, if (a) you wanted it, and (b) there was
> some reasonable way to decide or configure which one to use.
>