Posted to general@lucene.apache.org by Ted Dunning <te...@gmail.com> on 2009/12/04 22:59:46 UTC

Re: Announcement: Boilerplate removal library

Nice paper.  I haven't read the software yet, but I would expect it to have
similar qualities.

Have you considered how boilerpipe might be integrated into a Lucene
analyzer?

2009/12/4 Christian Kohlschütter <ko...@l3s.de>

> Dear all,
>
> I am happy to announce the release of Boilerpipe 1.0.
>
> Boilerpipe is a Java library for boilerplate removal and fulltext
> extraction from HTML pages.
> It is based on my paper "Boilerplate Detection using Shallow Text Features"
> to be presented at WSDM 2010 -- The Third ACM International Conference on
> Web Search and Data Mining, 3-6 February 2010, New York City, NY USA.
>
> The boilerpipe library provides algorithms to detect and remove the surplus
> "clutter" (boilerplate, templates) around the main textual content of a
> website. It already provides specific strategies for common tasks (for
> example: news article extraction) and may also be easily extended for
> individual problem settings. Extracting content is very fast (milliseconds),
> just needs the input document (no global or site-level information required)
> and is usually quite accurate.
>
> You can find Boilerpipe at http://code.google.com/p/boilerpipe/
>
> The code is released under the Apache 2.0 license and you are very welcome
> to use Boilerpipe for whatever you like. Please let me know if it helps
> you, or if you have questions about it, difficulties with it, or ideas on
> how to improve it.
>
> Cheers,
> Christian
>
> PS: The website already provides version 1.0.1 (now includes the dependency
> jars in the binary tarball)
> --
> Christian Kohlschütter
> kohlschuetter@L3S.de
>
> Forschungszentrum L3S
> Leibniz Universität Hannover
>
> http://www.L3S.de/~kohlschuetter/ <http://www.L3S.de/%7Ekohlschuetter/>
>
>


-- 
Ted Dunning, CTO
DeepDyve

Re: Announcement: Boilerplate removal library

Posted by Christian Kohlschütter <ko...@L3S.de>.

Hello Hoss,

> 
> Does the code as currently implemented maintain position 
> mapping information?

Yes, to some extent. Boilerpipe internally arranges the text as blocks (portions of text), and each block may be marked as content or boilerplate. In addition, the number of tokens in each block is counted, so it is relatively easy to keep track of positions at the document level.
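
A minimal sketch of that bookkeeping, assuming a simplified block type: `BlockSketch` and `BlockPositions` are hypothetical names for illustration, not boilerpipe's actual classes, and tokens are approximated by whitespace splitting. It only shows how per-block token counts yield a document-level start position for each content block.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a text block; not the library's actual class. */
final class BlockSketch {
    final String text;
    final int numTokens;      // tokens counted in this block (whitespace-separated)
    final boolean isContent;  // true = main content, false = boilerplate

    BlockSketch(String text, boolean isContent) {
        this.text = text;
        this.isContent = isContent;
        String trimmed = text.trim();
        this.numTokens = trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }
}

final class BlockPositions {
    /**
     * Returns, for each content block, the document-level position of its
     * first token, counting the tokens of every preceding block
     * (content and boilerplate alike).
     */
    static List<Integer> contentStartPositions(List<BlockSketch> blocks) {
        List<Integer> starts = new ArrayList<>();
        int position = 0;
        for (BlockSketch block : blocks) {
            if (block.isContent) {
                starts.add(position);
            }
            position += block.numTokens;
        }
        return starts;
    }

    public static void main(String[] args) {
        List<BlockSketch> blocks = Arrays.asList(
            new BlockSketch("Home | News | Contact", false),          // 5 tokens
            new BlockSketch("Boilerpipe 1.0 has been released", true), // 5 tokens
            new BlockSketch("Copyright 2009", false),                  // 2 tokens
            new BlockSketch("It extracts the main text", true));       // 5 tokens
        System.out.println(contentStartPositions(blocks)); // prints [5, 12]
    }
}
```

Because boilerplate blocks still advance the counter, the positions refer to the original document rather than to the extracted text.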

Christian


On 14.12.2009 at 23:52, Chris Hostetter wrote:

> 
> : working with such a setup for a long time now). Integrating it into an 
> : Analyzer should be fairly simple as Boilerpipe can return a string which 
> : in turn can be parsed just like any other text.
> 
> treating the boilerplate removal library as a black box String->String 
> transformation seems fairly trivial and could easily be done by 
> java applications prior to constructing an Analyzer (ie: 
> String->[boilerblackbox]->String->[Analyzer]->TokenStream)
> 
> Where things would probably get more complicated is trying to maintain 
> term position information from the original source text (for 
> things like search result highlighting and whatnot) which would probably 
> require doing the boilerplate removal via something like the CharFilter 
> abstraction (or directly in a tokenizer).
> 
> Does the code as currently implemented maintain position 
> mapping information?
> 
> 
> -Hoss
> 

-- 
Christian Kohlschütter
kohlschuetter@L3S.de

L3S Research Center
Forschungszentrum L3S / Leibniz Universität Hannover

http://www.L3S.de/~kohlschuetter

Re: Announcement: Boilerplate removal library

Posted by Chris Hostetter <ho...@fucit.org>.

: working with such a setup for a long time now). Integrating it into an 
: Analyzer should be fairly simple as Boilerpipe can return a string which 
: in turn can be parsed just like any other text.

Treating the boilerplate removal library as a black box String->String 
transformation seems fairly trivial and could easily be done by 
Java applications prior to constructing an Analyzer (i.e.: 
String->[boilerblackbox]->String->[Analyzer]->TokenStream).
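
A rough sketch of that black-box chain, assuming stand-ins for both stages: `keepParagraphs` is a toy remover that keeps only `<p>` text (it is not boilerpipe), and `analyze` is a trivial lowercase tokenizer standing in for a Lucene Analyzer. All names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BlackBoxPipeline {
    /** Toy boilerplate remover (assumption for the demo): keeps text inside <p>...</p>. */
    static String keepParagraphs(String html) {
        Matcher m = Pattern.compile("<p>(.*?)</p>", Pattern.DOTALL).matcher(html);
        StringBuilder sb = new StringBuilder();
        while (m.find()) {
            sb.append(m.group(1)).append(' ');
        }
        return sb.toString().trim();
    }

    /** Stand-in for the analysis step: lowercases and splits on non-letter/digit runs. */
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** String -> [boilerplate remover] -> String -> [analysis] -> tokens. */
    static List<String> extractThenAnalyze(String html, Function<String, String> remover) {
        return analyze(remover.apply(html));
    }

    public static void main(String[] args) {
        String html = "<div>Home | About</div><p>Boilerpipe is released</p><div>Copyright</div>";
        // prints [boilerpipe, is, released]
        System.out.println(extractThenAnalyze(html, BlackBoxPipeline::keepParagraphs));
    }
}
```

The remover is passed in as a plain `Function<String, String>`, so the analysis side never needs to know which extraction strategy produced the text.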

Where things would probably get more complicated is trying to maintain 
term position information from the original source text (for 
things like search result highlighting and whatnot), which would probably 
require doing the boilerplate removal via something like the CharFilter 
abstraction (or directly in a tokenizer).
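
To illustrate the kind of mapping this would need, here is a small sketch that records which original character ranges survive the removal and translates an offset in the filtered text back to the original. `OffsetMappingSketch` is a hypothetical helper, not Lucene's CharFilter; it only shows the idea of offset correction.

```java
import java.util.ArrayList;
import java.util.List;

final class OffsetMappingSketch {
    private final StringBuilder filtered = new StringBuilder();
    // Each entry: {start in filtered text, start in original text, length}
    private final List<int[]> spans = new ArrayList<>();

    /** Appends original[start, end) to the filtered text and records its origin. */
    void keep(String original, int start, int end) {
        spans.add(new int[] {filtered.length(), start, end - start});
        filtered.append(original, start, end);
    }

    String filteredText() {
        return filtered.toString();
    }

    /** Maps an offset in the filtered text back to the offset in the original text. */
    int toOriginalOffset(int filteredOffset) {
        for (int[] span : spans) {
            if (filteredOffset >= span[0] && filteredOffset < span[0] + span[2]) {
                return span[1] + (filteredOffset - span[0]);
            }
        }
        throw new IndexOutOfBoundsException("offset " + filteredOffset);
    }

    public static void main(String[] args) {
        String original = "MENU Hello FOOTER world";
        OffsetMappingSketch mapping = new OffsetMappingSketch();
        mapping.keep(original, 5, 11);   // "Hello "
        mapping.keep(original, 18, 23);  // "world"
        System.out.println(mapping.filteredText());       // prints Hello world
        System.out.println(mapping.toOriginalOffset(6));  // prints 18
    }
}
```

A highlighter working on the filtered text could use such a lookup to mark the matching range in the original page.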

Does the code as currently implemented maintain position 
mapping information?


-Hoss

Re: Announcement: Boilerplate removal library

Posted by Christian Kohlschütter <ko...@L3S.de>.

Yes, indeed.
Maybe I should come up with such an Analyzer in a boilerpipe-lucene package...

Christian

On 14.12.2009 at 16:15, Ted Dunning wrote:

> Storing the original would be an excellent idea and would be quite doable.
> 
> 2009/12/14 Christian Kohlschütter <ko...@l3s.de>
> 
>> However it would also be great (in order to increase recall) to also store
>> non-content and just add some kind of static boosting for content blocks
>> over non-content blocks. I am not sure whether this will work right now
>> using an Analyzer. What you could do though, is to store the text into
>> separate fields ("content"/"boilerplate") and add field-specific boosts at
>> query time.
>> 
> 
> 
> 
> -- 
> Ted Dunning, CTO
> DeepDyve

-- 
Christian Kohlschütter
kohlschuetter@L3S.de

L3S Research Center
Forschungszentrum L3S / Leibniz Universität Hannover

http://www.L3S.de/~kohlschuetter

Re: Announcement: Boilerplate removal library

Posted by Ted Dunning <te...@gmail.com>.

Storing the original would be an excellent idea and would be quite doable.

2009/12/14 Christian Kohlschütter <ko...@l3s.de>

> However it would also be great (in order to increase recall) to also store
> non-content and just add some kind of static boosting for content blocks
> over non-content blocks. I am not sure whether this will work right now
> using an Analyzer. What you could do though, is to store the text into
> separate fields ("content"/"boilerplate") and add field-specific boosts at
> query time.
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Announcement: Boilerplate removal library

Posted by Christian Kohlschütter <ko...@L3S.de>.

Hi Ted,

Thanks for your email, and sorry for replying so late; I overlooked your posting.

Adding boilerpipe to Lucene is definitely a good idea (I have been working with such a setup for a long time now).
Integrating it into an Analyzer should be fairly simple, as Boilerpipe can return a string which in turn can be parsed just like any other text.

However, it would also be great (in order to increase recall) to store the non-content as well and just add some kind of static boosting for content blocks over non-content blocks. I am not sure whether this will work right now using an Analyzer. What you could do, though, is store the text in separate fields ("content"/"boilerplate") and add field-specific boosts at query time.
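
As a sketch of the query-time side, the helper below builds a query string in Lucene's classic `field:term^boost` syntax for the two fields. `FieldBoostQuery` is a hypothetical name, the boost values 2.0 and 0.5 are arbitrary illustrations, and the term is assumed to be a simple word that needs no escaping.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.StringJoiner;

final class FieldBoostQuery {
    /**
     * Builds one boosted clause per field, e.g. "content:foo^2.0 boilerplate:foo^0.5".
     * Assumes the term contains no query-syntax special characters.
     */
    static String build(String term, Map<String, Float> fieldBoosts) {
        StringJoiner query = new StringJoiner(" ");
        for (Map.Entry<String, Float> entry : fieldBoosts.entrySet()) {
            query.add(String.format(Locale.ROOT, "%s:%s^%.1f",
                entry.getKey(), term, entry.getValue()));
        }
        return query.toString();
    }

    public static void main(String[] args) {
        // Illustrative boosts only: favor matches in content over boilerplate.
        Map<String, Float> boosts = new LinkedHashMap<>();
        boosts.put("content", 2.0f);
        boosts.put("boilerplate", 0.5f);
        // prints content:lucene^2.0 boilerplate:lucene^0.5
        System.out.println(build("lucene", boosts));
    }
}
```

Keeping both fields searchable preserves recall, while the relative boosts decide how much a boilerplate-only match should count.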

Cheers,
Christian


-- 
Christian Kohlschütter
kohlschuetter@L3S.de

L3S Research Center
Forschungszentrum L3S / Leibniz Universität Hannover

http://www.L3S.de/~kohlschuetter