You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by Boris Aleksandrovsky <ba...@gmail.com> on 2010/06/28 22:06:24 UTC

header/footer identification and general scaping tools

I was wondering if any of you know of any open-source solutions for general
issues which arise in web crawling - how do you remove
headers/footers/javascript and generally cleanup html of a web-page before
indexing? We have a first-pass solution implemented using custom code, but
this must be a problem which a lot of people face, so I am asking here.

Thanks,
Boris

Re: header/footer identification and general scaping tools

Posted by Simon Willnauer <si...@googlemail.com>.

Boris, you might wanna look at http://code.google.com/p/boilerpipe/

simon

On Mon, Jun 28, 2010 at 10:48 PM, Boris Aleksandrovsky
<ba...@gmail.com> wrote:
> Thanks, Sashi, I am asking more about a general library which will remove
> those HTML element which are unwanted/useless for indexing. For instance, we
> are using a general method to remove headers by comparing the structure of
> HTML on the top-level document from the site (e.g. www.nytimes.com) and the
> page being crawled (which happens to be further down in the hierarchy).
> Generally the difference will be the header or the footer. Is there a
> library out there which contains a collection of hacks like that?
>
> On Mon, Jun 28, 2010 at 1:31 PM, Shashi Kant <sk...@sloan.mit.edu> wrote:
>
>> I have used TagSoup to parse the HTML and get the elements of interest.
>> http://ccil.org/~cowan/XML/tagsoup/
>>
>>
>>
>> On Mon, Jun 28, 2010 at 4:06 PM, Boris Aleksandrovsky
>> <ba...@gmail.com> wrote:
>> > I was wondering if any of you know of any open-source solutions for
>> general
>> > issues which arise in web crawling - how do you remove
>> > headers/footers/javascript and generally cleanup html of a web-page
>> before
>> > indexing? We have a first-pass solution implemented using custom code,
>> but
>> > this must be a problem which a lot of people face, so I am asking here.
>> >
>> > Thanks,
>> > Boris
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: header/footer identification and general scaping tools

Posted by Boris Aleksandrovsky <ba...@gmail.com>.

Thanks, Sashi, I am asking more about a general library which will remove
those HTML element which are unwanted/useless for indexing. For instance, we
are using a general method to remove headers by comparing the structure of
HTML on the top-level document from the site (e.g. www.nytimes.com) and the
page being crawled (which happens to be further down in the hierarchy).
Generally the difference will be the header or the footer. Is there a
library out there which contains a collection of hacks like that?

On Mon, Jun 28, 2010 at 1:31 PM, Shashi Kant <sk...@sloan.mit.edu> wrote:

> I have used TagSoup to parse the HTML and get the elements of interest.
> http://ccil.org/~cowan/XML/tagsoup/
>
>
>
> On Mon, Jun 28, 2010 at 4:06 PM, Boris Aleksandrovsky
> <ba...@gmail.com> wrote:
> > I was wondering if any of you know of any open-source solutions for
> general
> > issues which arise in web crawling - how do you remove
> > headers/footers/javascript and generally cleanup html of a web-page
> before
> > indexing? We have a first-pass solution implemented using custom code,
> but
> > this must be a problem which a lot of people face, so I am asking here.
> >
> > Thanks,
> > Boris
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: header/footer identification and general scaping tools

Posted by Shashi Kant <sk...@sloan.mit.edu>.

I have used TagSoup to parse the HTML and get the elements of interest.
http://ccil.org/~cowan/XML/tagsoup/



On Mon, Jun 28, 2010 at 4:06 PM, Boris Aleksandrovsky
<ba...@gmail.com> wrote:
> I was wondering if any of you know of any open-source solutions for general
> issues which arise in web crawling - how do you remove
> headers/footers/javascript and generally cleanup html of a web-page before
> indexing? We have a first-pass solution implemented using custom code, but
> this must be a problem which a lot of people face, so I am asking here.
>
> Thanks,
> Boris
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org