Posted to user@nutch.apache.org by Markus Jelsma <ma...@openindex.io> on 2011/07/04 20:55:40 UTC

Stats for link pages

Hi,

Because most of the internet is garbage, I'd like not to index garbage. There
is a huge number of pages that consist of just links and almost no text.

To filter these pages out I intend to build an indexing filter. The problem is
how to detect whether a page should be considered a link page. From what I've
seen there should be a distinct ratio between the amount of text and the
number of outlinks to the same and other domains.

My question: has anyone come across literature on this topic? Or does someone
already have such a ratio defined?

Thanks!
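
For illustration, here is a rough sketch of what such an indexing filter could
look like against the Nutch 1.x IndexingFilter API. The class name, the
configuration property and the default threshold are made up, and the exact
interface may differ between Nutch versions, so treat it as a starting point
rather than a working plugin:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Outlink;
import org.apache.nutch.parse.Parse;

/** Hypothetical filter that drops pages with too few words per outlink. */
public class LinkPageIndexingFilter implements IndexingFilter {

  private Configuration conf;
  private float minWordsPerOutlink;

  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    String text = parse.getText() == null ? "" : parse.getText().trim();
    Outlink[] outlinks = parse.getData().getOutlinks();

    int words = text.isEmpty() ? 0 : text.split("\\s+").length;
    int links = outlinks == null ? 0 : outlinks.length;

    // Hardly any text but plenty of outlinks: treat it as a link page.
    if (links > 0 && (float) words / links < minWordsPerOutlink) {
      return null; // returning null keeps the document out of the index
    }
    return doc;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
    // made-up property name; tune the default against your own corpus
    minWordsPerOutlink = conf.getFloat("linkpage.min.words.per.outlink", 5.0f);
  }

  public Configuration getConf() {
    return conf;
  }
}

If only outlinks pointing to other hosts should count, the Outlink[] array can
be filtered by host before taking its length.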

Re: Stats for link pages

Posted by Markus Jelsma <ma...@openindex.io>.
Thanks again. If you know of any more pages worth looking at, please post
them.

On Tuesday 05 July 2011 12:05:33 Alexander Aristov wrote:
> In that article the author's approach is to extract the text (with links),
> split the whole text into chunks (line by line in the simplest case, or by
> paragraph) and then compare each chunk's text against the number of links or
> amount of garbage text it contains.
> 
> You can take these figures as input and discard a page if the ratio is not
> good.
> 
> Best Regards
> Alexander Aristov

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: Stats for link pages

Posted by Ken Krugler <kk...@transpac.com>.
On Jul 5, 2011, at 3:05am, Alexander Aristov wrote:

> In that article the author's approach is to extract the text (with links),
> split the whole text into chunks (line by line in the simplest case, or by
> paragraph) and then compare each chunk's text against the number of links or
> amount of garbage text it contains.
> 
> You can take these figures as input and discard a page if the ratio is not
> good.

Boilerpipe does something similar, in deciding when to include or discard a block of text.

See the DensityRulesClassifier class. Its descriptive text is:

>  * Classifies {@link TextBlock}s as content/not-content through rules that have
>  * been determined using the C4.8 machine learning algorithm, as described in the
>  * paper "Boilerplate Detection using Shallow Text Features", particularly using
>  * text densities and link densities.

So one approach would be to (a) add a new class that implements BoilerpipeFilter, with the logic you need, and then (b) expose the ability in Nutch to configure Boilerpipe to use this new class.
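
As a rough illustration of the kind of numbers such a class could work with
(this is not a ready-made BoilerpipeFilter, and the 0.6 and 0.75 thresholds
are invented), Boilerpipe's per-block link densities can be aggregated into a
page-level verdict:

import java.io.StringReader;

import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

import de.l3s.boilerpipe.BoilerpipeProcessingException;
import de.l3s.boilerpipe.document.TextBlock;
import de.l3s.boilerpipe.document.TextDocument;
import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;

/** Hypothetical page-level check built on Boilerpipe's block link densities. */
public class LinkPageDetector {

  public static boolean isLinkPage(String html)
      throws BoilerpipeProcessingException, SAXException {
    TextDocument doc = new BoilerpipeSAXInput(
        new InputSource(new StringReader(html))).getTextDocument();

    long totalWords = 0;
    long linkedWords = 0;
    for (TextBlock block : doc.getTextBlocks()) {
      totalWords += block.getNumWords();
      // blocks where most of the words sit inside anchors
      if (block.getLinkDensity() > 0.6) {
        linkedWords += block.getNumWords();
      }
    }
    // Call it a link page when the bulk of the words live in link-dominated
    // blocks, or when there is no text at all.
    return totalWords == 0 || (double) linkedWords / totalWords > 0.75;
  }
}

Wrapped into a BoilerpipeFilter, the same figures could instead flip the
content flag on the offending blocks, which is closer to steps (a) and (b)
above.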

-- Ken


--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom data mining solutions







Re: Stats for link pages

Posted by Alexander Aristov <al...@gmail.com>.
In that article the author's approach is to extract the text (with links),
split the whole text into chunks (line by line in the simplest case, or by
paragraph) and then compare each chunk's text against the number of links or
amount of garbage text it contains.

You can take these figures as input and discard a page if the ratio is not
good.
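
As a crude, self-contained illustration of that chunking idea (this is not
the article's exact algorithm; the regexes and thresholds are placeholders),
something along these lines can be run over the raw HTML:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Crude chunk-based check: many link-dominated chunks means a link page. */
public class ChunkRatio {

  private static final Pattern ANCHOR = Pattern.compile("<a\\s", Pattern.CASE_INSENSITIVE);
  private static final Pattern TAG = Pattern.compile("<[^>]+>");

  public static boolean looksLikeLinkPage(String html) {
    // very rough paragraph split; a real parser would do better
    String[] chunks = html.split("(?i)</p>|<br\\s*/?>|\\n\\s*\\n");
    int linkChunks = 0;
    int textChunks = 0;
    for (String chunk : chunks) {
      int links = countMatches(ANCHOR, chunk);
      String text = TAG.matcher(chunk).replaceAll(" ").trim();
      int words = text.isEmpty() ? 0 : text.split("\\s+").length;
      if (words < 5 && links == 0) {
        continue; // ignore empty or trivial chunks
      }
      // fewer than ~10 words per link marks the chunk as link-dominated
      if (links > 0 && words / (double) links < 10.0) {
        linkChunks++;
      } else {
        textChunks++;
      }
    }
    return textChunks == 0 || (double) linkChunks / (linkChunks + textChunks) > 0.8;
  }

  private static int countMatches(Pattern p, String s) {
    Matcher m = p.matcher(s);
    int n = 0;
    while (m.find()) {
      n++;
    }
    return n;
  }
}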

Best Regards
Alexander Aristov


On 5 July 2011 12:41, Markus Jelsma <ma...@openindex.io> wrote:

> Thanks, both of you.
> I'll do some research on the corpus I have. And Sujit's page is always a
> nice read!

Re: Stats for link pages

Posted by Markus Jelsma <ma...@openindex.io>.
Thanks, both of you.
I'll do some research on the corpus I have. And Sujit's page is always a nice
read!

> Alexander,
> 
> We can already remove boilerplate from HTML pages thanks to Boilerpipe in
> Tika (there is an open issue on JIRA for this). Markus is looking for a way
> to classify an entire page as content-rich vs mostly links.
> Markus: I don't know any specific literature on the subject, but determining
> a ratio of tool words (determiners, conjunctions, etc.) vs the size of the
> text or the number of links sounds like a good approach. I think that the
> new scoring API (see wiki) could also be used / extended for this kind of
> task.
> 
> Jul

Re: Stats for link pages

Posted by Julien Nioche <li...@gmail.com>.
Alexander,

We can already remove boilerplate from HTML pages thanks to Boilerpipe in
Tika (there is an open issue on JIRA for this). Markus is looking for a way
to classify an entire page as content-rich vs mostly links.
Markus: I don't know any specific literature on the subject, but determining
a ratio of tool words (determiners, conjunctions, etc.) vs the size of the
text or the number of links sounds like a good approach. I think that the new
scoring API (see wiki) could also be used / extended for this kind of task.
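
As a tiny illustration of the tool-word idea (the word list and the 0.15
cut-off are arbitrary placeholders, not recommendations), it comes down to
counting function words against the total token count of the extracted text:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical tool-word (function-word) ratio check for extracted text. */
public class ToolWordRatio {

  private static final Set<String> TOOL_WORDS = new HashSet<String>(Arrays.asList(
      "the", "a", "an", "and", "or", "but", "of", "to", "in", "on",
      "with", "for", "is", "are", "was", "that", "this", "it", "as", "by"));

  /** Fraction of tokens that are common English function words. */
  public static double ratio(String text) {
    String[] tokens = text.toLowerCase().split("\\W+");
    if (tokens.length == 0) {
      return 0.0;
    }
    int hits = 0;
    for (String token : tokens) {
      if (TOOL_WORDS.contains(token)) {
        hits++;
      }
    }
    return (double) hits / tokens.length;
  }

  public static boolean looksContentRich(String text) {
    // pages that are mostly link anchors tend to score far below this
    return ratio(text) > 0.15;
  }
}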

Jul


On 5 July 2011 06:52, Alexander Aristov <al...@gmail.com> wrote:

> I have successfully used some algorithms which sort out useful text from
> HTML pages.
> 
> This page gave me ideas:
> http://sujitpal.blogspot.com/2009/11/extracting-useful-text-from-html.html
> 
> Best Regards
> Alexander Aristov



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

Re: Stats for link pages

Posted by Alexander Aristov <al...@gmail.com>.
I have successfully used some algorithms which sort out useful text from
HTML pages.

This page gave me ideas:
http://sujitpal.blogspot.com/2009/11/extracting-useful-text-from-html.html

Best Regards
Alexander Aristov


On 4 July 2011 22:55, Markus Jelsma <ma...@openindex.io> wrote:

> Hi,
>
> Because most of the internet is garbage, I'd like not to index garbage.
> There is a huge number of pages that consist of just links and almost no
> text.
>
> To filter these pages out I intend to build an indexing filter. The problem
> is how to detect whether a page should be considered a link page. From what
> I've seen there should be a distinct ratio between the amount of text and
> the number of outlinks to the same and other domains.
>
> My question: has anyone come across literature on this topic? Or does
> someone already have such a ratio defined?
>
> Thanks!