Posted to user@nutch.apache.org by David Wallace <da...@nzqa.govt.nz> on 2005/12/20 00:49:32 UTC

Multiple anchors on same site - what's better than making these unique?

Hi all,
I've been grubbing around with Nutch for a while now, although I'm
still working with 0.7 code.  I notice that when anchors are collected
for a document, they're made unique by domain and by anchor text.  
 
I'm using Nutch for an "intranet style" search engine, on a single
site, so I don't really care about the uniqueness by domain.  However, I
can't help thinking that the uniqueness by anchor text probably isn't
what I want.
 
Suppose my site has 3 pages with links to page X, and the same anchor
text.  I'd kind of like to score page X higher than a page where there's
only one incoming link with that anchor text.  But I don't want to have
this effect swamping the other calculations of page score.  In other
words, if my site has 1000 pages with links to page X, this page should
score a wee bit higher than a similar page with just one incoming link,
but not 1000 times higher.
 
I'm thinking of doing some maths with the number of repetitions of an
anchor, then including the result in the page score.  Something like
log(10+n), or maybe n/(n+2); where n is the number of incoming links
with the same anchor text.  Either of these formulas would make 1000
incoming links score roughly 3 times higher than a single incoming link,
which seems about right to me.
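
As a quick check of the arithmetic above, here is a minimal Python sketch of the two candidate formulas. The function names are illustrative only and are not part of Nutch or Lucene; the logarithm is taken as base 10, although the ratio between two values of a logarithm does not depend on the base.

```python
import math

def log_boost(n):
    """Candidate boost log10(10 + n) for n same-text incoming anchors."""
    return math.log10(10 + n)

def ratio_boost(n):
    """Candidate boost n / (n + 2); approaches 1 as n grows."""
    return n / (n + 2)

# Tabulate both boosts for a few anchor counts.
for n in (1, 3, 1000):
    print(n, round(log_boost(n), 3), round(ratio_boost(n), 3))

# How much higher 1000 links score than a single link under each formula.
log_ratio = log_boost(1000) / log_boost(1)
frac_ratio = ratio_boost(1000) / ratio_boost(1)
print(round(log_ratio, 2), round(frac_ratio, 2))
```

Both ratios come out a little under 3, and both functions increase with n while flattening out for large n.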
 
It looks to me like I'm going to have to make changes deep within the
Lucene page scoring stuff to do this, which I'm not really looking
forward to.  I'd really welcome hearing if anybody has a better solution
to this general problem.  The exact maths isn't too critical.  What's
important is that for small values of n, the page score must increase as
n increases, but the overall effect must diminish as n gets really
large.
 
Thanks in advance,
David.


Re: Multiple anchors on same site - what's better than making these unique?

Posted by Stefan Groschupf <sg...@media-style.com>.
Hi,
did you try...
<property>
   <name>db.ignore.internal.links</name>
   <value>true</value>
   <description>If true, when adding new links to a page, links from
   the same host are ignored.  This is an effective way to limit the
   size of the link database, keeping only the highest quality
   links.
   </description>
</property>

... setting to false?

Stefan





Re: Multiple anchors on same site - what's better than making these unique?

Posted by Doug Cutting <cu...@nutch.org>.
David Wallace wrote:
> I've been grubbing around with Nutch for a while now, although I'm
> still working with 0.7 code.  I notice that when anchors are collected
> for a document, they're made unique by domain and by anchor text.  

Note that this is only done when collecting anchor texts, not when 
computing page scores.

> Suppose my site has 3 pages with links to page X, and the same anchor
> text.  I'd kind of like to score page X higher than a page where there's
> only one incoming link with that anchor text.  But I don't want to have
> this effect swamping the other calculations of page score.  In other
> words, if my site has 1000 pages with links to page X, this page should
> score a wee bit higher than a similar page with just one incoming link,
> but not 1000 times higher.
>  
> I'm thinking of doing some maths with the number of repetitions of an
> anchor, then including the result in the page score.  Something like
> log(10+n), or maybe n/(n+2); where n is the number of incoming links
> with the same anchor text.  Either of these formulas would make 1000
> incoming links score roughly 3 times higher than a single incoming link,
> which seems about right to me.

In the Nutch trunk, page scores are currently computed as sqrt(OPIC).

http://www.nabble.com/-Fwd%3A-Fetch-list-priority--t360125.html#a997304

The OPIC calculation does not consider the domain or anchor text.
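
To illustrate how the square root already damps large differences, here is a minimal Python sketch. It treats the OPIC value as an opaque non-negative number; the function name and the sample values are assumptions made for illustration, not Nutch code.

```python
import math

def page_score(opic):
    """Page score as described above: the square root of the OPIC value."""
    if opic < 0:
        raise ValueError("OPIC value is assumed to be non-negative")
    return math.sqrt(opic)

# Two hypothetical OPIC values that differ by a factor of 1000.
low, high = 1.0, 1000.0
score_ratio = page_score(high) / page_score(low)
print(round(score_ratio, 1))
```

Under this assumption, a 1000-fold difference in OPIC shrinks to a difference of about 32 times in page score, which is weaker damping than the log(10+n) or n/(n+2) formulas proposed in the original question.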

Hope this helps.

Doug