Posted to user@nutch.apache.org by Stefan Groschupf <sg...@media-style.com> on 2005/12/20 01:00:29 UTC

Re: Multiple anchors on same site - what's better than making these unique?

Hi,
did you try...
<property>
   <name>db.ignore.internal.links</name>
   <value>true</value>
   <description>If true, when adding new links to a page, links from
   the same host are ignored.  This is an effective way to limit the
   size of the link database, keeping only the highest quality
   links.
   </description>
</property>

... setting this to false?

Stefan

On 20.12.2005 at 00:49, David Wallace wrote:

> Hi all,
> I've been grubbing around with Nutch for a while now, although I'm
> still working with 0.7 code.  I notice that when anchors are collected
> for a document, they're made unique by domain and by anchor text.
>
> I'm using Nutch for an "intranet style" search engine, on a single
> site, so I don't really care about the uniqueness by domain.  However, I
> can't help thinking that the uniqueness by anchor text probably isn't
> what I want.
>
> Suppose my site has 3 pages with links to page X, and the same anchor
> text.  I'd kind of like to score page X higher than a page where there's
> only one incoming link with that anchor text.  But I don't want to have
> this effect swamping the other calculations of page score.  In other
> words, if my site has 1000 pages with links to page X, this page should
> score a wee bit higher than a similar page with just one incoming link,
> but not 1000 times higher.
>
> I'm thinking of doing some maths with the number of repetitions of an
> anchor, then including the result in the page score.  Something like
> log(10+n), or maybe n/(n+2); where n is the number of incoming links
> with the same anchor text.  Either of these formulas would make 1000
> incoming links score roughly 3 times higher than a single incoming link,
> which seems about right to me.
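
The "roughly 3 times" figure can be checked with a short Python sketch.
The function names below are made up for illustration and are not part
of Nutch or Lucene. The base of the logarithm is not stated in the
message; the ratio of two logarithms does not depend on the base, so the
natural log is used here.

```python
import math


def anchor_boost_log(n: int) -> float:
    """Hypothetical boost log(10 + n) for n incoming links with the same anchor text."""
    return math.log(10 + n)


def anchor_boost_ratio(n: int) -> float:
    """Hypothetical boost n / (n + 2); always stays below 1."""
    return n / (n + 2)


# Compare a page with 1000 identical incoming anchors to a page with one.
for name, boost in (("log(10+n)", anchor_boost_log), ("n/(n+2)", anchor_boost_ratio)):
    factor = boost(1000) / boost(1)
    print(f"{name}: 1000 links score {factor:.2f} times a single link")
```

Both factors come out a little under 3, which matches the estimate in
the quoted paragraph.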
>
> It looks to me like I'm going to have to make changes deep within the
> Lucene page scoring stuff to do this, which I'm not really looking
> forward to.  I'd really welcome hearing if anybody has a better solution
> to this general problem.  The exact maths isn't too critical.  What's
> important is that for small values of n, the page score must increase as
> n increases, but the overall effect must diminish as n gets really
> large.
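
The two requirements in the paragraph above (the score rises with n, and
each additional link adds less than the previous one) can be tested for
any candidate formula. This is a minimal Python sketch; the helper names
and the range 1..1000 are assumptions chosen for illustration, not
anything taken from Nutch or Lucene.

```python
import math


def marginal_gains(boost, max_n):
    """Return boost(n + 1) - boost(n) for n = 1 .. max_n - 1."""
    values = [boost(n) for n in range(1, max_n + 1)]
    return [later - earlier for earlier, later in zip(values, values[1:])]


def has_diminishing_returns(boost, max_n=1000):
    """True if boost is strictly increasing and each extra link adds less than the last."""
    gains = marginal_gains(boost, max_n)
    increasing = all(g > 0 for g in gains)
    shrinking = all(a > b for a, b in zip(gains, gains[1:]))
    return increasing and shrinking


# The two candidate formulas from the message, plus a plain link count for contrast.
candidates = {
    "log(10+n)": lambda n: math.log(10 + n),
    "n/(n+2)": lambda n: n / (n + 2),
    "n": lambda n: float(n),
}

for name, boost in candidates.items():
    print(name, has_diminishing_returns(boost))
```

One difference between the two proposals: n/(n+2) can never exceed 1,
while log(10+n) keeps growing slowly without limit, so they behave
differently for very large n.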
>
> Thanks in advance,
> David.
>

---------------------------------------------------------------
company:        http://www.media-style.com
forum:        http://www.text-mining.org
blog:            http://www.find23.net