Posted to user@nutch.apache.org by Stefan Groschupf <sg...@media-style.com> on 2005/12/20 01:00:29 UTC
Re: Multiple anchors on same site - what's better than making these unique?
Hi,
did you try...
<property>
<name>db.ignore.internal.links</name>
<value>true</value>
<description>If true, when adding new links to a page, links from
the same host are ignored. This is an effective way to limit the
size of the link database, keeping only the highest quality
links.
</description>
</property>
... setting this to false?
Stefan
On 20.12.2005 at 00:49, David Wallace wrote:
> Hi all,
> I've been grubbing around with Nutch for a while now, although I'm
> still working with 0.7 code. I notice that when anchors are collected
> for a document, they're made unique by domain and by anchor text.
>
> I'm using Nutch for an "intranet style" search engine, on a single
> site, so I don't really care about the uniqueness by domain. However,
> I can't help thinking that the uniqueness by anchor text probably
> isn't what I want.
>
> Suppose my site has 3 pages with links to page X, and the same anchor
> text. I'd kind of like to score page X higher than a page where
> there's only one incoming link with that anchor text. But I don't
> want to have this effect swamping the other calculations of page
> score. In other words, if my site has 1000 pages with links to page
> X, this page should score a wee bit higher than a similar page with
> just one incoming link, but not 1000 times higher.
>
> I'm thinking of doing some maths with the number of repetitions of an
> anchor, then including the result in the page score. Something like
> log(10+n), or maybe n/(n+2), where n is the number of incoming links
> with the same anchor text. Either of these formulas would make 1000
> incoming links score roughly 3 times higher than a single incoming
> link, which seems about right to me.
>
> It looks to me like I'm going to have to make changes deep within the
> Lucene page scoring stuff to do this, which I'm not really looking
> forward to. I'd really welcome hearing if anybody has a better
> solution to this general problem. The exact maths isn't too critical.
> What's important is that for small values of n, the page score must
> increase as n increases, but the overall effect must diminish as n
> gets really large.
>
> Thanks in advance,
> David.
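
For anyone who wants to check the numbers, here is a minimal sketch of the two damping formulas David mentions in the quoted message, log(10+n) and n/(n+2). It is not Nutch or Lucene code. The function names are made up for illustration, and base 10 is an assumption for the logarithm (the 1000-to-1 ratio is the same in any base).

```python
import math


def log_damping(n):
    """Hypothetical helper: log(10 + n), with base 10 assumed."""
    return math.log10(10 + n)


def ratio_damping(n):
    """Hypothetical helper: n / (n + 2), which approaches 1 as n grows."""
    return n / (n + 2)


if __name__ == "__main__":
    # n is the number of incoming links sharing the same anchor text.
    for n in (1, 3, 10, 100, 1000):
        print(n, round(log_damping(n), 3), round(ratio_damping(n), 3))

    # How much more 1000 repeated anchors count than a single one.
    print("log ratio:", round(log_damping(1000) / log_damping(1), 2))
    print("n/(n+2) ratio:", round(ratio_damping(1000) / ratio_damping(1), 2))
```

Both ratios come out just under 3, which matches the "roughly 3 times" estimate. Both functions also keep increasing with n, so small counts still make a visible difference. Note that n/(n+2) is capped below 1, while log(10+n) keeps growing slowly without a bound.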
---------------------------------------------------------------
company: http://www.media-style.com
forum: http://www.text-mining.org
blog: http://www.find23.net