Posted to user@nutch.apache.org by Brent Goran <br...@strategoit.com> on 2005/09/02 18:17:26 UTC
per-page "boost" - concise definition anywhere?
Nutch is assigning each page a "boost", which is per-page (not
per-query) and I think it is somewhat analogous to Google's PageRank
(though of course I'm sure not the same algorithm).
Is there an exact definition of this boost anywhere, e.g. how it is
calculated within nutch?
Re: per-page "boost" - concise definition anywhere?
Posted by Michael Ji <fj...@yahoo.com>.
Hi Ken:
As you described, the calculateBoost() method inside
IndexSegment.java does the real work of calculating
a doc's boost value, which is its page rank.
Its code is as follows:
// 1. Start with page's score from DB -- 1.0 if no link analysis.
float res = pageScore;
// 2. Apply scorePower to this.
res = (float)Math.pow(pageScore, scorePower);
// 3. Optionally boost by log of incoming anchor count.
if (boostByLinkCount)
  res *= (float)Math.log(Math.E + linkCount);
It seems to me that this calculation doesn't take into
account the weight (page rank) of the inbound links; it
only considers the number of inbound links.
The typical page rank formula, by contrast, is
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
So, does that mean Nutch's link analysis uses a
different page ranking concept from Google's?
Or am I missing some important points?
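For comparison, here is a minimal sketch of the PR(A) formula above,
where T1..Tn are the pages linking to A, C(T) is the number of
outlinks on page T, and d is the damping factor. The class name
PageRankSketch, the method iterate(), the toy graph and d = 0.85 are
all assumptions made for illustration; this is not code from Nutch
or from Google.

```java
import java.util.Arrays;
import java.util.Locale;

// Illustrative sketch of the textbook formula only; not Nutch code.
class PageRankSketch {

    // outlinks[t] lists the indices of the pages that page t links to.
    // Every page starts at 1.0 and the formula is applied `rounds` times.
    static double[] iterate(int[][] outlinks, double d, int rounds) {
        int n = outlinks.length;
        double[] pr = new double[n];
        Arrays.fill(pr, 1.0);
        for (int r = 0; r < rounds; r++) {
            double[] next = new double[n];
            Arrays.fill(next, 1.0 - d);            // the (1-d) term
            for (int t = 0; t < n; t++) {
                int c = outlinks[t].length;        // C(T)
                for (int a : outlinks[t]) {
                    next[a] += d * pr[t] / c;      // d * PR(T)/C(T)
                }
            }
            pr = next;
        }
        return pr;
    }

    public static void main(String[] args) {
        // Hypothetical graph: page 0 -> 2, page 1 -> 2, page 2 -> 0.
        int[][] outlinks = { {2}, {2}, {0} };
        double[] pr = iterate(outlinks, 0.85, 50);
        for (int i = 0; i < pr.length; i++) {
            System.out.printf(Locale.ROOT, "page %d: %.3f%n", i, pr[i]);
        }
    }
}
```

Unlike the calculateBoost() code quoted above, each inbound link here
contributes the linking page's own score divided by its outlink
count, so a link from a highly ranked page counts for more.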
thanks,
Michael Ji,
>
> In any case, if you just use default Nutch settings
> and don't run the
> DistributedAnalysisTool, then all of the page scores
> are 1.0. So the
> Lucene document boost winds up being ln(e + inbound
> link count). 0
> inbound links == 1.0, 10 links = 2.54, 100 links =
> 4.63, etc.
>
> -- Ken
> --
> Ken Krugler
> TransPac Software, Inc.
> <http://www.transpac.com>
> +1 530-470-9200
>
Re: per-page "boost" - concise definition anywhere?
Posted by Ken Krugler <kk...@transpac.com>.
>Nutch is assigning each page a "boost", which is per-page (not
>per-query) and I think it is somewhat analogous to Google's PageRank
>(though of course I'm sure not the same algorithm).
>
>Is there an exact definition of this boost anywhere, e.g. how it is
>calculated within nutch?
The only "definition" I've found is in the code :)
If you look at org.apache.nutch.indexer.IndexSegment.makeDocument,
you'll see the call to calculateBoost, which calculates a Lucene
document boost value using the page score, the indexer.score.power
configuration value, the indexer.boost.by.link.count configuration
boolean, and the number of inbound links.
The number of inbound links can only be accurately determined (based
on pages crawled, of course) via data from the WebDB, which is why
you'd want to run UpdateSegmentsFromDB before indexing pages, if
you've got indexer.boost.by.link.count set to true. Or do your
indexing after merging all of the segments.
How the page score gets calculated is another topic. I understand the
basic approach, which only relies on the injected link score and the
internal/external link score factors. But the "real" link analysis
algorithm could certainly use a write-up in the Wiki. The specific
question, in case Mike is reading, is how nextScore is used (for
linked-to pages that have outlinks) in the
DistributedAnalysisTool.computeRound() method.
Though maybe the mapred work means that code goes away.
In any case, if you just use default Nutch settings and don't run the
DistributedAnalysisTool, then all of the page scores are 1.0. So the
Lucene document boost winds up being ln(e + inbound link count). 0
inbound links == 1.0, 10 links = 2.54, 100 links = 4.63, etc.
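The numbers above can be reproduced with a small standalone sketch of
the boost calculation. The class name BoostSketch is hypothetical, and
the scorePower value of 0.5 is only a placeholder (with a page score
of 1.0 the exponent has no effect); the arithmetic follows the
calculateBoost() steps quoted in this thread, not the full Nutch
source.

```java
import java.util.Locale;

// Illustrative sketch of the quoted calculateBoost() steps; not the Nutch source.
class BoostSketch {

    static float calculateBoost(float pageScore, float scorePower,
                                boolean boostByLinkCount, int linkCount) {
        // Steps 1 and 2: take the page score and raise it to scorePower.
        float res = (float) Math.pow(pageScore, scorePower);
        // Step 3: optionally multiply by ln(e + inbound link count).
        if (boostByLinkCount) {
            res *= (float) Math.log(Math.E + linkCount);
        }
        return res;
    }

    public static void main(String[] args) {
        float pageScore = 1.0f;   // score when no link analysis has been run
        float scorePower = 0.5f;  // placeholder value, assumed for illustration
        for (int links : new int[] {0, 10, 100}) {
            float boost = calculateBoost(pageScore, scorePower, true, links);
            System.out.printf(Locale.ROOT, "%d links -> %.2f%n", links, boost);
        }
    }
}
```

With a page score of 1.0 this prints 1.00, 2.54 and 4.63 for 0, 10
and 100 inbound links, matching the figures above.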
-- Ken
--
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200
Re: per-page "boost" - concise definition anywhere?
Posted by Michael Ji <fj...@yahoo.com>.
It is done in the IndexSegment.java file in Nutch.
Michael Ji
--- Brent Goran <br...@strategoit.com> wrote:
> Nutch is assigning each page a "boost", which is
> per-page (not
> per-query) and I think it is somewhat analogous to
> Google's PageRank
> (though of course I'm sure not the same algorithm).
>
> Is there an exact definition of this boost anywhere,
> e.g. how it is
> calculated within nutch?
>
>
>