Posted to user@nutch.apache.org by Brent Goran <br...@strategoit.com> on 2005/09/02 18:17:26 UTC

per-page "boost" - concise definition anywhere?

Nutch is assigning each page a "boost", which is per-page (not
per-query) and I think it is somewhat analogous to Google's PageRank
(though of course I'm sure not the same algorithm).

Is there an exact definition of this boost anywhere, e.g. how it is
calculated within nutch?



Re: per-page "boost" - concise definition anywhere?

Posted by Michael Ji <fj...@yahoo.com>.
Hi Ken:

As you described, the calculateBoost() method in
IndexSegment.java does the real work of calculating
a doc's boost value, which serves as its page rank.

Here is its code:

// 1. Start with page's score from DB -- 1.0 if no link analysis.
float res = pageScore;
// 2. Apply scorePower to this.
res = (float)Math.pow(pageScore, scorePower);
// 3. Optionally boost by log of incoming anchor count.
if (boostByLinkCount)
   res *= (float)Math.log(Math.E + linkCount);


It seems to me that this calculation does not take
the weight (page rank) of the inbound links into
account. It considers only the number of inbound links.

By contrast, the typical PageRank formula is
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
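
To make the contrast concrete, here is a minimal sketch of the formula above as an iterative computation. It is not Nutch's or Google's code: the class name PageRankSketch, the three-page graph, the damping factor d = 0.85, and the fixed iteration count are all assumptions chosen for illustration.

```java
import java.util.Arrays;

public class PageRankSketch {
    // outlinks[t] lists the pages that page t links to (hypothetical graph).
    static double[] pageRank(int[][] outlinks, double d, int iterations) {
        int n = outlinks.length;
        double[] pr = new double[n];
        Arrays.fill(pr, 1.0);                  // start every page at 1.0
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            Arrays.fill(next, 1.0 - d);        // the (1-d) term
            for (int t = 0; t < n; t++) {
                for (int a : outlinks[t]) {
                    // page t contributes d * PR(t)/C(t) to each page it links to
                    next[a] += d * pr[t] / outlinks[t].length;
                }
            }
            pr = next;
        }
        return pr;
    }

    public static void main(String[] args) {
        // Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
        int[][] outlinks = { {1, 2}, {2}, {0} };
        double[] pr = pageRank(outlinks, 0.85, 50);
        for (int i = 0; i < pr.length; i++) {
            System.out.printf("PR(%d) = %.3f%n", i, pr[i]);
        }
    }
}
```

Here each inbound link is weighted by the rank of the page it comes from, whereas the calculateBoost() code only multiplies by a function of the raw inbound link count.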

So, does that mean Nutch's link analysis uses a
different page-ranking concept from Google's?

Or am I missing some important point?

thanks,

Michael Ji,
 

> 
> In any case, if you just use default Nutch settings and don't run the
> DistributedAnalysisTool, then all of the page scores are 1.0. So the
> Lucene document boost winds up being ln(e + inbound link count). 0
> inbound links == 1.0, 10 links = 2.54, 100 links = 4.63, etc.
> 
> -- Ken
> -- 
> Ken Krugler
> TransPac Software, Inc.
> <http://www.transpac.com>
> +1 530-470-9200
> 



Re: per-page "boost" - concise definition anywhere?

Posted by Ken Krugler <kk...@transpac.com>.
>Nutch is assigning each page a "boost", which is per-page (not
>per-query) and I think it is somewhat analogous to Google's PageRank
>(though of course I'm sure not the same algorithm).
>
>Is there an exact definition of this boost anywhere, e.g. how it is
>calculated within nutch?

The only "definition" I've found is in the code :)

If you look at org.apache.nutch.indexer.IndexSegment.makeDocument, 
you'll see the call to calculateBoost, which calculates a Lucene 
document boost value using the page score, the indexer.score.power 
configuration value, the indexer.boost.by.link.count configuration 
boolean, and the number of inbound links.

The number of inbound links can only be accurately determined (based 
on pages crawled, of course) via data from the WebDB, which is why 
you'd want to run UpdateSegmentsFromDB before indexing pages, if 
you've got indexer.boost.by.link.count set to true. Or do your 
indexing after merging all of the segments.

How the page score gets calculated is another topic. I understand the 
basic approach, which only relies on the injected link score and the 
internal/external link score factors. But the "real" link analysis 
algorithm could certainly use a write-up in the Wiki. The specific 
question, in case Mike is reading, is how nextScore is used (for 
linked-to pages that have outlinks) in the 
DistributedAnalysisTool.computeRound() method.

Though maybe the mapred work means that code goes away.

In any case, if you just use default Nutch settings and don't run the 
DistributedAnalysisTool, then all of the page scores are 1.0. So the 
Lucene document boost winds up being ln(e + inbound link count). 0 
inbound links == 1.0, 10 links = 2.54, 100 links = 4.63, etc.
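
Those figures can be reproduced with a small standalone sketch. The method body mirrors the calculateBoost logic quoted elsewhere in this thread. The class name BoostSketch and the scorePower value of 0.5 are assumptions for illustration (with a page score of 1.0, the exponent has no effect on the result).

```java
public class BoostSketch {
    // Same steps as the quoted snippet: raise the page score to scorePower,
    // then optionally multiply by ln(e + inbound link count).
    static float calculateBoost(float pageScore, float scorePower,
                                boolean boostByLinkCount, int linkCount) {
        float res = (float) Math.pow(pageScore, scorePower);
        if (boostByLinkCount) {
            res *= (float) Math.log(Math.E + linkCount);
        }
        return res;
    }

    public static void main(String[] args) {
        // Page score 1.0 corresponds to the "no link analysis" case.
        for (int links : new int[] {0, 10, 100}) {
            System.out.printf("%d inbound links -> boost %.2f%n",
                    links, calculateBoost(1.0f, 0.5f, true, links));
        }
    }
}
```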

-- Ken
-- 
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200

Re: per-page "boost" - concise definition anywhere?

Posted by Michael Ji <fj...@yahoo.com>.
It is done by the IndexSegment.java file in Nutch.

Michael Ji

--- Brent Goran <br...@strategoit.com> wrote:

> Nutch is assigning each page a "boost", which is per-page (not
> per-query) and I think it is somewhat analogous to Google's PageRank
> (though of course I'm sure not the same algorithm).
> 
> Is there an exact definition of this boost anywhere, e.g. how it is
> calculated within nutch?
> 
> 
> 

