Posted to dev@nutch.apache.org by Jay Pound <we...@poundwebhosting.com> on 2005/08/11 22:49:29 UTC

page ranking weights

At which step does Nutch figure out the weight of each page: the updatedb
step or the index step?
Thanks,
-Jay



Re: page ranking weights

Posted by Ken Krugler <kk...@transpac.com>.
>also how does it keep track of incoming links globally on these pages, if
>the weight is determined by # of incoming links then there would have to be
>somewhere it keeps track so when you split your indexes it can still have an
>accurate value for the distributed search?

The WebDB keeps track of this info. It's not in the segments/indexes.

>  > at which step does nutch figure out the weight of each page, the updatedb
>  > step? or the index step?

The updatedb step.

In UpdateDatabaseTool.java's PageContentChanged() method, first all 
of the outlink URLs are harvested from the fetched page. Then a score 
is calculated for each of the pages referenced by these outlink URLs, 
based on the score of the fetched page, multiplied by either the 
internal or external link weight (from Nutch config XML data, both 
1.0 by default), depending on whether the URL is in the same domain 
as the fetched page.

When you inject URLs, there is no referring page, so it arbitrarily 
uses the db.score.injected value (1.0 by default).

So if you leave everything set to default values, and don't perform 
link analysis, I think every page will wind up with a score of 1.0.
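
As a rough illustration of the scoring described above, here is a minimal Java sketch. It is not the code from UpdateDatabaseTool.java: the class name, method names, and the host-name comparison used for "same domain" are all assumptions made for the example. Only the rule itself (score of the fetched page multiplied by the internal or external link weight) and the 1.0 defaults come from the description above.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; names here are illustrative, not Nutch's own.
final class OutlinkScoreSketch {
    // Defaults as quoted in this thread: all 1.0.
    static final float INTERNAL_LINK_WEIGHT = 1.0f;
    static final float EXTERNAL_LINK_WEIGHT = 1.0f;
    static final float INJECTED_SCORE = 1.0f;

    // Assumption: "same domain" is approximated by comparing host names.
    static boolean sameHost(String a, String b) {
        String hostA = URI.create(a).getHost();
        String hostB = URI.create(b).getHost();
        return hostA != null && hostA.equalsIgnoreCase(hostB);
    }

    // Score for each outlink target = score of the fetched page
    // multiplied by the internal or external link weight.
    static Map<String, Float> scoreOutlinks(String pageUrl, float pageScore,
                                            List<String> outlinks,
                                            float internalWeight,
                                            float externalWeight) {
        Map<String, Float> scores = new LinkedHashMap<String, Float>();
        for (String target : outlinks) {
            float weight = sameHost(pageUrl, target) ? internalWeight : externalWeight;
            scores.put(target, pageScore * weight);
        }
        return scores;
    }

    public static void main(String[] args) {
        // An injected seed has no referring page, so it starts at the injected score.
        String seed = "http://example.com/";
        Map<String, Float> scores = scoreOutlinks(seed, INJECTED_SCORE,
                Arrays.asList("http://example.com/about", "http://other.example.org/"),
                INTERNAL_LINK_WEIGHT, EXTERNAL_LINK_WEIGHT);
        for (Map.Entry<String, Float> entry : scores.entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```

With every weight at 1.0, each multiplication leaves the score unchanged, which is why the defaults give no differentiation between pages unless link analysis is run.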

-- Ken
-- 
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200

Re: page ranking weights

Posted by Piotr Kosiorowski <pk...@gmail.com>.
The boost for a page may be calculated in a few different ways (and in a few 
different places in Nutch):
1) PageRank based score
	- calculated by "nutch analyze" command based on WebDB
	- during fetchlist generation scores from WebDB are stored in segment
	- indexing phase uses score to set the boost for a page
2) based on number of incoming links
	- during fetchlist generation inlinks are stored in segment
	- during indexing number of inlinks is read from segment and used in
	  boost calculation

There is a separate command (updatesegs) to update score and inlink 
information in existing segments.
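
To make the two inputs above concrete, here is a small hypothetical Java sketch of how a stored score and an inlink count could be combined into a single boost at indexing time. The formula (score raised to a power, optionally multiplied by a logarithm of the inlink count), the class name, and the parameter names are assumptions for illustration only; the exact calculation should be checked against the indexing code.

```java
// Hypothetical sketch; the formula and names are assumptions, not Nutch's API.
final class BoostSketch {
    // pageScore:        score stored in the segment (e.g. from "nutch analyze")
    // scorePower:       assumed exponent used to dampen the score
    // boostByLinkCount: whether the inlink count should contribute at all
    // inlinkCount:      number of inlinks read from the segment
    static float boost(float pageScore, float scorePower,
                       boolean boostByLinkCount, int inlinkCount) {
        float result = (float) Math.pow(pageScore, scorePower);
        if (boostByLinkCount) {
            // log(e + n) is about 1 for n = 0, so pages without inlinks are not penalized.
            result *= (float) Math.log(Math.E + inlinkCount);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println("score only:       " + boost(4.0f, 0.5f, false, 25));
        System.out.println("score + inlinks:  " + boost(4.0f, 0.5f, true, 25));
    }
}
```

Under these assumptions, a page with the default score of 1.0 and link-count boosting turned off gets a boost of 1.0, so any differentiation has to come from link analysis or from the inlink count.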
Regards
Piotr

Jay Pound wrote:
> also how does it keep track of incoming links globally on these pages, if
> the weight is determined by # of incoming links then there would have to be
> somewhere it keeps track so when you split your indexes it can still have an
> accurate value for the distributed search?
> -J
> ----- Original Message ----- 
> From: "Jay Pound" <we...@poundwebhosting.com>
> To: <nu...@lucene.apache.org>
> Sent: Thursday, August 11, 2005 4:49 PM
> Subject: page ranking weights
> 
> 
> 
>>at which step does nutch figure out the weight of each page, the updatedb
>>step? or the index step?
>>Thanks,
>>-Jay


Re: page ranking weights

Posted by Jay Pound <we...@poundwebhosting.com>.
Also, how does it keep track of incoming links globally for these pages? If
the weight is determined by the number of incoming links, then it would have
to keep track of them somewhere, so that when you split your indexes it still
has an accurate value for distributed search.
-J
----- Original Message ----- 
From: "Jay Pound" <we...@poundwebhosting.com>
To: <nu...@lucene.apache.org>
Sent: Thursday, August 11, 2005 4:49 PM
Subject: page ranking weights


> at which step does nutch figure out the weight of each page, the updatedb
> step? or the index step?
> Thanks,
> -Jay