Posted to dev@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2009/01/12 18:32:15 UTC

[Nutch Wiki] Update of "NewPage" by DennisKubes

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The following page has been changed by DennisKubes:
http://wiki.apache.org/nutch/NewPage

The comment on the change is:
Beginning descriptions of how to use the new webgraph scoring system.

------------------------------------------------------------------------------
- emptyemptyempty!
+ This page describes the new scoring (i.e. WebGraph and Link Analysis) functionality in Nutch as of revision 723441.
  
+ == General Information ==
+ The new scoring functionality can be found in org.apache.nutch.scoring.webgraph.  This package contains multiple programs that build web graphs, perform a stable, convergent link analysis, and update the crawldb with the resulting scores.  These programs assume that fetching cycles have already been completed and that the user now wants to build a global webgraph from those segments and, from that webgraph, perform link analysis to get a single global relevancy score for each url.  Building a webgraph assumes that all links are stored in the current segments being processed.  Links are not held over from one processing cycle to another.  Global link-analysis scores are based on the links currently available and will change as the link structure of the webgraph changes.
+ 
+ Currently the scoring jobs are not integrated into the Nutch script as commands and must be run in the form {{{bin/nutch org.apache.nutch.scoring.webgraph.XXXX}}}.
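+ For example, each job's usage message can be printed by invoking its fully qualified class through the nutch script:
+ 
+ {{{
+ bin/nutch org.apache.nutch.scoring.webgraph.WebGraph -help
+ }}}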
+ 
+ === WebGraph ===
+ The WebGraph program is the first job that must be run once all segments are fetched and ready to be processed.  WebGraph is found at org.apache.nutch.scoring.webgraph.WebGraph. Below is a printout of the program's usage.
+ 
+ {{{
+ usage: WebGraph
+  -help                      show this help message
+  -segment <segment>         the segment(s) to use
+  -webgraphdb <webgraphdb>   the web graph database to use
+ }}}
+ 
+ The WebGraph program can take multiple segments to process and requires an output directory in which to place the completed web graph components.  The WebGraph creates three different components: an inlink database, an outlink database, and a node database.  The inlink database is a listing of each url and all of its inlinks.  The outlink database is a listing of each url and all of its outlinks.  The node database is a listing of each url with node meta information, including the number of inlinks and outlinks, and eventually the score for that node.
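+ As a sketch, a webgraph could be built from two fetched segments like this (the segment names and the crawl/webgraphdb output path are only examples; if your revision expects all segment paths after a single -segment flag rather than a repeated flag, adjust accordingly):
+ 
+ {{{
+ bin/nutch org.apache.nutch.scoring.webgraph.WebGraph \
+   -segment crawl/segments/20090112000000 \
+   -segment crawl/segments/20090112120000 \
+   -webgraphdb crawl/webgraphdb
+ }}}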
+ 
+ === Loops ===
+ Once the web graph is built we can begin the process of link analysis.  Loops is an optional program that attempts to help weed out spam sites by determining link cycles in a web graph.  An example of a link cycle would be sites A, B, C, and D, where A links to B, which links to C, which links to D, which links back to A.  This program is computationally expensive and usually, due to time and space requirements, can't be run on more than a three or four level depth.  While it does identify sites which appear to be spam, and those links are then discounted in the later LinkRank program, its benefit to cost ratio is very low.  It is included in this package for completeness and because there may be a better way to perform this function with a different algorithm.  But on current production webgraphs, its use is discouraged.  Loops is found at org.apache.nutch.scoring.webgraph.Loops. Below is a printout of the program's usage.
+ 
+ {{{
+ usage: Loops
+  -help                      show this help message
+  -webgraphdb <webgraphdb>   the web graph database to use
+ }}}
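+ If you do choose to run it, the invocation follows the same pattern as the other jobs (the webgraphdb path here is only an example):
+ 
+ {{{
+ bin/nutch org.apache.nutch.scoring.webgraph.Loops -webgraphdb crawl/webgraphdb
+ }}}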
+ 
+ === LinkRank ===
+ With the web graph built we can now run LinkRank to perform an iterative link analysis.  LinkRank is a PageRank-like link analysis program that converges to stable global scores for each url.  Similar to PageRank, the LinkRank program starts with a common score for all urls.  It then creates a global score for each url based on the number of incoming links, the scores of those links, and the number of outgoing links from the page.  The process is iterative and scores tend to converge after a given number of iterations.  It is different from PageRank in that nepotistic links, such as links internal to a website and reciprocal links between websites, can be ignored.  The number of iterations can also be configured; by default 10 iterations are performed.  Unlike the previous OPIC scoring, the LinkRank program does not keep scores from one processing run to another.  The web graph and the link scores are recreated at each processing run, so we don't have the problem of ever increasing scores.  LinkRank requires the WebGraph program to have completed successfully, and it stores its output scores for each url in the node database of the webgraph. LinkRank is found at org.apache.nutch.scoring.webgraph.LinkRank. Below is a printout of the program's usage.
+ 
+ {{{
+ usage: LinkRank
+  -help                      show this help message
+  -webgraphdb <webgraphdb>   the web graph db to use
+ }}}
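+ The iteration count and the ignoring of internal and reciprocal links described above are controlled through configuration properties rather than command-line flags.  The property names below are those used by the webgraph code of this era, but should be verified against the nutch-default.xml of your revision; they would be overridden in conf/nutch-site.xml:
+ 
+ {{{
+ <property>
+   <name>link.analyze.num.iterations</name>
+   <value>10</value>
+ </property>
+ <property>
+   <name>link.ignore.internal.host</name>
+   <value>true</value>
+ </property>
+ <property>
+   <name>link.ignore.internal.domain</name>
+   <value>true</value>
+ </property>
+ }}}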
+ 
+ === ScoreUpdater ===
+ Once the LinkRank program has been run and link analysis is completed, the scores must be updated into the crawl database to work with the current Nutch functionality.  The ScoreUpdater program takes the scores stored in the node database of the webgraph and updates them into the crawldb.  If a url exists in the crawldb that doesn't exist in the webgraph, then its score is cleared in the crawldb.  The ScoreUpdater requires that the WebGraph and LinkRank programs have both been run and requires a crawl database to update.  ScoreUpdater is found at org.apache.nutch.scoring.webgraph.ScoreUpdater. Below is a printout of the program's usage.
+ 
+ {{{
+ usage: ScoreUpdater
+  -crawldb <crawldb>         the crawldb to use
+  -help                      show this help message
+  -webgraphdb <webgraphdb>   the webgraphdb to use
+ }}}
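+ Putting the steps together, a complete scoring run (with the optional Loops job omitted, and with purely illustrative paths) would be invoked in this order:
+ 
+ {{{
+ bin/nutch org.apache.nutch.scoring.webgraph.WebGraph -segment crawl/segments/20090112000000 -webgraphdb crawl/webgraphdb
+ bin/nutch org.apache.nutch.scoring.webgraph.LinkRank -webgraphdb crawl/webgraphdb
+ bin/nutch org.apache.nutch.scoring.webgraph.ScoreUpdater -crawldb crawl/crawldb -webgraphdb crawl/webgraphdb
+ }}}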
+ 