Posted to dev@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2011/08/06 23:30:54 UTC

[Nutch Wiki] Trivial Update of "NewScoring" by LewisJohnMcgibbney

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "NewScoring" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/NewScoring?action=diff&rev1=4&rev2=5

  This page describes the new scoring (i.e. !WebGraph and Link Analysis) functionality in Nutch as of revision 723441. See also the [[NewScoringIndexingExample|new scoring example]].
+ 
+ <<TableOfContents(3)>>
  
  == General Information ==
  The new scoring functionality can be found in org.apache.nutch.scoring.webgraph.  This package contains multiple programs that build web graphs, perform a stable convergent link-analysis, and update the crawldb with those scores.  These programs assume that fetching cycles have already been completed and now the users want to build a global webgraph from those segments and from that webgraph perform link-analysis to get a single global relevancy score for each url.  Building a webgraph assumes that all links are stored in the current segments to be processed.  Links are not held over from one processing cycle to another.  Global link-analysis scores are based on the current links available and scores will change as the link structure of the webgraph changes.
@@ -47, +49 @@

   -webgraphdb <webgraphdb>   the webgraphdb to use
  }}}
  
+ == Questions ==
+ 
+ === If internal links are not ignored, would the !LinkRank scores be equivalent to !PageRank scores? ===
+ 
+ To answer this, we first need to explain exactly how the !LinkRank scores are calculated.
+ 
+ The !WebGraph and !LinkRank classes work together.  The !WebGraph is where links from the same domain or the same host can be ignored (or allowed).  The configuration parameters:
+ {{{
+ link.ignore.internal.host = true|false
+ link.ignore.internal.domain = true|false
+ }}}
+ can be used to change that behavior.  By default, links from the same domain and the same host are ignored, so a link from news.google.com to www.google.com wouldn't be counted and wouldn't raise the score for www.google.com.  The !WebGraph just builds the lists of inlinks, outlinks, and nodes; the !LinkRank class then processes them to create the scores.  !LinkRank follows the original !PageRank formula very closely, which is something like:
+ 
+ '''(1 - !dampingFactor) + (!dampingFactor * !totalInlinkScore)'''
+ 
+ Where !totalInlinkScore is calculated from all the inlinks pointing to a page, taking into account that this is iterative and that all pages start off with the !rankOne score, which is (1 / !numLinksInWebGraph).
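
The formula above can be illustrated with a minimal Python sketch. This is not the Nutch implementation: the function name `link_rank`, its parameters, and the graph representation are hypothetical, and it assumes (as in the original PageRank formulation) that each inlinking page contributes its current score divided by its number of outlinks. The caller supplies the count used for the starting score.

```python
def link_rank(outlinks, num_links, damping=0.85, iterations=10):
    """Illustrative fixed-iteration link scoring.

    outlinks  -- dict mapping each page to the list of pages it links to
    num_links -- the count used for the starting score (1 / num_links)
    """
    # Collect every page that appears as a source or a target.
    pages = set(outlinks)
    for targets in outlinks.values():
        pages.update(targets)

    # Every page starts with the same "rank one" score.
    rank_one = 1.0 / num_links
    scores = {page: rank_one for page in pages}

    for _ in range(iterations):
        # Sum the contributions of all inlinks for each page.
        # Assumption: a page splits its score evenly over its outlinks.
        total_inlink = {page: 0.0 for page in pages}
        for source, targets in outlinks.items():
            if not targets:
                continue
            share = scores[source] / len(targets)
            for target in targets:
                total_inlink[target] += share

        # (1 - dampingFactor) + (dampingFactor * totalInlinkScore)
        scores = {
            page: (1 - damping) + damping * total_inlink[page]
            for page in pages
        }
    return scores
```

For example, `link_rank({"a": ["c"], "b": ["c"]}, num_links=2)` leaves the two pages without inlinks at the base score of (1 - damping) and gives the page they both point to a higher score.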
+ 
+ The differences are:
+ 
+  1. The Loops class can be used to identify and remove spam/problem
+     links.  This class was supposed to identify reciprocal links and
+     link cycles and then allow those links to be removed.  The problem
+     is that the class is very expensive computationally.  You can set
+     the depth you want it to run to, but the cost is worse than
+     exponential, so I wouldn't use a depth of more than 1-3, if at all.
+     That will get you reciprocal links and small link cycles
+     (a->b->c->a).  Really this doesn't add much to the score in the
+     end; I would just leave it off and not run this job.
+  2. You can limit duplicate links from pages and domains.  Say page A
+     points to B twice, you can limit it and only count it once.
+  3. There is a damping factor, which is set to 0.85 by default.  This
+     is the same as in the original !PageRank paper.  It is configurable
+     with the link.analyze.damping.factor parameter.
+  4. !LinkRank runs a given number of iterations.  Ideally the job would
+     iterate until the scores converge; currently it runs a set number
+     of iterations.
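
The link filtering described above (ignoring internal links, and counting a duplicate link only once as in difference 2) can be sketched roughly as follows. This is only an illustration, not the !WebGraph code: the function name `filter_links` and its parameters are hypothetical, and it only compares hosts. Domain-level matching (which would treat news.google.com and www.google.com as internal to each other) is deliberately left out because it needs a domain-suffix list.

```python
from urllib.parse import urlsplit


def filter_links(links, ignore_internal_host=True, limit_duplicates=True):
    """Return the (source_url, target_url) pairs that would be counted.

    links -- iterable of (source_url, target_url) pairs
    """
    kept = []
    seen = set()
    for source, target in links:
        # Skip links whose source and target share the same host.
        if ignore_internal_host:
            if urlsplit(source).hostname == urlsplit(target).hostname:
                continue
        # Count a repeated source -> target link only once.
        if limit_duplicates:
            if (source, target) in seen:
                continue
            seen.add((source, target))
        kept.append((source, target))
    return kept
```

The filtered pairs could then be grouped by source page to form the outlink lists that a scoring step works on.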
+ 
+ !LinkRank scores should be equivalent (close enough) to !PageRank scores.  Some things to consider:
+ 
+  1. !PageRank is just one of over 200 signals that Google uses (if they
+     still use it) to determine relevancy.  Even if Google still uses
+     it, it has most likely changed.  Link analysis scores are good
+     global relevancy scores, but a link score does not a search engine
+     make today.  Oh how I wish it were that simple.  !LinkRank is a good
+     starting point, that's it.
+  2. This is only as good as the number of pages you have crawled.  The
+     larger your set of crawled segments, the better the scores get.
+  3. A link is a link; it is content agnostic.  If you crawl 100m pages
+     and run !LinkRank on them, you will see all the usual suspects
+     (Google, YouTube, Facebook), but you will also see things like the
+     Flash download.  To !LinkRank a link is a link; it doesn't care
+     whether the target is a viewable piece of content.
+