Posted to dev@nutch.apache.org by EM <em...@cpuedge.com> on 2005/10/12 21:32:57 UTC

suspicious outlink count

202443 Pages consumed: 130000 (at index 130000). Links fetched: 233386.
202443 Suspicious outlink count = 30442 for [http://www.dmoz.org/].
202444 Pages consumed: 135000 (at index 135000). Links fetched: 272315.

If maxoutlinks is already specified in the XML config, why does
Nutch bother counting anything over that again?

Re: suspicious outlink count

Posted by Piotr Kosiorowski <pk...@gmail.com>.
EM wrote:
> 202443 Pages consumed: 130000 (at index 130000). Links fetched: 233386.
> 202443 Suspicious outlink count = 30442 for [http://www.dmoz.org/].
> 202444 Pages consumed: 135000 (at index 135000). Links fetched: 272315.
> 
> If maxoutlinks is already specified in the XML config, why does
> Nutch bother counting anything over that again?

During PageRank computation Nutch retrieves all links from a given page
by MD5. If many pages share the same MD5, it retrieves the outlinks of
all of them. I have seen "bot traps" with large site structures in which
every page had exactly the same MD5 (once I had over a million identical
pages in my index, with different URLs from the same host). So in this
case we get the union of all such outlinks. In some situations a large
number of outlinks is not a problem (as in your case: all pages injected
from dmoz are outlinks from dmoz), but usually it indicates a problem in
your index, or at least a reason to look at it. So I decided to print a
warning in this case so that one can have a look at such a site.
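
To illustrate the effect described above, here is a minimal sketch in
Python. It is not Nutch code: the page records, the function names and
the threshold value are all assumptions made for the example. It only
shows how a union of outlinks keyed by MD5 can exceed whatever limit was
applied to each page individually.

```python
from collections import defaultdict

# Hypothetical threshold; the value Nutch actually uses is not stated in this thread.
SUSPICIOUS_OUTLINK_THRESHOLD = 10000


def outlinks_by_md5(pages):
    """Return a dict mapping each content MD5 to the union of outlinks
    of every page that has that MD5."""
    merged = defaultdict(set)
    for page in pages:
        merged[page["md5"]].update(page["outlinks"])
    return merged


def suspicious_hashes(pages, threshold=SUSPICIOUS_OUTLINK_THRESHOLD):
    """Return {md5: outlink_count} for every MD5 whose merged outlink
    set is larger than the threshold."""
    merged = outlinks_by_md5(pages)
    return {md5: len(links) for md5, links in merged.items()
            if len(links) > threshold}


if __name__ == "__main__":
    # Two pages with identical content (same MD5) but different URLs,
    # each holding only two outlinks, plus one unrelated page.
    pages = [
        {"url": "http://trap.example/a", "md5": "abc",
         "outlinks": ["http://x.example/1", "http://x.example/2"]},
        {"url": "http://trap.example/b", "md5": "abc",
         "outlinks": ["http://x.example/3", "http://x.example/4"]},
        {"url": "http://ok.example/", "md5": "def",
         "outlinks": ["http://x.example/1"]},
    ]
    for md5, count in suspicious_hashes(pages, threshold=3).items():
        print("Suspicious outlink count = %d for MD5 %s" % (count, md5))
```

In the sample data no single page has more than two outlinks, yet the
merged set for the shared MD5 has four, so it is reported once the
threshold is set to three.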
Regards
Piotr