Posted to dev@nutch.apache.org by EM <em...@cpuedge.com> on 2005/10/12 21:32:57 UTC
suspicious outlink count
202443 Pages consumed: 130000 (at index 130000). Links fetched: 233386.
202443 Suspicious outlink count = 30442 for [http://www.dmoz.org/].
202444 Pages consumed: 135000 (at index 135000). Links fetched: 272315.
If there is maxoutlinks already specified in the xml config, why does
nutch bother counting anything over that again?
Re: suspicious outlink count
Posted by Piotr Kosiorowski <pk...@gmail.com>.
EM wrote:
> 202443 Pages consumed: 130000 (at index 130000). Links fetched: 233386.
> 202443 Suspicious outlink count = 30442 for [http://www.dmoz.org/].
> 202444 Pages consumed: 135000 (at index 135000). Links fetched: 272315.
>
> If there is maxoutlinks already specified in the xml config, why does
> nutch bother counting anything over that again?
During PageRank computation Nutch retrieves all links from a given page
by MD5. If we have many pages with the same MD5, it can retrieve all
outlinks from these pages. I have seen some "bot traps" with big site
structures in which every page had exactly the same MD5 (once I had over
a million identical pages in my index, with different URLs from the same
host). So in this case we get the union of all such outlinks. In some
situations a large number of outlinks is not a problem (as in your case:
all pages injected from dmoz are outlinks from dmoz), but usually it
indicates a problem in your index, or is at least a reason to look at
it. So I decided to print a warning in this case so that one can have a
look at such a site.
Regards
Piotr
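
The effect described above can be illustrated with a small sketch. This is
not the Nutch code: the function names, the tuple layout, the example URLs
and the threshold value are all assumptions made for illustration. It only
shows how looking up outlinks by content MD5 yields the union of the
outlinks of every page sharing that MD5, and how a count above some limit
could be flagged.

```python
from collections import defaultdict

# Hypothetical limit; the value Nutch actually uses is not stated in the thread.
SUSPICIOUS_OUTLINK_THRESHOLD = 10_000


def outlinks_by_md5(pages):
    """Union the outlinks of all pages that share the same content MD5.

    pages: iterable of (url, md5, outlinks) tuples (an assumed layout).
    Returns a dict mapping md5 -> set of outlink URLs.
    """
    merged = defaultdict(set)
    for _url, md5, outlinks in pages:
        merged[md5].update(outlinks)
    return dict(merged)


def find_suspicious(pages, threshold=SUSPICIOUS_OUTLINK_THRESHOLD):
    """Return {md5: outlink_count} for every MD5 whose merged count exceeds threshold."""
    merged = outlinks_by_md5(pages)
    return {md5: len(links) for md5, links in merged.items() if len(links) > threshold}


# Two pages with identical content (same MD5) but different outlinks,
# as a "bot trap" might produce, plus one unrelated page.
pages = [
    ("http://trap.example/a", "md5-1", ["http://trap.example/x1", "http://trap.example/x2"]),
    ("http://trap.example/b", "md5-1", ["http://trap.example/x2", "http://trap.example/x3"]),
    ("http://other.example/", "md5-2", ["http://other.example/about"]),
]

for md5, count in find_suspicious(pages, threshold=2).items():
    print(f"Suspicious outlink count = {count} for MD5 [{md5}].")
```

With a threshold of 2, only the MD5 shared by the two trap pages is
reported, because its merged outlink set is larger than that of either
page taken alone.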