Posted to dev@nutch.apache.org by Siddhartha Reddy <si...@grok.in> on 2008/05/27 08:55:19 UTC

IRLbot: Scaling to 6 Billion Pages and Beyond

Hello,

I thought the following might be of considerable interest to the Nutch
community. If you have already come across it, please excuse this email.

http://www2008.org/papers/pdf/p427-leeA.pdf

This is a paper published at the WWW 2008 conference, where it won the
Best Paper award.

*Abstract:*
This paper shares our experience in designing a web crawler that can
download billions of pages using a single-server implementation and models
its performance. We show that with the quadratically increasing complexity
of verifying URL uniqueness, BFS crawl order, and fixed per-host
rate-limiting, current crawling algorithms cannot effectively cope with the
sheer volume of URLs generated in large crawls, highly-branching spam,
legitimate multi-million-page blog sites, and infinite loops created by
server-side scripts. We offer a set of techniques for dealing with these
issues and test their performance in an implementation we call IRLbot. In
our recent experiment that lasted 41 days, IRLbot running on a single server
successfully crawled 6.3 billion valid HTML pages (7.6 billion connection
requests) and sustained an average download rate of 319 mb/s (1,789
pages/s). Unlike our prior experiments with algorithms proposed in related
work, this version of IRLbot did not experience any bottlenecks and
successfully handled content from over 117 million hosts, parsed out 394
billion links, and discovered a subset of the web graph with 41 billion
unique nodes.
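
To make the URL-uniqueness point concrete for us: the usual approach is an
in-memory "seen URL" set, which works for modest crawls but whose memory
footprint grows with every unique URL discovered, so at billions of URLs
the lookups themselves become the bottleneck. Below is a minimal Java
sketch of that naive check (my own illustration, not code from the paper
or from Nutch):

    import java.util.HashSet;
    import java.util.Set;

    /**
     * Naive duplicate-URL check: fine for small crawls, but the set grows
     * with every unique URL discovered, so at billions of URLs it no
     * longer fits in RAM and each lookup degrades into random disk access.
     * (Illustration only -- not the paper's data structure.)
     */
    public class NaiveUrlSeenFilter {

        private final Set<String> seen = new HashSet<String>();

        /** Returns true if the URL is new and should be queued for fetching. */
        public synchronized boolean addIfUnseen(String url) {
            return seen.add(url);
        }

        public static void main(String[] args) {
            NaiveUrlSeenFilter filter = new NaiveUrlSeenFilter();
            System.out.println(filter.addIfUnseen("http://example.com/")); // true
            System.out.println(filter.addIfUnseen("http://example.com/")); // false
        }
    }

As I understand it, the paper's answer to this is a disk-based structure
(DRUM) that buffers newly discovered URLs and checks them against an
on-disk repository in large sequential batches instead of one random
lookup per URL, which is what keeps the check cheap at the 41-billion-node
scale mentioned above.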

Regards,
Siddhartha