Posted to user@nutch.apache.org by "Ryan L. Sun" <li...@gmail.com> on 2012/08/13 20:55:38 UTC

WWW wide crawling using nutch

Hi all,

I'm looking for estimates or statistics on WWW-wide crawling using
Nutch (or crawling 10-20% of the WWW). What kind of hardware do you
need, and how long does it take to finish one round of search?

TIA.

RE: WWW wide crawling using nutch

Posted by Markus Jelsma <ma...@openindex.io>.
Hi,

I won't try to estimate the size of the public internet, but I may have some useful figures. A standard dual-core machine with 2 GB RAM can process about 15 records per second in ideal conditions with a parsing fetcher and without storing content. But this doesn't include indexing time, webgraph building, or linkrank calculation, so we could achieve only about 10 records per second on average.

Another cluster, with 16 cores and 16 GB RAM per machine, gives much better results, so not only more hardware but also more powerful hardware helps. With it we could, in ideal conditions, fetch and parse about 500 records per second per machine. When the other jobs are taken into account, that drops to an average of 300 records per second per machine.

Under normal conditions it is between 150 and 250. With these figures you would have only a fraction of the internet after a year, and that is without revisiting pages, even if you have a hundred powerful machines.
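
To make that arithmetic concrete, here is a rough back-of-the-envelope sketch in Python. It assumes the 150 to 250 figure is records per second per machine (like the figures before it), that machines run without downtime for a full year, and that no page is ever fetched twice. The function name is made up for illustration and is not part of Nutch.

```python
# Back-of-the-envelope yearly crawl volume.
# Assumptions (not from Nutch itself): rates are per machine,
# machines never stop, and no page is revisited.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def pages_per_year(machines, records_per_second_per_machine):
    """Total records fetched in one year at a constant rate."""
    return machines * records_per_second_per_machine * SECONDS_PER_YEAR


if __name__ == "__main__":
    for rate in (150, 250):
        total = pages_per_year(100, rate)
        print(f"100 machines at {rate} records/s each: {total:,} records/year")
```

This is an upper bound under those assumptions; re-fetching pages, failed fetches and the junk described below all eat into it.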

It's also impossible to do with a standard Nutch, as you will quickly run into a lot of trouble with useless pages and crawler traps. Another very significant problem is duplicate websites, such as www and non-www pages, but these duplicates come in many more exotic varieties. You also have to manage extremely large blacklists (many millions) of dead hosts. You need to prevent those from polluting your CrawlDB; the number of dead URLs can quickly grow very large.
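
As a minimal illustration of two of these problems (www/non-www duplicates and a dead-host blacklist), the following Python sketch filters URLs before they would be added to a crawl database. It is not Nutch code and does not use Nutch's plugin API; `DEAD_HOSTS`, `canonical_host`, `normalize` and `keep` are hypothetical names, and stripping a leading "www." assumes both hosts serve the same content, which is not always true.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical blacklist; in practice this would hold millions of hosts.
DEAD_HOSTS = {"dead.example.com"}


def canonical_host(host):
    """Lowercase the host and drop a leading 'www.' (an assumption)."""
    host = host.lower().rstrip(".")
    return host[4:] if host.startswith("www.") else host


def normalize(url):
    """Return a canonical form of the URL (port and fragment are dropped)."""
    parts = urlsplit(url)
    host = canonical_host(parts.hostname or "")
    return urlunsplit((parts.scheme.lower(), host, parts.path or "/", parts.query, ""))


def keep(url, seen):
    """True if the URL is neither on a dead host nor an already seen duplicate."""
    host = canonical_host(urlsplit(url).hostname or "")
    if host in DEAD_HOSTS:
        return False
    key = normalize(url)
    if key in seen:
        return False
    seen.add(key)
    return True


if __name__ == "__main__":
    seen = set()
    for u in ("http://www.example.com/a", "http://example.com/a", "http://dead.example.com/"):
        print(u, keep(u, seen))
```

This only covers the simplest variety of duplicate; the more exotic ones need content-based or host-level deduplication.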

Crawling the internet means managing a lot of crap.

Good luck
Markus
 