Posted to user@nutch.apache.org by Julien Nioche <li...@gmail.com> on 2012/12/17 21:53:42 UTC

Comparing Nutch and Common Crawl

Hi,

I was chatting with the people from the Common Crawl project
(www.commoncrawl.org) this afternoon and we thought it would be interesting
to have some sort of comparison between the space / memory / CPU
requirements of their crawler and what it would take to process a similar
amount of data with Nutch 1.x and 2.x. The aim is not so much to prove that
one system is superior to the other (they both have their pluses and
minuses) but to get a better picture of the situation.

One way to do this would be to gather stats from Nutch users operating
large crawls. Alternatively, one could push the content of the CC dataset
into, e.g., Nutch 2 on HBase to see how much space it would take and how
the crawl would fare on that. I am pretty sure that would reveal all sorts
of interesting issues and would be a good thing to do to test the Nutch +
Gora stack.
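
To make the second option a bit more concrete, here is a minimal, untested
sketch of how one might prepare the CC data for injection: flatten the
dataset to one URL per line and cut it into seed files for the injector.
The input file name, output directory and batch size below are made up for
the example.

    # Hypothetical helper, not part of Nutch: split a flat URL dump from the
    # CC dataset into seed files for bin/nutch inject. Names and sizes are
    # arbitrary placeholders.
    import os

    INPUT = "cc-urls.txt"   # assumed: one URL per line, extracted from the CC data
    OUTDIR = "seeds"        # directory handed to the injector afterwards
    BATCH = 1000000         # URLs per seed file; pick whatever suits the cluster

    if not os.path.isdir(OUTDIR):
        os.makedirs(OUTDIR)
    written, part, out = 0, 0, None
    with open(INPUT) as urls:
        for line in urls:
            url = line.strip()
            if not url.startswith("http"):
                continue                     # skip blanks and junk records
            if out is None or written % BATCH == 0:
                if out:
                    out.close()
                part += 1
                out = open(os.path.join(OUTDIR, "seed-%04d.txt" % part), "w")
            out.write(url + "\n")
            written += 1
    if out:
        out.close()
    print("wrote %d URLs into %d seed files" % (written, part))

After that, something along the lines of 'bin/nutch inject crawl/crawldb
seeds/' (1.x) or 'bin/nutch inject seeds/' (2.x with the Gora/HBase store
configured) would populate the crawldb / webpage table, and readdb -stats
plus a look at the HDFS / HBase disk usage would give the space numbers.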

Would anyone be interested in sharing their stats? Anyone with the spare
time and machines to populate a crawldb with the CC dataset and get some
stats?

Thanks

Julien

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: Comparing Nutch and Common Crawl

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Julien,

I've been winding down my Nutch servers these last few weeks in
preparation for moving away.
I would, however, be very interested in stepping up to provide some stats
come the new year. I don't know over what duration you think this should
be carried out, but I am very keen to participate when I can, come early
January.

Best

Lewis




-- 
Lewis

RE: Comparing Nutch and Common Crawl

Posted by Markus Jelsma <ma...@openindex.io>.
Hi,

Interesting indeed. Apart from our customers, we operate a cluster of a few
high-octane machines for research purposes that crawls the entire internet
as much as it physically can. We run a modified Nutch 1.x plus some custom
jobs that analyze the crawled data and allow us to crawl the internet more
efficiently. The cluster is far too small to read through all the data
quickly: with only 80 GB of RAM and 80 CPU cores, reading the ~760 GB
crawldb, which contains about 5.7 billion records, takes roughly 40
minutes. Compiling a webgraph and calculating PageRank takes about 32
hours. Fetching and parsing are less intensive; at peak efficiency we can
process over 700 pages per second, including the time for the reduce phase
and job setup and cleanup.
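
For a rough sense of scale, a back-of-the-envelope calculation using only
the figures above (and treating GB as 10^9 bytes, so the results are
approximate):

    # Rough arithmetic on the cluster figures quoted above; GB taken as 10**9 bytes.
    crawldb_bytes = 760e9      # ~760 GB crawldb
    records       = 5.7e9      # ~5.7 billion records
    read_minutes  = 40.0       # time for a full crawldb read
    cores         = 80
    pages_per_sec = 700.0      # peak fetch + parse throughput

    read_seconds = read_minutes * 60
    print("bytes per crawldb record  : %.0f" % (crawldb_bytes / records))
    print("aggregate read throughput : %.0f MB/s" % (crawldb_bytes / read_seconds / 1e6))
    print("read throughput per core  : %.1f MB/s" % (crawldb_bytes / read_seconds / cores / 1e6))
    print("pages fetched per day     : %.1f million" % (pages_per_sec * 86400 / 1e6))

At that peak rate the cluster fetches on the order of 60 million pages a
day.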

We can provide all the information you would like to have.

Cheers. 
 