
Posted to user@nutch.apache.org by zhengping deng <de...@hotmail.com> on 2008/09/11 16:54:20 UTC

how to improve nutch crawl speed?

hi,

I am running nutch-0.8.1 on 4 machines (HP DL380 G4, 2c/4G) with 1 namenode and 3 datanodes. I am using them to crawl the Internet, but I find it too slow. After nearly 8 hours of crawling, the crawl directory on HDFS is only 1.1G. If I crawl with Nutch on a single machine, it can be more than 10G. If I use larbin, it can be more than 100G on one of my machines.

What is the problem? How can I improve the speed of Nutch in distributed mode? Can nutch/hadoop really be competitive in SE?

My main config is below:

    hadoop-site.xml:
        mapred.map.tasks          15
        mapred.reduce.tasks       15
        dfs.replication           2

    nutch-site.xml:
        fetcher.threads.fetch     500
        fetcher.threads.per.host  20
        parser.threads.parse      500

Thank you for all your help.

Mark Deng


RE: how to improve nutch crawl speed?

Posted by Edward Quick <ed...@hotmail.com>.


I'm not an expert on Nutch, but if this is the same problem I had, try setting fetcher.server.delay to 0.1 in nutch-site.xml.

Ed.
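
For reference, the suggestion above might look roughly like the following in nutch-site.xml. This is only a sketch: it assumes the usual Hadoop-style property layout used by Nutch config files, and the description text is an added note, not taken from the thread. The value is in seconds, and whether 0.1 is appropriate depends on how polite the crawl needs to be toward the sites being fetched.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Delay, in seconds, between successive requests to the same server.
       0.1 is the value suggested in this thread; a very low delay puts
       more load on the remote hosts. -->
  <property>
    <name>fetcher.server.delay</name>
    <value>0.1</value>
    <description>Illustrative override of the per-server fetch delay.</description>
  </property>
</configuration>
```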

