Posted to user@nutch.apache.org by Hrishikesh Agashe <hr...@persistent.co.in> on 2009/07/16 15:11:29 UTC

Nutch download speed

Hi,

We are running Nutch 1.0 on a 4-VM Hadoop cluster (each VM: 2 CPU, quad core, 3 GB RAM) located in a data center, with data storage on a common NAS. The bandwidth available to us is 150 Mb/sec. A theoretical calculation tells me that I can download 1.62 TB of data per day: (150 * 60 * 60 * 24) / 8 = 1620000 MB.

Now my aim is to tune Nutch to get as close as possible to this figure.
I have experimented a lot with different Nutch params (num of maps=17, num of reduces=7, num of threads=800, fairness=3 sec, etc.), but the most I could get is 3 GB/hour, which is 72 GB per day, far less than 1.62 TB. I have set filters to download only HTML text: no images, no videos, etc.
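
For reference, the arithmetic above can be checked with a short Python sketch. It only restates the figures from this message (150 Mb/sec link, 3 GB/hour observed) and assumes decimal units (1 GB = 1000 MB); the function names are made up for illustration.

```python
SECONDS_PER_DAY = 60 * 60 * 24


def theoretical_mb_per_day(link_mbit_per_sec: float) -> float:
    """Upper bound on daily volume (MB) for a link rated in megabits/sec."""
    return link_mbit_per_sec * SECONDS_PER_DAY / 8  # 8 bits per byte


def observed_mbit_per_sec(gb_per_hour: float) -> float:
    """Convert an observed fetch rate in GB/hour to megabits/sec."""
    return gb_per_hour * 1000 * 8 / 3600


theoretical = theoretical_mb_per_day(150)   # 1,620,000 MB, i.e. 1.62 TB
observed_per_day = 3 * 24 * 1000            # 3 GB/hour -> 72,000 MB/day
utilization = observed_per_day / theoretical

print(f"theoretical: {theoretical:.0f} MB/day")
print(f"observed:    {observed_per_day} MB/day "
      f"({observed_mbit_per_sec(3):.2f} Mb/sec)")
print(f"utilization: {utilization:.1%}")
```

In other words, the crawl is using under 7 Mb/sec of the 150 Mb/sec link, roughly 4% of the theoretical maximum.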

So I wanted to know: which params contribute to the speed of Nutch data download?
Am I missing something very obvious? Are there too few machines? Is the hardware configuration not powerful enough?

TIA,
-Hrishi

DISCLAIMER
==========
This e-mail may contain privileged and confidential information which is the property of Persistent Systems Ltd. It is intended only for the use of the individual or entity to which it is addressed. If you are not the intended recipient, you are not authorized to read, retain, copy, print, distribute or use this message. If you have received this communication in error, please notify the sender and delete all copies of this message. Persistent Systems Ltd. does not accept any liability for virus infected mails.

Re: Nutch download speed

Posted by Doğacan Güney <do...@gmail.com>.
On Thu, Jul 16, 2009 at 16:11, Hrishikesh Agashe <hr...@persistent.co.in> wrote:
> Hi,
>
> We are running Nutch 1.0 on a 4-VM Hadoop cluster (each VM: 2 CPU, quad core, 3 GB RAM) located in a data center, with data storage on a common NAS. The bandwidth available to us is 150 Mb/sec. A theoretical calculation tells me that I can download 1.62 TB of data per day: (150 * 60 * 60 * 24) / 8 = 1620000 MB.
>
> Now my aim is to tune Nutch to get as close as possible to this figure.
> I have experimented a lot with different Nutch params (num of maps=17, num of reduces=7, num of threads=800, fairness=3 sec, etc.), but the most I could get is 3 GB/hour, which is 72 GB per day, far less than 1.62 TB. I have set filters to download only HTML text: no images, no videos, etc.
>
> So I wanted to know: which params contribute to the speed of Nutch data download?
> Am I missing something very obvious? Are there too few machines? Is the hardware configuration not powerful enough?
>

What are you downloading? Remember that Nutch waits between successive
requests to the same host, so you may simply be running out of hosts to
fetch (in which case the fetcher just waits).
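
To make that concrete, here is a rough back-of-the-envelope sketch in Python. It assumes the "fairness=3 sec" setting above is the per-host delay, that only one request per host is in flight at a time, and that an average HTML page is 20 KB; the page size is a made-up illustrative number, not something measured on this crawl, and the function names are hypothetical.

```python
import math


def max_pages_per_sec(num_hosts: int, delay_sec: float) -> float:
    """Politeness ceiling: each host yields at most one page per delay."""
    return num_hosts / delay_sec


def hosts_needed(target_mb_per_sec: float, avg_page_kb: float,
                 delay_sec: float) -> int:
    """Distinct hosts that must be fetchable at once to hit a target rate."""
    pages_per_sec = target_mb_per_sec * 1000 / avg_page_kb
    return math.ceil(pages_per_sec * delay_sec)


link_mb_per_sec = 150 / 8   # 150 Mb/sec link = 18.75 MB/sec
avg_page_kb = 20            # ASSUMPTION: illustrative average page size
delay_sec = 3               # ASSUMPTION: "fairness" is the per-host delay

needed = hosts_needed(link_mb_per_sec, avg_page_kb, delay_sec)
print(f"hosts needed to saturate the link: {needed}")
print(f"ceiling with 100 hosts: {max_pages_per_sec(100, delay_sec):.1f} pages/sec")
```

Under these assumptions, a fetch list would need a few thousand distinct hosts available at any moment to fill the link; with fewer hosts, adding threads does not help, because the extra threads only wait out the delay.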

However, several people suggested that Fetcher in Nutch 1.0 _is_ slower:

https://issues.apache.org/jira/browse/NUTCH-721

My recommendation would be to try the OldFetcher class in trunk to see
if it makes a difference.




-- 
Doğacan Güney