Posted to user@nutch.apache.org by Guilherme Menezes <gu...@gmail.com> on 2008/09/23 23:33:20 UTC

Cluster size question

Hi everyone,

Our research group is planning to set up a cluster capable of crawling
around 1 billion unique Web pages (the estimated size of the Brazilian Web)
for academic purposes, possibly using Nutch. We currently have 4 boxes (each
with 16 GB of RAM, 6 * 750 GB disks on 3 controllers, and a quad-core AMD
Opteron processor), and we are considering buying more nodes. We have some
questions that some of you may be able to help with:

1) Is it better to buy less powerful nodes in order to have more nodes and
more parallelism, or is it better to have a smaller number of nodes
equivalent to the ones we currently have? I guess just 1 disk per controller
would help. I am also not sure whether 16 GB of RAM is necessary, or whether
a quad-core is needed; maybe a dual-core would be sufficient. In your
experience, where is it best to spend money: RAM, disk, processing power,
more nodes, or everything?

2) How many nodes would be necessary to crawl 1 billion pages in about
1 month? Have you had any similar experiences? How many nodes did
you use?
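
To put rough numbers on question 2, here is a back-of-the-envelope sketch in
Python. It only restates the figures above (1 billion pages, about 1 month,
4 boxes with 6 * 750 GB disks each). The 30-day month, the candidate node
counts, and the replication factor of 3 are assumptions made for
illustration; the function names are hypothetical and nothing here is
measured Nutch performance.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def required_pages_per_second(total_pages, days):
    """Sustained cluster-wide fetch rate needed to finish in `days`."""
    return total_pages / (days * SECONDS_PER_DAY)


def disk_bytes_per_page(nodes, disks_per_node, disk_gb, total_pages,
                        replication=1):
    """Raw disk budget per page, divided by the number of stored copies."""
    total_bytes = nodes * disks_per_node * disk_gb * 10**9
    return total_bytes / (total_pages * replication)


total_pages = 1_000_000_000
days = 30  # assumption: "about 1 month" taken as 30 days

rate = required_pages_per_second(total_pages, days)
print(f"cluster-wide rate needed: {rate:.1f} pages/s")

# Hypothetical node counts, only to show how the per-node rate scales.
for nodes in (4, 8, 16, 32):
    print(f"{nodes:>2} nodes -> {rate / nodes:.1f} pages/s per node")

# Disk budget on the 4 existing boxes (6 * 750 GB each).
raw_budget = disk_bytes_per_page(4, 6, 750, total_pages)
# Assumption: data is kept in 3 copies (e.g. a replication factor of 3).
replicated_budget = disk_bytes_per_page(4, 6, 750, total_pages, replication=3)
print(f"raw disk per page: {raw_budget:.0f} bytes")
print(f"with 3 copies:     {replicated_budget:.0f} bytes")
```

Under these assumptions the cluster has to sustain roughly 386 pages per
second for the whole month, and the 4 existing boxes offer about 18 KB of raw
disk per page before any replication, intermediate data, or indexes are
counted.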

Thank you for any help! We are very interested in understanding Nutch and
collaborating in the future.

Re: Cluster size question

Posted by Guilherme Menezes <gu...@gmail.com>.
Had problems sending, resending.
