Posted to user@nutch.apache.org by Paul Stewart <ps...@nexicomgroup.net> on 2007/11/29 03:38:58 UTC

Hardware Planning

Hi folks...

I have read the archives and am looking for input specific to my estimated
requirements:

I want to index about 100 million public webpages.  Space and bandwidth
are not a problem - coming up with the right hardware while keeping the
cost down is my goal.

I would estimate only 1-2 searches per second, at least during the first
hardware phase.

With that in mind, I'm trying to figure out whether to use a couple of
larger Dell servers or a bunch of small machines with a single CPU, 1 GB
of RAM, and a 160 GB hard drive each....

Can anyone share what they are using for hardware for about 100 million
webpages, and their search result times?  Real-world experience is
important to me, and so is being able to scale....

Thanks,

Paul

Re: Hardware Planning

Posted by v k <vk...@gmail.com>.
Have you considered EC2 + S3?
Also check out Rightscale.


RE: Hardware Planning

Posted by Paul Stewart <ps...@nexicomgroup.net>.
Thanks very much for the details... I appreciate it...

I'd be happy with the 500ms range on *average*, but I totally understand
your point about searches "piling up"....

So you're suggesting about 20 million pages per box - each box with 4
drives, dual CPUs, and 4 GB of RAM?

I guess what I don't totally understand is which servers need lots of RAM
and which ones need all the storage.  I was thinking of some low-end
boxes (2 GB RAM, 160 GB HD, a single low-end processor) for storage and a
couple of heftier boxes (dual CPUs, 4 GB RAM, 500 GB hard drives) - is
this way off track?

What needs RAM and what needs storage among the components of Nutch?

Thanks again,

Paul



RE: Hardware Planning

Posted by Ken Krugler <kk...@transpac.com>.
Hi Paul,

Leaving aside the hardware requirements for the crawl...

The main issue with what you need to achieve your goal is the nature of
your index. If you're serving the results of a standard Nutch web
crawl, then search times under 500ms shouldn't be a problem.

But you actually want something more in the range of, say, 200ms on
average, as otherwise you can quickly run into the overlapping-search
problem: once a search doesn't complete before the next search starts
running, both searches take longer, which increases the odds that a
third search arrives before the previous search(es) have completed. So
performance can deteriorate quickly under a load that's only slightly
higher than your target case.
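
To see how sharp that cliff is, here is a toy single-server queue
simulation (a minimal sketch of my own, assuming Poisson arrivals and a
fixed per-query service time; nothing Nutch-specific). At 2 queries per
second, a 200ms search leaves the box under half busy, while a 500ms
search puts it at 100% utilization and the mean response time keeps
climbing the longer the load persists:

    import random

    def simulate(arrival_rate_qps, service_time_s, n_queries=10000, seed=42):
        # One search server with a FIFO queue: Poisson arrivals, fixed
        # service time. Returns mean response time (wait + service) in seconds.
        rng = random.Random(seed)
        t_arrive = 0.0   # arrival time of the current query
        t_free = 0.0     # when the server next becomes idle
        total = 0.0
        for _ in range(n_queries):
            t_arrive += rng.expovariate(arrival_rate_qps)
            start = max(t_arrive, t_free)    # wait if a search is still running
            t_free = start + service_time_s
            total += t_free - t_arrive
        return total / n_queries

    for s in (0.2, 0.4, 0.5):
        print("service=%.0fms  mean response=%.0fms"
              % (s * 1000, simulate(2.0, s) * 1000))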

However, getting 200ms search times isn't hard either, as long as the
hardware is reasonable and the index size isn't huge.

In our experience, using more, cheaper boxes is the way to go. For
web crawl data, I would probably go with two 10M-page indexes per
box, where each Lucene index goes on a smaller, faster drive and the
corresponding page contents go on a bigger, slower drive. So you'd have
two faster drives and two slower drives per box, dual CPUs with dual
cores, and 4GB of RAM, so each JVM gets 1.5GB with some breathing room
left for the OS.

That means you'd need about five of these servers for 100M
pages...unless you want replication for reliability, in which case
you'd need 10 servers.
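
As a back-of-envelope check on that box count, a minimal sketch under
the assumptions stated above (10M pages per index, two indexes per box,
optional 2x replication):

    # Sizing assumptions taken from the paragraph above; adjust to taste.
    total_pages     = 100_000_000
    pages_per_index = 10_000_000
    indexes_per_box = 2

    boxes = total_pages // (pages_per_index * indexes_per_box)   # -> 5
    print(boxes, "boxes without replication;", boxes * 2, "with 2x replication")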

-- Ken


-- 
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"

RE: Hardware Planning

Posted by Paul Stewart <ps...@nexicomgroup.net>.
No, not familiar with that yet - can you send out any URLs?

My question is really whether you're better off trying one or two big
boxes or a series of small boxes - I'm also looking for anyone who has
100 million pages in their index and a description of their hardware as
a reference point...

Thanks!

Paul



Re: Hardware Planning

Posted by VK <vk...@gmail.com>.
Have you considered EC2 + S3?

Also Rightscale has some interesting solutions, which I am currently
evaluating.
