Posted to user@nutch.apache.org by Philippe LE NAOUR <ph...@le-naour.com> on 2005/05/20 18:27:53 UTC

Hardware requirements and some other questions about Nutch

Hi,

I'm new to this list.

I have some questions about Nutch to see if it suits my needs.

First of all, I have a database of 50 000 URLs classified by
categories and sub-categories, and I want to fully crawl the 50 000 sites
behind those URLs. Providing the URLs to Nutch is no problem.

I want to use the category information in searches to restrict
results; for example, a user could search for all sites that contain "cat"
in the "pet" category. Is this possible with Nutch? I've seen that I can
add plugins; could it be done with a plugin?


Second part: hardware requirements.

Let's say each website has a maximum of 1000 pages, so I must store
the index for up to 50 000 000 pages. How much disk storage do I need?
I've seen that Mozdex uses 10 servers for 100 000 000 pages, but I
don't know how many requests it serves. Is there anything I can do to
reduce the number of servers?

Thanks for your replies.

PS: sorry for my very bad English.

Re: Hardware requirements and some other questions about Nutch

Posted by Byron Miller <By...@compaid.com>.
Actually, at mozdex we have consolidated a bit and we are rebuilding under
the latest release. For 50 million URLs, a 200 GB disk is all you need.
That leaves you enough room for your segments, db and the space needed to
process (about double your db size).
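
As a rough sanity check on those figures, the sketch below works out the
per-page disk budget that 200 GB implies for 50 million pages. It assumes
decimal gigabytes (1 GB = 10^9 bytes) and takes the page count from the
original question (50 000 sites at up to 1000 pages each). The function
names are hypothetical, and the space formula is only one reading of
"about double your db size"; none of this is a Nutch measurement.

```python
def implied_bytes_per_page(disk_gb: float, pages: int) -> float:
    """Average disk budget per page, assuming 1 GB = 10**9 bytes."""
    return disk_gb * 1e9 / pages


def disk_needed_gb(segments_gb: float, db_gb: float, work_factor: float = 2.0) -> float:
    """Segments + db + working space, with working space taken as
    work_factor * db size (an assumed reading of "about double your db size")."""
    return segments_gb + db_gb + work_factor * db_gb


# Page count from the question: 50 000 sites, at most 1000 pages each.
pages = 50_000 * 1_000

# Budget per page if everything has to fit on a 200 GB disk.
budget = implied_bytes_per_page(200, pages)
print(f"{pages} pages -> {budget:.0f} bytes per page")
```

In other words, the 200 GB figure allows roughly 4 KB of disk per page on
average, covering segments, db and working space together.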

The biggest boost you can give your query servers is tons of memory. SATA
150 or SCSI drives at 10k rpm are also a bonus.

We have finished migrating entirely to Athlon 64s, and I'll be posting our
build on the site and wiki.

-byron
