Posted to common-user@hadoop.apache.org by Dan Segel <sa...@glowmania.net> on 2008/06/05 22:27:15 UTC

Gigablast.com search engine- 10BILLION PAGES!

Our ultimate goal is essentially to replicate the gigablast.com search engine.  They claim to have fewer than 500 servers holding 10 billion pages that are indexed, spidered and updated on a routine basis.  I am looking at indexing 500 million pages per node, with a total of 20 nodes.  Each node will have 2 quad-core processors, 4TB of disk (RAID 5) and 32GB of RAM.  I believe this can be done, but how many searches per second do you think would be realistic in this setup?  We are ultimately aiming for roughly 25 searches per second spread across the 20 nodes.  I could really use some advice on this one.

Thanks,
D. Segel

Re: Gigablast.com search engine- 10BILLION PAGES!

Posted by Ted Dunning <te...@gmail.com>.
Web-scale, web-speed search almost always means memory-based search.

500 Mpages in 25GB of memory means that you have 50 bytes per document
available.  This is very small.  Conceivable for some applications, but not
likely if you want to have high quality search.
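
The 50-byte figure is plain division, and a rough sketch of it follows. The
function name and the use of decimal gigabytes (10^9 bytes) are assumptions
made for illustration; the 25GB and 500 Mpages inputs are the figures above.

```python
GB = 10**9  # assumption: decimal gigabytes, which is what yields exactly 50


def bytes_per_document(memory_bytes: int, documents: int) -> float:
    """Average index memory available per document (hypothetical helper)."""
    return memory_bytes / documents


# 25GB of index memory shared by 500 million pages on one node
per_doc = bytes_per_document(25 * GB, 500 * 10**6)
print(per_doc)  # 50.0
```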

25 queries per second against an index of that memory size (not that
document count) seems very doable.  Possibly even easy.  You should be able
to do this with something like Solr.

I think you need to budget no more than 100 Mpages per node (and that might
be ambitious).
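
To see what that budget implies for the cluster, here is a rough sketch. The
helper name is hypothetical, and it assumes the same 25GB of index memory per
node as above and the 10 billion page target from the original question.

```python
GB = 10**9  # assumption: decimal gigabytes


def nodes_needed(total_pages: int, pages_per_node: int) -> int:
    """Smallest node count that holds total_pages (ceiling division)."""
    return -(-total_pages // pages_per_node)


TARGET_PAGES = 10 * 10**9      # 10 billion pages, from the original question
BUDGET_PER_NODE = 100 * 10**6  # suggested ceiling of 100 Mpages per node

print(nodes_needed(TARGET_PAGES, BUDGET_PER_NODE))  # 100 nodes, not 20
print(20 * BUDGET_PER_NODE)                         # 2000000000 pages on 20 nodes
print(25 * GB / BUDGET_PER_NODE)                    # 250.0 bytes per document
```

Under those assumptions, 20 nodes would cover about 2 billion pages, and the
full 10 billion would take on the order of 100 nodes.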

-- 
ted