Posted to solr-user@lucene.apache.org by Ed Smiley <es...@ebrary.com> on 2014/04/25 21:48:41 UTC
TB scale
Anyone with experience, suggestions or lessons learned in the 10-100 TB scale they'd like to share?
Researching optimum design for a SolrCloud setup with, say, about a 20TB index.
-
Thanks
Ed Smiley, Senior Software Architect, Ebooks
ProQuest | 161 Evelyn Ave. | Mountain View, CA 94041 USA | +1 640 475 8700 ext. 3772
ed.smiley@proquest.com
www.proquest.com | www.ebrary.com | www.eblib.com
ebrary and EBL, ProQuest businesses
Re: TB scale
Posted by Walter Underwood <wu...@wunderwood.org>.
I think Hathi Trust has a few terabytes of index. They do full-text search on 10 million books.
http://www.hathitrust.org/blogs/Large-scale-Search
wunder
On Apr 26, 2014, at 8:36 AM, Toke Eskildsen <te...@statsbiblioteket.dk> wrote:
>> Anyone with experience, suggestions or lessons learned in the 10 -100 TB scale they'd like to share?
>> Researching optimum design for a Solr Cloud with, say, about 20TB index.
>
> We're building a web archive with a projected index size of 20TB (distributed in 20 shards). Some test results and a short write-up at http://sbdevel.wordpress.com/2013/12/06/danish-webscale/ - feel free to ask for more details.
>
> tl;dr: We're saying to hell with RAM for caching and putting it all on SSDs on a single big machine. Results so far (some distributed tests with 200GB & 400GB indexes, some single tests with a production-index of 1TB) are very promising, both for plain keyword-search, grouping and faceting (DocValues rocks).
>
> - Toke Eskildsen
RE: TB scale
Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
> Anyone with experience, suggestions or lessons learned in the 10 -100 TB scale they'd like to share?
> Researching optimum design for a Solr Cloud with, say, about 20TB index.
We're building a web archive with a projected index size of 20TB (distributed in 20 shards). Some test results and a short write-up at http://sbdevel.wordpress.com/2013/12/06/danish-webscale/ - feel free to ask for more details.
tl;dr: We're saying to hell with RAM for caching and putting it all on SSDs on a single big machine. Results so far (some distributed tests with 200GB & 400GB indexes, some single tests with a production-index of 1TB) are very promising, both for plain keyword-search, grouping and faceting (DocValues rocks).
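As a back-of-the-envelope check (a hypothetical sketch, not tooling from this setup), the per-shard size falls straight out of those numbers:

```python
# Back-of-the-envelope shard sizing: a 20 TB index split evenly
# across 20 shards works out to about 1 TB per shard.
def per_shard_tb(total_index_tb: float, num_shards: int) -> float:
    """Average index size per shard, in TB, assuming an even split."""
    return total_index_tb / num_shards

print(per_shard_tb(20, 20))  # -> 1.0
```

That ~1 TB-per-shard figure matches the single-shard production test mentioned above.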
- Toke Eskildsen
Re: TB scale
Posted by Shawn Heisey <so...@elyograg.org>.
On 4/25/2014 1:48 PM, Ed Smiley wrote:
> Anyone with experience, suggestions or lessons learned in the 10 -100 TB scale they'd like to share?
> Researching optimum design for a Solr Cloud with, say, about 20TB index.
You've gotten some good information already in the replies that have
come your way. The following blog post is even more relevant (in the
"we don't know" department) for large indexes than it is for small indexes:
http://searchhub.org/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
My own index is nowhere near that size. It has 95 million records and
seven shards. A single copy is about 108GB and lives on two servers
that each have 64GB of RAM. I'm not running in SolrCloud mode.
The most important resource for Solr scalability is RAM. This includes
the Java heap on each server, as well as unallocated memory so the
operating system can cache the index data that lives on that server.
http://wiki.apache.org/solr/SolrPerformanceProblems
As the wiki page says, ideally you'd want as much RAM for the OS disk
cache as the index takes up on disk, but 40TB of RAM across all servers
just for the OS disk cache (in addition to whatever you need for the
java heap) is too expensive to contemplate. A 1:1 ratio is not an
absolute requirement, although it does produce the best results.
For that 40TB ideal figure, I am assuming that you mean a single replica
of your index would be 20TB, and that you'd have two.
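The arithmetic behind that figure can be sketched out (a hypothetical illustration of the 1:1 cache ratio described above, with made-up server counts):

```python
# Sketch of the 1:1 OS-disk-cache rule of thumb: ideal cache RAM
# equals the total index size on disk, summed across all replicas.
def ideal_cache_ram_tb(index_tb_per_replica: float, replicas: int) -> float:
    return index_tb_per_replica * replicas

def per_server_cache_gb(total_cache_tb: float, num_servers: int) -> float:
    """Cache RAM each server would need, in GB (1 TB = 1024 GB)."""
    return total_cache_tb * 1024 / num_servers

total = ideal_cache_ram_tb(20, 2)      # the 40 TB figure above
print(total)                           # -> 40.0
print(per_server_cache_gb(total, 40))  # hypothetical 40-server cluster -> 1024.0 GB each
```

Even spread across a (hypothetical) 40-server cluster, that is a terabyte of cache RAM per machine, which is why the 1:1 ratio is a ceiling to aim toward rather than a requirement.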
Doing everything you can to reduce the index size will go a long way
towards improving Solr performance. Having SSD in each server for the
index data would also help. If the query volume is high, a large number
of very fast CPU cores is also required.
Thanks,
Shawn
Re: TB scale
Posted by Jack Krupansky <ja...@basetechnology.com>.
Also take a look at DataStax Enterprise for managing large distributed
databases, using Cassandra as the system-of-record data store and Solr
for indexing and search.
See:
http://www.datastax.com/what-we-offer/products-services/datastax-enterprise
How many documents is your 20TB?
-- Jack Krupansky
-----Original Message-----
From: Ed Smiley
Sent: Friday, April 25, 2014 3:48 PM
To: solr-user@lucene.apache.org
Subject: TB scale
Anyone with experience, suggestions or lessons learned in the 10 -100 TB
scale they'd like to share?
Researching optimum design for a Solr Cloud with, say, about 20TB index.
-
Thanks
Ed Smiley, Senior Software Architect, Ebooks
ProQuest | 161 Evelyn Ave. | Mountain View, CA 94041 USA | +1 640 475 8700
ext. 3772
ed.smiley@proquest.com
www.proquest.com | www.ebrary.com | www.eblib.com
ebrary and EBL, ProQuest businesses
Re: TB scale
Posted by Yonik Seeley <yo...@heliosearch.com>.
How many documents? That can be just as important (often more
important) than total index size.
Some other details, like the types of requests, would be helpful (e.g.
what the index will be used for, the latency requirements of requests,
whether you will be faceting, etc.).
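To illustrate the point (figures here are purely hypothetical, just for scale): the same 20 TB can mean wildly different document counts depending on average document size, and per-document costs dominate many operations.

```python
# Why document count matters as much as raw index size: a 20 TB
# index could hold billions of small docs or a few hundred million
# large ones (the per-doc sizes below are illustrative guesses).
TB = 1024 ** 4

def approx_doc_count(index_bytes: int, avg_bytes_per_doc: int) -> int:
    return index_bytes // avg_bytes_per_doc

print(approx_doc_count(20 * TB, 5 * 1024))    # ~5 KB/doc  -> ~4.3 billion docs
print(approx_doc_count(20 * TB, 100 * 1024))  # ~100 KB/doc -> ~215 million docs
```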
-Yonik
http://heliosearch.org - solve Solr GC pauses with off-heap filters
and fieldcache
Re: TB scale
Posted by Ed Smiley <es...@ebrary.com>.
Not looking for a cookbook.
Just curious to hear some war stories since this is relatively rare.
--Ed :)
--
Ed Smiley, Senior Software Architect, Ebooks
ProQuest | 161 Evelyn Ave. | Mountain View, CA 94041 USA | +1 640 475 8700
ext. 3772
ed.smiley@proquest.com
www.proquest.com | www.ebrary.com | www.eblib.com
ebrary and EBL, ProQuest businesses
On 4/25/14, 2:01 PM, "Otis Gospodnetic" <ot...@gmail.com> wrote:
>Hi Ed,
>
>Unfortunately, there is no good *general* advice, so you'd need to provide
>a lot more detail to get useful help.
>
>Otis
>--
>Performance Monitoring * Log Analytics * Search Analytics
>Solr & Elasticsearch Support * http://sematext.com/
>
>
>On Fri, Apr 25, 2014 at 3:48 PM, Ed Smiley <es...@ebrary.com> wrote:
>
>> Anyone with experience, suggestions or lessons learned in the 10 -100 TB
>> scale they'd like to share?
>> Researching optimum design for a Solr Cloud with, say, about 20TB index.
>> -
>> Thanks
>>
>> Ed Smiley, Senior Software Architect, Ebooks
>> ProQuest | 161 Evelyn Ave. | Mountain View, CA 94041 USA | +1 640 475
>>8700
>> ext. 3772
>> ed.smiley@proquest.com
>> www.proquest.com | www.ebrary.com | www.eblib.com
>> ebrary and EBL, ProQuest businesses
>>
>>
Re: TB scale
Posted by Otis Gospodnetic <ot...@gmail.com>.
Hi Ed,
Unfortunately, there is no good *general* advice, so you'd need to provide
a lot more detail to get useful help.
Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/
On Fri, Apr 25, 2014 at 3:48 PM, Ed Smiley <es...@ebrary.com> wrote:
> Anyone with experience, suggestions or lessons learned in the 10 -100 TB
> scale they'd like to share?
> Researching optimum design for a Solr Cloud with, say, about 20TB index.
> -
> Thanks
>
> Ed Smiley, Senior Software Architect, Ebooks
> ProQuest | 161 Evelyn Ave. | Mountain View, CA 94041 USA | +1 640 475 8700
> ext. 3772
> ed.smiley@proquest.com
> www.proquest.com | www.ebrary.com | www.eblib.com
> ebrary and EBL, ProQuest businesses
>
>