Posted to solr-user@lucene.apache.org by Ed Smiley <es...@ebrary.com> on 2014/04/25 21:48:41 UTC
TB scale
Anyone with experience, suggestions or lessons learned in the 10-100 TB scale they'd like to share?
Researching optimum design for a SolrCloud setup with, say, about a 20TB index.
-
Thanks
Ed Smiley, Senior Software Architect, Ebooks
ProQuest | 161 Evelyn Ave. | Mountain View, CA 94041 USA | +1 640 475 8700 ext. 3772
ed.smiley@proquest.com
www.proquest.com | www.ebrary.com | www.eblib.com
ebrary and EBL, ProQuest businesses
Re: TB scale
Posted by Walter Underwood <wu...@wunderwood.org>.
I think Hathi Trust has a few terabytes of index. They do full-text search on 10 million books.
http://www.hathitrust.org/blogs/Large-scale-Search
wunder
On Apr 26, 2014, at 8:36 AM, Toke Eskildsen <te...@statsbiblioteket.dk> wrote:
>> Anyone with experience, suggestions or lessons learned in the 10 -100 TB scale they'd like to share?
>> Researching optimum design for a Solr Cloud with, say, about 20TB index.
>
> We're building a web archive with a projected index size of 20TB (distributed in 20 shards). Some test results and a short write-up at http://sbdevel.wordpress.com/2013/12/06/danish-webscale/ - feel free to ask for more details.
>
> tl;dr: We're saying to hell with RAM for caching and putting it all on SSDs on a single big machine. Results so far (some distributed tests with 200GB & 400GB indexes, some single tests with a production-index of 1TB) are very promising, both for plain keyword-search, grouping and faceting (DocValues rocks).
>
> - Toke Eskildsen
RE: TB scale
Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
> Anyone with experience, suggestions or lessons learned in the 10 -100 TB scale they'd like to share?
> Researching optimum design for a Solr Cloud with, say, about 20TB index.
We're building a web archive with a projected index size of 20TB (distributed in 20 shards). Some test results and a short write-up at http://sbdevel.wordpress.com/2013/12/06/danish-webscale/ - feel free to ask for more details.
tl;dr: We're saying to hell with RAM for caching and putting it all on SSDs on a single big machine. Results so far (some distributed tests with 200GB & 400GB indexes, some single tests with a production-index of 1TB) are very promising, both for plain keyword-search, grouping and faceting (DocValues rocks).
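As a back-of-the-envelope check (a hypothetical sketch, not tooling from this setup), the per-shard size falls straight out of those numbers:

```python
# Back-of-the-envelope shard sizing: a 20 TB index split evenly
# across 20 shards works out to about 1 TB per shard.
def per_shard_tb(total_index_tb: float, num_shards: int) -> float:
    """Average index size per shard, in TB, assuming an even split."""
    return total_index_tb / num_shards

print(per_shard_tb(20, 20))  # -> 1.0
```

That ~1 TB-per-shard figure matches the single-shard production test mentioned above.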
- Toke Eskildsen
Re: TB scale
Posted by Shawn Heisey <so...@elyograg.org>.
On 4/25/2014 1:48 PM, Ed Smiley wrote:
> Anyone with experience, suggestions or lessons learned in the 10 -100 TB scale they'd like to share?
> Researching optimum design for a Solr Cloud with, say, about 20TB index.
You've gotten some good information already in the replies that have
come your way. The following blog post is even more relevant (in the
"we don't know" department) for large indexes than it is for small indexes:
http://searchhub.org/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
My own index is nowhere near that size. It has 95 million records and
seven shards. A single copy is about 108GB and lives on two servers
that each have 64GB of RAM. I'm not running in SolrCloud mode.
The most important resource for Solr scalability is RAM. This includes
the Java heap on each server, as well as unallocated memory so the
operating system can cache the index data that lives on that server.
http://wiki.apache.org/solr/SolrPerformanceProblems
As the wiki page says, ideally you'd want as much RAM for the OS disk
cache as the index takes up on disk, but 40TB of RAM across all servers
just for the OS disk cache (in addition to whatever you need for the
java heap) is too expensive to contemplate. A 1:1 ratio is not an
absolute requirement, although it does produce the best results.
For that 40TB ideal figure, I am assuming that you mean a single replica
of your index would be 20TB, and that you'd have two.
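The arithmetic behind that figure can be sketched out (a hypothetical illustration of the 1:1 cache ratio described above, with made-up server counts):

```python
# Sketch of the 1:1 OS-disk-cache rule of thumb: ideal cache RAM
# equals the total index size on disk, summed across all replicas.
def ideal_cache_ram_tb(index_tb_per_replica: float, replicas: int) -> float:
    return index_tb_per_replica * replicas

def per_server_cache_gb(total_cache_tb: float, num_servers: int) -> float:
    """Cache RAM each server would need, in GB (1 TB = 1024 GB)."""
    return total_cache_tb * 1024 / num_servers

total = ideal_cache_ram_tb(20, 2)      # the 40 TB figure above
print(total)                           # -> 40.0
print(per_server_cache_gb(total, 40))  # hypothetical 40-server cluster -> 1024.0 GB each
```

Even spread across a (hypothetical) 40-server cluster, that is a terabyte of cache RAM per machine, which is why the 1:1 ratio is a ceiling to aim toward rather than a requirement.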
Doing everything you can to reduce the index size will go a long way
towards improving Solr performance. Having SSD in each server for the
index data would also help. If the query volume is high, a large number
of very fast CPU cores is also required.
Thanks,
Shawn
Re: TB scale
Posted by Jack Krupansky <ja...@basetechnology.com>.
Also take a look at DataStax Enterprise for managing large distributed
databases, using Cassandra as the system-of-record data store and Solr
for indexing and search.
See:
http://www.datastax.com/what-we-offer/products-services/datastax-enterprise
How many documents is your 20TB?
-- Jack Krupansky
-----Original Message-----
From: Ed Smiley
Sent: Friday, April 25, 2014 3:48 PM
To: solr-user@lucene.apache.org
Subject: TB scale
Anyone with experience, suggestions or lessons learned in the 10 -100 TB
scale they'd like to share?
Researching optimum design for a Solr Cloud with, say, about 20TB index.
-
Thanks
Ed Smiley, Senior Software Architect, Ebooks
ProQuest | 161 Evelyn Ave. | Mountain View, CA 94041 USA | +1 640 475 8700
ext. 3772
ed.smiley@proquest.com
www.proquest.com | www.ebrary.com | www.eblib.com
ebrary and EBL, ProQuest businesses
Re: TB scale
Posted by Yonik Seeley <yo...@heliosearch.com>.
How many documents? That can be just as important (often more
important) than total index size.
Some other details, like the types of requests, would be helpful (e.g.
what the index will be used for, the latency requirements of requests,
whether you will be faceting, etc.).
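To illustrate the point (figures here are purely hypothetical, just for scale): the same 20 TB can mean wildly different document counts depending on average document size, and per-document costs dominate many operations.

```python
# Why document count matters as much as raw index size: a 20 TB
# index could hold billions of small docs or a few hundred million
# large ones (the per-doc sizes below are illustrative guesses).
TB = 1024 ** 4

def approx_doc_count(index_bytes: int, avg_bytes_per_doc: int) -> int:
    return index_bytes // avg_bytes_per_doc

print(approx_doc_count(20 * TB, 5 * 1024))    # ~5 KB/doc  -> ~4.3 billion docs
print(approx_doc_count(20 * TB, 100 * 1024))  # ~100 KB/doc -> ~215 million docs
```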
-Yonik
http://heliosearch.org - solve Solr GC pauses with off-heap filters
and fieldcache
Re: TB scale
Posted by Ed Smiley <es...@ebrary.com>.
Not looking for a cookbook.
Just curious to hear some war stories since this is relatively rare.
--Ed :)
--
Ed Smiley, Senior Software Architect, Ebooks
ProQuest | 161 Evelyn Ave. | Mountain View, CA 94041 USA | +1 640 475 8700
ext. 3772
ed.smiley@proquest.com
www.proquest.com | www.ebrary.com | www.eblib.com
ebrary and EBL, ProQuest businesses
On 4/25/14, 2:01 PM, "Otis Gospodnetic" <ot...@gmail.com> wrote:
>Hi Ed,
>
>Unfortunately, there is no good *general* advice, so you'd need to provide
>a lot more detail to get useful help.
>
>Otis
>--
>Performance Monitoring * Log Analytics * Search Analytics
>Solr & Elasticsearch Support * http://sematext.com/
>
>
>On Fri, Apr 25, 2014 at 3:48 PM, Ed Smiley <es...@ebrary.com> wrote:
>
>> Anyone with experience, suggestions or lessons learned in the 10 -100 TB
>> scale they'd like to share?
>> Researching optimum design for a Solr Cloud with, say, about 20TB index.
>> -
>> Thanks
>>
>> Ed Smiley, Senior Software Architect, Ebooks
>> ProQuest | 161 Evelyn Ave. | Mountain View, CA 94041 USA | +1 640 475
>>8700
>> ext. 3772
>> ed.smiley@proquest.com
>> www.proquest.com | www.ebrary.com | www.eblib.com
>> ebrary and EBL, ProQuest businesses
>>
>>
Re: TB scale
Posted by Otis Gospodnetic <ot...@gmail.com>.
Hi Ed,
Unfortunately, there is no good *general* advice, so you'd need to provide
a lot more detail to get useful help.
Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/
On Fri, Apr 25, 2014 at 3:48 PM, Ed Smiley <es...@ebrary.com> wrote:
> Anyone with experience, suggestions or lessons learned in the 10 -100 TB
> scale they'd like to share?
> Researching optimum design for a Solr Cloud with, say, about 20TB index.
> -
> Thanks
>
> Ed Smiley, Senior Software Architect, Ebooks
> ProQuest | 161 Evelyn Ave. | Mountain View, CA 94041 USA | +1 640 475 8700
> ext. 3772
> ed.smiley@proquest.com
> www.proquest.com | www.ebrary.com | www.eblib.com
> ebrary and EBL, ProQuest businesses
>
>