Posted to solr-user@lucene.apache.org by Shawn Heisey <so...@elyograg.org> on 2010/02/25 22:09:37 UTC

Advice on deployment

We are currently using a commercial indexing product based on Lucene for 
our indexing needs, and would like to replace it with SOLR.  The source 
database for this system has 40 million records, growing by about 30,000 
items per day. It is a repository for all the metadata relating to an 
archive of photos/images, text articles, and recently video content.  
The metadata also exists in the filesystem along with the actual 
content, and this copy is used when a single item is selected on the 
website.

Our existing deployment consists of 20 "static" index processes running 
on 10 servers, each with up to 2.1 million rows of the index.  In 
addition, there is an 11th server that acts as a search broker and 
houses another index, called the incremental, where all the new data 
goes as it comes in.  Each of these servers is a 64-bit CentOS virtual 
machine with 5.5GB of RAM allocated.  They run on three hosts, each with 
32GB of RAM, 8 CPU cores at either 2.5 or 2.66GHz, and four SATA disks 
in a RAID10 array, plus a fourth machine available for redundancy and 
development.  The data is divided among these indexes by an 
autoincrement database field called DID (document identifier), the 
primary key on the table.  The table (75GB of data, 9GB of index) has 
80+ fields for each document, but only about 60 of them are used in the 
index, and about 50 of those are actually stored in the index.

The current SOLR plan is to divide the index into shards, and distribute 
the documents among the shards by using a modulus function on the DID.  
We haven't yet determined how many documents to put in each shard for the 
best performance.  As part of the testing for this, I am building the 
full index right now with six shards, so each one will have between 6 
and 7 million documents.  For performance reasons, we will continue the 
incremental concept and use a separate shard for new data.
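For illustration, the modulus routing we have in mind is nothing more than the following sketch (the shard count of six is just our current test value, and shard_for is a name I'm inventing here, not anything in our code):

```python
NUM_SHARDS = 6  # test value only; we haven't settled on a final shard count

def shard_for(did: int) -> int:
    """Route a document to a shard number by its autoincrement DID (the primary key)."""
    return did % NUM_SHARDS

# DIDs 0..5 land on shards 0..5; DID 6 wraps back around to shard 0
```

Since DID is a dense autoincrement value, this spreads documents close to evenly across the shards.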

I actually did build a massive single index with the entire database in 
it.  It performed a lot better than I expected - returning results 
within 1-2 seconds with one query happening at a time - but it ground to 
a halt when subjected to a very mild load test.  At the time, my caches 
had not been tuned, but I still don't think it will scale.

We are currently testing with the pre-packaged version 1.4 running under 
Jetty because we want maximum stability, but we already determined that 
this isn't going to cut it.  At the very least, we will need SOLR-1143 
to keep the system working through problems and maintenance, but today I 
was pointed in the direction of SolrCloud.

I have duplicated the concept of a search broker from the current 
solution by giving each of my VMs four cores - broker, live, build, and 
test.  The broker core talks to a directory named broker, the other 
three use core0, core1, and core2, so that we can swap cores and not 
have name confusion.  We'll just use the XML output of the server to 
keep track of them.
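For reference, the per-VM core layout can be sketched in solr.xml roughly like this (the names and directories just show our convention; this isn't a final config):

```
<!-- sketch of solr.xml: one broker core plus three swappable cores -->
<solr persistent="true">
  <cores adminPath="/admin/cores">
    <core name="broker" instanceDir="broker"/>
    <core name="live"   instanceDir="core0"/>
    <core name="build"  instanceDir="core1"/>
    <core name="test"   instanceDir="core2"/>
  </cores>
</solr>
```

Swapping would then go through the CoreAdmin handler, e.g. /solr/admin/cores?action=SWAP&core=live&other=build, which is why we want the core names decoupled from the directory names.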

The broker core has a shards parameter included in solrconfig.xml 
pointing at all the 'live' cores, so you can just issue search queries 
to that core and get results from the entire system - as long as all the 
shards are up.  The broker core will always have an empty index.
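Concretely, that means a defaults entry on the broker's request handler along these lines (host names here are placeholders, not our real machines):

```
<!-- sketch: broker core's solrconfig.xml, distributed search defaults -->
<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <str name="shards">idx1:8983/solr/live,idx2:8983/solr/live,idx3:8983/solr/live</str>
  </lst>
</requestHandler>
```

A plain query against the broker then fans out to every listed shard and merges the results.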

Does anyone have any recommendations about the best way to build in 
fault tolerance, considering long-term viability?  I don't think I want 
to use the full SolrCloud feature set, but shard replicas are very 
intriguing.  The drawback there is that I need twice as much hardware so 
I can run two copies of each shard on separate physical machines, but it 
has a plus side - if we lost a whole VM host, recovery would be much 
easier than it is now.  I found SOLR-1537, which, if I read it right, 
would let us have shard replicas with the regular codebase.

How stable can I expect things to be if I get a nightly build or grab 
1.5 and/or SolrCloud from SVN?  Are there any particular nightly builds 
that are known to be more stable than others?  I'm an admin who flirts 
with programming from time to time, but we do have actual Java 
developers on staff.

Thanks,
Shawn