Posted to solr-user@lucene.apache.org by pa...@wipro.com on 2013/05/15 07:00:30 UTC

Billion document index

Hi,

We have to set up a billion-document index using Apache Solr (about 2 billion docs). I need some assistance choosing the right configuration for the environment. I have been going through the Solr documentation, but couldn't figure out what the best configuration would be.

Below is the configuration of one of the boxes we have; can you please advise whether it will meet our requirements?
OS: SunOS
RAM : 32GB
Processor: 4 (SPARC64-VII+/2660Mhz)
Server: Sun SPARC Enterprise M5000

Desired System Requirements:
Expected number of requests/day: 10,000
New documents ingested/day: 1 million

I have a few questions:

1.       Does the above system configuration seem OK for our requirement?

2.       Is it OK to host the entire index on a single physical box, or should I use multiple physical boxes?

3.       Should I go for a simple installation or SolrCloud?

4.       If I use SolrCloud, I may have to use some master/slave setup (not sure, though). What I have in mind is to use the master for ingestion of new documents and the slaves for querying; then, at the end of the day, replication would update the slaves. Can you please advise if this is a good approach and, most importantly, whether it is feasible?

On Google, I found that many people have already set up such environments, but I couldn't figure out the configurations they are using. If someone can share their experience, it will probably help others as well.

Thanks!

Regards,
Pankaj


Please do not print this email unless it is absolutely necessary. 

The information contained in this electronic message and any attachments to this message are intended for the exclusive use of the addressee(s) and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you should not disseminate, distribute or copy this e-mail. Please notify the sender immediately and destroy all copies of this message and any attachments. 

WARNING: Computer viruses can be transmitted via email. The recipient should check this email and any attachments for the presence of viruses. The company accepts no liability for any damage caused by any virus transmitted by this email. 

www.wipro.com

Re: Billion document index

Posted by Shawn Heisey <so...@elyograg.org>.
On 5/15/2013 3:56 AM, pankaj.pandey4@wipro.com wrote:
> Thanks, Shawn, for explaining everything in such detail; it was really helpful.
> 
> I have a few more queries on the same topic. Can you please explain the purpose of the 3rd box in the minimal configuration, with the standalone zookeeper?

A zookeeper ensemble works best with an odd number of hosts.  If you
want redundancy, it requires a minimum instance count of three.  To
ensure that it can survive hardware failure, they must all be on
different physical machines.  If you've only got two Solr servers, then
you need a third host to complete zookeeper.
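The odd-number recommendation follows from ZooKeeper's majority quorum. A quick sketch of the arithmetic (not Solr code, just the reasoning):

```python
def quorum(ensemble_size: int) -> int:
    """ZooKeeper needs a strict majority of the ensemble to stay up."""
    return ensemble_size // 2 + 1

def tolerated_failures(ensemble_size: int) -> int:
    return ensemble_size - quorum(ensemble_size)

for n in range(1, 6):
    print(f"{n} hosts: survives {tolerated_failures(n)} failure(s)")
```

Two hosts tolerate no failures at all (a majority of two is two), and a fourth host adds nothing over three, which is why three is the minimum for redundancy and odd counts are preferred.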

> On a separate note, if we go ahead with 4 boxes (8 shards with a replication factor of 2):
> 	1. Would it be OK to maintain the replicas on the same box, or would we need separate boxes for that?

You do not want replicas on the same host.  Failures are inevitable, and
if a host with both replicas of a shard were to fail, that data would be
either temporarily inaccessible or just gone.

With 8 shards and a replication factor of two, you'll have 16 total
replicas, with four replicas on each server.  The replicas for shard 1
might be on server 1 and server 2, the replicas for shard 2 might be on
server 3 and server 4, etc.
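That layout can be sketched with simple round-robin placement (a toy model; SolrCloud's actual assignment logic may differ):

```python
from itertools import cycle

servers = ["server1", "server2", "server3", "server4"]
num_shards, replication_factor = 8, 2

# Walk the server list in a cycle so the two replicas of any one
# shard always land on different hosts.
next_server = cycle(servers)
placement = {
    f"shard{s}": [next(next_server) for _ in range(replication_factor)]
    for s in range(1, num_shards + 1)
}

replicas_per_server = {srv: 0 for srv in servers}
for replicas in placement.values():
    for srv in replicas:
        replicas_per_server[srv] += 1
```

With 8 shards and a replication factor of 2, this yields 16 replicas, 4 per server, and no shard has both of its replicas on the same host.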

> 	2. Is the above configuration sufficient to guarantee failover and high availability?

If you use the collections API to create your collection, it will
automatically place the replicas so that everything will fail over
correctly.
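For reference, a Collections API CREATE call with these numbers would look roughly like the following (the collection name, config name, and host are made up; the parameter names come from the SolrCloud Collections API):

```python
from urllib.parse import urlencode

params = {
    "action": "CREATE",
    "name": "bigindex",                         # hypothetical collection name
    "numShards": 8,
    "replicationFactor": 2,
    "maxShardsPerNode": 4,                      # 16 replicas over 4 servers
    "collection.configName": "bigindex_conf",   # hypothetical config set
}
url = "http://server1:8983/solr/admin/collections?" + urlencode(params)
print(url)
```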

> 	3. How can I configure my application to always query the replicas and use the master only for ingestion? The replicas would be synced with the master after working hours (overnight).

You can't.  SolrCloud's basic operational model doesn't work this way.
When you index, the document is forwarded to the replica that is the
current elected leader for the proper shard.  The leader will index the
document and forward it to all other replicas for that shard, which will
also index the document.  Normal SolrCloud operation does not use
replication; each copy does its own indexing.  You just have to index
the data to any machine in the cluster and SolrCloud takes care of the rest.

When you query, the machine that receives the request will automatically
farm out different requests to itself and other machines, giving you
some aspects of load balancing for free.

You may be confused by the replication comment above, because SolrCloud
actually does require that you enable replication in your config.  The
reason that it requires this config is that replication may be required
when recovering the index on a replica that goes down and then comes
back up.  It is ONLY used for index recovery, not normal operation.

Thanks,
Shawn


RE: Billion document index

Posted by pa...@wipro.com.
Thanks, Shawn, for explaining everything in such detail; it was really helpful.

I have a few more queries on the same topic. Can you please explain the purpose of the 3rd box in the minimal configuration, with the standalone zookeeper?

On a separate note, if we go ahead with 4 boxes (8 shards with a replication factor of 2):
	1. Would it be OK to maintain the replicas on the same box, or would we need separate boxes for that?
	2. Is the above configuration sufficient to guarantee failover and high availability?
	3. How can I configure my application to always query the replicas and use the master only for ingestion? The replicas would be synced with the master after working hours (overnight).
	

Regards,
Pankaj

-----Original Message-----
From: Shawn Heisey [mailto:solr@elyograg.org] 
Sent: Wednesday, May 15, 2013 12:01 PM
To: solr-user@lucene.apache.org
Subject: Billion document index

On 5/14/2013 11:00 PM, pankaj.pandey4@wipro.com wrote:
> We have to set up a billion-document index using Apache Solr (about 2 billion docs). I need some assistance choosing the right configuration for the environment. I have been going through the Solr documentation, but couldn't figure out what the best configuration would be.
> 
> Below is the configuration of one of the boxes we have; can you please advise whether it will meet our requirements?
> OS: SunOS
> RAM : 32GB
> Processor: 4 (SPARC64-VII+/2660Mhz)
> Server: Sun SPARC Enterprise M5000
> 
> Desired System Requirements:
> Expected number of requests/day: 10,000
> New documents ingested/day: 1 million
> 
> I have a few questions:
> 
> 1.       Does the above system configuration seem OK for our requirement?
> 
> 2.       Is it OK to host the entire index on a single physical box, or should I use multiple physical boxes?
> 
> 3.       Should I go for a simple installation or SolrCloud?
> 
> 4.       If I use SolrCloud, I may have to use some master/slave setup (not sure, though). What I have in mind is to use the master for ingestion of new documents and the slaves for querying; then, at the end of the day, replication would update the slaves. Can you please advise if this is a good approach and, most importantly, whether it is feasible?
> 
> On Google, I found that many people have already set up such environments, but I couldn't figure out the configurations they are using. If someone can share their experience, it will probably help others as well.

There's a lot of information here for you to digest.  Be sure to read all the way to the end.  The numbers (and the costs associated with those numbers) might scare you.

I have no idea what's going to be in your index.  I even looked up the website for your email domain, and still don't know what you might be trying to search.

Because it uses a 32-bit signed number to track things, a single Solr index (not sharded) is limited to a little more than 2 billion documents.  This means you'll want to use distributed search (sharding) from the beginning.  For new deployments, SolrCloud is much better than trying to handle sharding yourself.  You'll probably want 8 or more shards and a minimum replication factor of 2, so that you have two copies of every shard.  That doesn't necessarily mean that you'll need that many machines, but you might want to plan on at least four of them.  You'll probably be putting more than one shard per server.

SolrCloud is a true cluster - there is no master and no slaves.  Both indexing and queries are completely distributed.  Clients that you write in Java get these distributed features with no extra requirements; non-Java clients will require some form of external load balancing to ensure that they are always talking to a server that's up.

The absolute minimum number of physical machines you need for SolrCloud is three.  Two of those need to be the beefy workhorses.  Each of them will run Solr and a standalone ZooKeeper.  The third machine can be modest and will just run a third instance of zookeeper.  If you have more than two servers that will run Solr, then you can just run the standalone zookeeper on three of them and won't need any extra hardware.

Memory is going to be your real problem with a very large index.  When it comes to the amount of required memory, you might want to read this wiki page, then come back here:

http://wiki.apache.org/solr/SolrPerformanceProblems

I really was serious about reading that page, and not just because I wrote it.  The information you'll find there is key to understanding the scale of what you propose and what I'm going to say below.

Even with very small documents, an index with 2 billion of them is probably going to be at least 100GB, and quite possibly 300GB, 500GB, or larger.

For discussion purposes, let's say that you've got the extremely conservative index size of 100GB and you're going to put that on four servers.  To cache this index sufficiently to avoid performance problems, you'll need between 64GB and 128GB of total RAM for caching across the entire cluster.

If we assume that you've taken every possible step to reduce Solr's Java heap requirements, you might be able to do a heap of 8 to 16GB per server, but the actual heap requirement could be significantly higher.
Adding this up, you get a bare minimum memory requirement of 32GB for each of those four servers.  Ideally, you'd need to have 48GB for each of them.  If you plan to put it on two Solr servers instead of four, double the per-server memory requirement.

Remember that all the information in the previous paragraph assumes a total index size of 100GB, and your index has the potential to be a lot bigger than 100GB.  If you have a 300GB index size instead of 100GB, triple those numbers.  Scale up similarly for larger sizes.

One final note: your anticipated query volume is quite low, so you might be able to get away with a little bit less memory than I have described here, but you should be aware that running with less may cause query times measured in tens of seconds, and SolrCloud may become very unstable.

Thanks,
Shawn




Re: Billion document index

Posted by Daniel Collins <da...@gmail.com>.
Just going on our experience: we have a large collection (350M documents, but
1.2TB in size, spread across 4 shards/machines and multiple replicas; we may
well need more), and the first thing we needed to do for size estimation was
to work out how big a set number of documents would be on disk.  So we built
a test collection, inserted 1000 documents, and measured the size of the
collection.  You've told us it's 2 billion docs, but 2 billion of what?  Lots
of fields, lots of text?  How many fields are stored, how many are indexed,
etc.?  As Shawn says, you need to do some empirical analysis yourself on what
your collection will look like.  There is a spreadsheet
(size-estimator-lucene-solr.xls) in the Solr source distribution, but that
will be a very rough rule of thumb, and I don't know if it handles the new
compressed stored fields.  Only you can know how big your collection is;
there is no hard and fast rule.
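That procedure is just linear extrapolation; a sketch (the sample numbers here are placeholders, not real measurements):

```python
def estimated_index_gb(sample_docs: int, sample_mb: float,
                       target_docs: int) -> float:
    """Extrapolate total on-disk index size from a measured sample."""
    bytes_per_doc = sample_mb * 1024 * 1024 / sample_docs
    return target_docs * bytes_per_doc / 1024 ** 3

# e.g. a 1,000-doc test collection measuring 1 MB on disk,
# extrapolated to the 2 billion documents discussed in this thread
size = estimated_index_gb(1000, 1.0, 2_000_000_000)
print(f"{size:.0f} GB")
```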

As an aside, the hardware suggestion seemed odd: a beefy (and expensive) Sun
box, but with relatively low memory (I know PCs that have 16GB, so 32GB isn't
much these days).  Again, I can only comment on our experience, but we are
going for the horizontal scaling approach that SolrCloud is more suited to,
so we have smaller Linux/Intel-based machines, but with 256GB of RAM (and we
may well need 512GB) to try to maximize the use of the page cache.

Depending on your eventual collection size, and how frequently your updates
come in (you said 1M per day, but is that a bulk job or ad hoc?), you may
want to investigate SSD storage.  If you are doing lots of segment merges
because the collection is continually changing, then disk IO could be an
issue.  We went that route, since we have an NRT setup and we want to keep
the number of segments down so that search times are better.

On 15 May 2013 07:38, Shawn Heisey <so...@elyograg.org> wrote:

> On 5/15/2013 12:31 AM, Shawn Heisey wrote:
> > If we assume that you've taken every possible step to reduce Solr's Java
> > heap requirements, you might be able to do a heap of 8 to 16GB per
> > server, but the actual heap requirement could be significantly higher.
> > Adding this up, you get a bare minimum memory requirement of 32GB for
> > each of those four servers.  Ideally, you'd need to have 48GB for each
> > of them.  If you plan to put it on two Solr servers instead of four,
> > double the per-server memory requirement.
> >
> > Remember that all the information in the previous paragraph assumes a
> > total index size of 100GB, and your index has the potential to be a lot
> > bigger than 100GB.  If you have a 300GB index size instead of 100GB,
> > triple those numbers.  Scale up similarly for larger sizes.
>
> I should have made something clear here: These are just possible
> estimates, not hard reliable numbers.  In particular, if you end up
> needing a larger per-server heap size than 16GB, then my estimates are
> wrong.
>
> Thanks,
> Shawn
>
>

Re: Billion document index

Posted by Shawn Heisey <so...@elyograg.org>.
On 5/15/2013 12:31 AM, Shawn Heisey wrote:
> If we assume that you've taken every possible step to reduce Solr's Java
> heap requirements, you might be able to do a heap of 8 to 16GB per
> server, but the actual heap requirement could be significantly higher.
> Adding this up, you get a bare minimum memory requirement of 32GB for
> each of those four servers.  Ideally, you'd need to have 48GB for each
> of them.  If you plan to put it on two Solr servers instead of four,
> double the per-server memory requirement.
> 
> Remember that all the information in the previous paragraph assumes a
> total index size of 100GB, and your index has the potential to be a lot
> bigger than 100GB.  If you have a 300GB index size instead of 100GB,
> triple those numbers.  Scale up similarly for larger sizes.

I should have made something clear here: These are just possible
estimates, not hard reliable numbers.  In particular, if you end up
needing a larger per-server heap size than 16GB, then my estimates are
wrong.

Thanks,
Shawn


RE: Billion document index

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
Shawn Heisey [solr@elyograg.org]:
> Performance testing would be required in order to make a proper
> determination on whether SSD makes financial sense.

I fully agree.

[Lack of TRIM with RAID]

> then performance eventually suffers, and can become even worse than
> a spinning hard disk.

Do you have a source for that?

> From what people have said here on this list, SSDs can do really really
> amazing things for performance ... but IMHO they cannot eliminate the
> cache requirement.  They may reduce it drastically, of course.

It would be exceedingly difficult to set up a machine that had no free memory for cache, so eliminating the need for cache does not really give us anything. What works is when the need for cache is reduced to be non-significant for a machine: When building a search server for a 200GB index, there is quite the difference between needing 200GB free memory for cache and needing 20GB. Even assuming that using a SSD only lowers the cache requirement by 50%, it would still be cheaper to use a 400GB SSD instead of installing the extra 100GB of RAM.

(I'm setting the SSD price at 1/5 of RAM here. YMMV.)
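Spelled out with that assumed 1:5 price ratio (arbitrary currency units; only the ratio matters):

```python
ram_price_per_gb = 10.0
ssd_price_per_gb = ram_price_per_gb / 5   # the assumed 1:5 ratio

# 200GB index: assume the SSD cuts the cache requirement from 100% of
# the index to 50%, so 100GB of RAM is bought either way.  Compare the
# remaining 100GB of RAM against a comfortably sized 400GB SSD.
extra_ram_cost = 100 * ram_price_per_gb   # the extra 100GB of RAM
ssd_cost = 400 * ssd_price_per_gb         # a 400GB SSD instead
print(extra_ram_cost, ssd_cost)
```

Even under the pessimistic assumption that the SSD only halves the cache requirement, the SSD comes out cheaper in this model.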

- Toke Eskildsen

Re: Billion document index

Posted by Shawn Heisey <so...@elyograg.org>.
On 5/15/2013 1:57 AM, Toke Eskildsen wrote:
> On Wed, 2013-05-15 at 08:31 +0200, Shawn Heisey wrote:
>> http://wiki.apache.org/solr/SolrPerformanceProblems
>>
>> I really was serious about reading that page, and not just because I
>> wrote it.
> 
> That page makes a clear recommendation of RAM over SSDs.
> Have you done any performance testing on this?

For good performance, I don't think SSDs give you the ability to size
your machines solely according to your heap requirements.  SSD will
greatly reduce the cost of a cache miss, but it's still a lot more
expensive (timewise) to read from SSD than it is from RAM.

Based on the latency numbers of RAM and SSD, you will still need RAM for
caching.  You might only need 10-50% of your index size for the cache
instead of 50-100%.

All the above is my best estimate based on what information I have.  I
don't have access to the required hardware to do performance testing,
and it's unlikely that I will have access in the next few years.
Donations are welcome! :)  Performance testing would be required in
order to make a proper determination on whether SSD makes financial sense.

There is at least one technical hurdle that I know of with SSD, though I
hope it's already been fixed.  That's TRIM support in RAID controllers
and the software RAID of modern operating systems.  If the OS cannot
tell the SSD that the space occupied by deleted files is once again
available for re-allocation, then performance eventually suffers, and
can become even worse than a spinning hard disk.  TRIM support is
available now for single disks, but those disks are not big enough for
many workloads, and single disk volumes will eventually have a
catastrophic failure.

From what people have said here on this list, SSDs can do really really
amazing things for performance ... but IMHO they cannot eliminate the
cache requirement.  They may reduce it drastically, of course.

Thanks,
Shawn


Re: Billion document index

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
On Wed, 2013-05-15 at 08:31 +0200, Shawn Heisey wrote:
> http://wiki.apache.org/solr/SolrPerformanceProblems
> 
> I really was serious about reading that page, and not just because I
> wrote it.

That page makes a clear recommendation of RAM over SSDs.
Have you done any performance testing on this?

- Toke Eskildsen, State and University Library, Denmark


Billion document index

Posted by Shawn Heisey <so...@elyograg.org>.
On 5/14/2013 11:00 PM, pankaj.pandey4@wipro.com wrote:
> We have to set up a billion-document index using Apache Solr (about 2 billion docs). I need some assistance choosing the right configuration for the environment. I have been going through the Solr documentation, but couldn't figure out what the best configuration would be.
> 
> Below is the configuration of one of the boxes we have; can you please advise whether it will meet our requirements?
> OS: SunOS
> RAM : 32GB
> Processor: 4 (SPARC64-VII+/2660Mhz)
> Server: Sun SPARC Enterprise M5000
> 
> Desired System Requirements:
> Expected number of requests/day: 10,000
> New documents ingested/day: 1 million
> 
> I have a few questions:
> 
> 1.       Does the above system configuration seem OK for our requirement?
> 
> 2.       Is it OK to host the entire index on a single physical box, or should I use multiple physical boxes?
> 
> 3.       Should I go for a simple installation or SolrCloud?
> 
> 4.       If I use SolrCloud, I may have to use some master/slave setup (not sure, though). What I have in mind is to use the master for ingestion of new documents and the slaves for querying; then, at the end of the day, replication would update the slaves. Can you please advise if this is a good approach and, most importantly, whether it is feasible?
> 
> On Google, I found that many people have already set up such environments, but I couldn't figure out the configurations they are using. If someone can share their experience, it will probably help others as well.

There's a lot of information here for you to digest.  Be sure to read
all the way to the end.  The numbers (and the costs associated with
those numbers) might scare you.

I have no idea what's going to be in your index.  I even looked up the
website for your email domain, and still don't know what you might be
trying to search.

Because it uses a 32-bit signed number to track things, a single Solr
index (not sharded) is limited to a little more than 2 billion
documents.  This means you'll want to use distributed search (sharding)
from the beginning.  For new deployments, SolrCloud is much better than
trying to handle sharding yourself.  You'll probably want 8 or more
shards and a minimum replication factor of 2, so that you have two
copies of every shard.  That doesn't necessarily mean that you'll need
that many machines, but you might want to plan on at least four of them.
You'll probably be putting more than one shard per server.

SolrCloud is a true cluster - there is no master and no slaves.  Both
indexing and queries are completely distributed.  Clients that you write
in Java get these distributed features with no extra requirements;
non-Java clients will require some form of external load balancing to
ensure that they are always talking to a server that's up.

The absolute minimum number of physical machines you need for SolrCloud
is three.  Two of those need to be the beefy workhorses.  Each of them
will run Solr and a standalone ZooKeeper.  The third machine can be
modest and will just run a third instance of zookeeper.  If you have
more than two servers that will run Solr, then you can just run the
standalone zookeeper on three of them and won't need any extra hardware.

Memory is going to be your real problem with a very large index.  When
it comes to the amount of required memory, you might want to read this
wiki page, then come back here:

http://wiki.apache.org/solr/SolrPerformanceProblems

I really was serious about reading that page, and not just because I
wrote it.  The information you'll find there is key to understanding the
scale of what you propose and what I'm going to say below.

Even with very small documents, an index with 2 billion of them is
probably going to be at least 100GB, and quite possibly 300GB, 500GB, or
larger.

For discussion purposes, let's say that you've got the extremely
conservative index size of 100GB and you're going to put that on four
servers.  To cache this index sufficiently to avoid performance
problems, you'll need between 64GB and 128GB of total RAM for caching
across the entire cluster.

If we assume that you've taken every possible step to reduce Solr's Java
heap requirements, you might be able to do a heap of 8 to 16GB per
server, but the actual heap requirement could be significantly higher.
Adding this up, you get a bare minimum memory requirement of 32GB for
each of those four servers.  Ideally, you'd need to have 48GB for each
of them.  If you plan to put it on two Solr servers instead of four,
double the per-server memory requirement.

Remember that all the information in the previous paragraph assumes a
total index size of 100GB, and your index has the potential to be a lot
bigger than 100GB.  If you have a 300GB index size instead of 100GB,
triple those numbers.  Scale up similarly for larger sizes.
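Packaged as arithmetic, the estimates above look like this (the 0.64/1.28 cache factors are just the 64 to 128GB-for-100GB numbers restated; a rough rule of thumb, not a sizing formula):

```python
def per_server_ram_gb(index_gb: float, servers: int,
                      heap_gb: float = 16.0) -> tuple[float, float]:
    """Rough (minimum, ideal) RAM per server: a share of the
    cluster-wide cache requirement plus the Java heap on every server."""
    cache_low, cache_high = 0.64 * index_gb, 1.28 * index_gb
    return (cache_low / servers + heap_gb,
            cache_high / servers + heap_gb)

print(per_server_ram_gb(100, 4))   # the 100GB / four-server example
print(per_server_ram_gb(300, 4))   # the same cluster with a 300GB index
```

The 100GB case reproduces the 32GB minimum and 48GB ideal per server; note that only the cache term scales with index size here, since the heap is a fixed per-server cost in this model.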

One final note: your anticipated query volume is quite low, so you might
be able to get away with a little bit less memory than I have described
here, but you should be aware that running with less may cause query
times measured in tens of seconds, and SolrCloud may become very unstable.

Thanks,
Shawn


Re: Billion document index

Posted by Jack Krupansky <ja...@basetechnology.com>.
Although technically it may be possible to put 1 billion documents in a 
single Solr/Lucene index (2 billion hard limit), I would recommend simply: 
Don't do it! Don't try to put more than 250 million documents on a single 
Solr node. In fact, 100 million is a better, more realistic limit.

To be clear, it does depend on your schema, your actual data, and your query 
complexity (e.g., faceting, highlighting, sorting, etc.), not to mention your 
query latency requirements. And then query load will determine your 
replication needs.

Sure, if you are an uber-guru you can probably make it work, but if you are 
sane, go with a sharded cluster such as SolrCloud.

In any case, always start with a proof of concept. Load up a test system 
(cluster) with representative dummy data and see how it performs. Trying to 
predict hardware needs in advance for a Solr deployment is just a very bad 
idea. Proof of concept first! That provides the best "prediction".

In summary, I recommend 100 to 250 million as a best target for documents 
per node for a proof of concept, and then your actual testing will confirm 
whether the actual target should be higher or lower. In some cases, with 
complex queries and low latency requirements, 50 or even 10 million might be 
a more realistic target, while in other cases with really simple data 400 or 
500 million documents might actually work.
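Those per-node targets translate directly into a shard count for 2 billion documents (before replication multiplies the node count further):

```python
import math

total_docs = 2_000_000_000

# Shards needed at the suggested comfort range of documents per node.
for docs_per_node in (250_000_000, 100_000_000):
    shards = math.ceil(total_docs / docs_per_node)
    print(f"{docs_per_node:,} docs/node -> {shards} shards")
```

At 250 million per node this lands exactly on Shawn's earlier suggestion of 8 or more shards; the conservative 100 million target pushes it to 20.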

-- Jack Krupansky

-----Original Message----- 
From: pankaj.pandey4@wipro.com
Sent: Wednesday, May 15, 2013 1:00 AM
To: solr-user@lucene.apache.org
Subject: Billion document index

Hi,

We have to set up a billion-document index using Apache Solr (about 2 billion 
docs). I need some assistance choosing the right configuration for the 
environment. I have been going through the Solr documentation, but couldn't 
figure out what the best configuration would be.

Below is the configuration of one of the boxes we have; can you please advise 
whether it will meet our requirements?
OS: SunOS
RAM : 32GB
Processor: 4 (SPARC64-VII+/2660Mhz)
Server: Sun SPARC Enterprise M5000

Desired System Requirements:
Expected number of requests/day: 10,000
New documents ingested/day: 1 million

I have a few questions:

1.       Does the above system configuration seem OK for our requirement?

2.       Is it OK to host the entire index on a single physical box, or 
should I use multiple physical boxes?

3.       Should I go for a simple installation or SolrCloud?

4.       If I use SolrCloud, I may have to use some master/slave setup (not 
sure, though). What I have in mind is to use the master for ingestion of new 
documents and the slaves for querying; then, at the end of the day, 
replication would update the slaves. Can you please advise if this is a good 
approach and, most importantly, whether it is feasible?

On Google, I found that many people have already set up such environments, 
but I couldn't figure out the configurations they are using. If someone can 
share their experience, it will probably help others as well.

Thanks!

Regards,
Pankaj

