Posted to solr-user@lucene.apache.org by Rahul Goswami <ra...@gmail.com> on 2019/06/25 20:53:22 UTC

Configuration recommendation for SolrCloud

Hello,
We are running Solr 7.2.1 and planning for a deployment which will grow to
4 billion documents over time. We have 16 nodes at our disposal. I am
deciding between 3 configurations:

1 cluster - 16 nodes
vs
2 clusters - 8 nodes each
vs
4 clusters - 4 nodes each

Irrespective of the configuration, each node would host 8 shards (e.g., a
cluster with 16 nodes would have 16*8=128 shards; similarly, 32 shards in a
4-node cluster). These 16 nodes will be hosted across 4 beefy servers, each
with 128 GB RAM, so we can allocate 32 GB RAM (not heap space) to each
node. What configuration would be most efficient for our use case,
considering a moderate-to-heavy indexing and search load? I would also like
to know the tradeoffs involved, if any. Thanks in advance!
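
The arithmetic behind the three options can be laid out in a short sketch.
It assumes the 4 billion documents are split evenly across clusters and
shards and that no replicas are involved (neither is stated in the
message); the function and key names are illustrative only.

```python
# Figures taken from the message above.
TOTAL_DOCS = 4_000_000_000   # target corpus size
TOTAL_NODES = 16             # Solr nodes available
SHARDS_PER_NODE = 8          # stated for every option
SERVERS = 4                  # physical servers
RAM_PER_SERVER_GB = 128

def describe(clusters):
    """Per-cluster figures, ASSUMING an even split of nodes and documents."""
    nodes = TOTAL_NODES // clusters
    shards = nodes * SHARDS_PER_NODE
    docs_per_cluster = TOTAL_DOCS // clusters
    return {
        "clusters": clusters,
        "nodes_per_cluster": nodes,
        "shards_per_cluster": shards,
        "docs_per_cluster": docs_per_cluster,
        "docs_per_shard": docs_per_cluster // shards,
    }

# RAM (not heap) available to each node when nodes are spread evenly.
ram_per_node_gb = RAM_PER_SERVER_GB * SERVERS // TOTAL_NODES

for option in (1, 2, 4):
    print(describe(option))
print("RAM per node (GB):", ram_per_node_gb)
```

Under these assumptions the number of documents per shard is the same in
all three options; what differs is the number of shards per cluster (128,
64 or 32).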

Regards,
Rahul

Re: Configuration recommendation for SolrCloud

Posted by Rahul Goswami <ra...@gmail.com>.
Hi Toke,

Thank you for following up. Reading back, I surely could have explained
better. Thanks for asking again.

>> What is a cluster? Is it a fully separate SolrCloud?
Yes, by cluster I mean a fully separate SolrCloud.


>> If so, does that mean you can divide your collection into (at least) 4
independent parts, where the indexing flow and the clients know which
cluster to use?
Yes. We can divide the documents across 4 SolrClouds, each with multiple
nodes, and the clients would know which SolrCloud to index to.


>> Can it be divided further?
For the sake of maintainability and ease of configuration, we wouldn't want
to go beyond 4 SolrClouds, so at this point I would say no. But I am open to
ideas if you think it would be greatly advantageous.


So if we go with the 3rd configuration option, we would be indexing roughly
1 billion documents (with an analyzed 'content' field possibly containing
large text) per SolrCloud.

Also, I later learned of additional configuration details and updated
hardware specs, so let me revise the numbers. We would index with a
replication factor of 2. Hence each SolrCloud would have 4x2=8 nodes and
1 billion x 2 = 2 billion documents indexed (with an analyzed 'content'
field possibly containing large text). We would have up to 12 GB heap space
allocated per node. By node I mean an individual Solr instance running on a
certain port. To break down the specs:

For each SolrCloud:

8 nodes, each with 12 GB heap for Solr and each hosting 16 replicas
(cores).
2 billion documents (replication factor = 2, so 1 billion unique documents).

Would SolrCloud scale well with the given configuration for a
moderate-to-heavy indexing and search load?

Additional consideration: We have 4 beefy physical servers at our disposal
for this deployment. If we go with 4 SolrClouds, then we would have 4x8=32
nodes (Solr instances) running across these 4 physical servers.
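
The revised specs imply a few derived numbers, worked out in the sketch
below. It assumes every shard has exactly 2 replicas, that documents are
spread evenly over shards, and that the 32 nodes are spread evenly over the
4 servers; none of this is stated explicitly, and the variable names are
illustrative. The updated server RAM is not given, so the sketch stops at
heap.

```python
# Figures taken from the revised specs above.
NODES_PER_CLOUD = 8
REPLICAS_PER_NODE = 16
REPLICATION_FACTOR = 2
UNIQUE_DOCS_PER_CLOUD = 1_000_000_000
HEAP_PER_NODE_GB = 12
CLOUDS = 4
SERVERS = 4

# Cores (replicas) and logical shards in one SolrCloud.
cores_per_cloud = NODES_PER_CLOUD * REPLICAS_PER_NODE
shards_per_cloud = cores_per_cloud // REPLICATION_FACTOR

# Documents per shard and per node, ASSUMING an even distribution.
docs_per_shard = UNIQUE_DOCS_PER_CLOUD // shards_per_cloud
docs_per_node = docs_per_shard * REPLICAS_PER_NODE

# Solr instances and total heap on each physical server.
total_nodes = CLOUDS * NODES_PER_CLOUD
nodes_per_server = total_nodes // SERVERS
heap_per_server_gb = nodes_per_server * HEAP_PER_NODE_GB

print("shards per SolrCloud:", shards_per_cloud)
print("docs per shard:", docs_per_shard)
print("docs per node:", docs_per_node)
print("nodes per server:", nodes_per_server)
print("heap per server (GB):", heap_per_server_gb)
```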

Do you see any issues with this configuration, or any additional
considerations that I might be missing?

Thanks,
Rahul


Re: Configuration recommendation for SolrCloud

Posted by Toke Eskildsen <to...@kb.dk>.
Rahul Goswami <ra...@gmail.com> wrote:
> We are running Solr 7.2.1 and planning for a deployment which will grow to
> 4 billion documents over time. We have 16 nodes at disposal. I am thinking
> between 3 configurations:
> 
> 1 cluster - 16 nodes
> vs
> 2 clusters - 8 nodes each
> vs
> 4 clusters - 4 nodes each

You haven't got any answers, maybe because it is a bit unclear what you're asking. What is a cluster? Is it a fully separate SolrCloud? If so, does that mean you can divide your collection into (at least) 4 independent parts, where the indexing flow and the clients know which cluster to use? Can it be divided further?

- Toke Eskildsen

Re: Configuration recommendation for SolrCloud

Posted by Jörn Franke <jo...@gmail.com>.
As someone else wrote, there are a lot of uncertainties, and I recommend testing it yourself to find the optimal configuration. Some food for thought:
How many clients do you have and what is their concurrency? What operations will they perform? Do they access Solr directly? You can use JMeter to simulate the querying part (and also the indexing). Depending on the concurrency of users, you may need to think about the number of CPUs.
What does moderate indexing mean? How much does the collection grow per day?
Have you thought about putting the ZooKeeper ensemble on dedicated nodes?
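
To make the concurrency question concrete, here is a minimal sketch of a query load generator. It is a plain-Python stand-in for the idea, not JMeter itself, and it only covers querying. The helper names, the example host and the collection name are hypothetical; the only Solr-specific part is a GET against the standard /select handler.

```python
import math
import statistics
import time
import urllib.parse
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def make_solr_fetch(base_url, collection):
    """Return a function that runs one query against Solr's /select handler."""
    def fetch(query):
        params = urllib.parse.urlencode({"q": query, "rows": 10})
        url = f"{base_url}/solr/{collection}/select?{params}"
        with urllib.request.urlopen(url, timeout=30) as resp:
            resp.read()  # drain the body so the full response time is measured
    return fetch

def run_load(fetch, queries, concurrency):
    """Run every query using `concurrency` threads; return latencies in seconds."""
    def timed(query):
        start = time.perf_counter()
        fetch(query)
        return time.perf_counter() - start
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(timed, queries))

def summarize(latencies):
    """Median and nearest-rank 95th percentile of the measured latencies."""
    ordered = sorted(latencies)
    p95_index = math.ceil(0.95 * len(ordered)) - 1
    return {
        "count": len(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[p95_index],
    }

# Example usage (hypothetical host and collection name, needs a running Solr):
# fetch = make_solr_fetch("http://localhost:8983", "mycollection")
# print(summarize(run_load(fetch, ["content:solr"] * 200, concurrency=16)))
```

Repeating the run with different concurrency values gives a first impression of how latency changes with the number of simultaneous clients.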

Why do you want to use an older Solr version? Why not the newest + JDK 11?

In what format are the documents? Will you convert them beforehand? What analysis will you do on the documents (this may have an impact on index size etc.)?

Also important: how do you plan to reindex the full collection in case a schema field changes? (Hint: look at using aliases for querying, so this can be done without interruption.)
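
One way to read the alias hint: clients query an alias, a new collection is built with the changed schema, and the alias is then repointed with the Collections API CREATEALIAS action (which replaces an existing alias of the same name). The sketch below only builds the request URL; the host, alias and collection names are hypothetical.

```python
import urllib.parse

def create_alias_url(base_url, alias, collection):
    """Build a Collections API CREATEALIAS request URL (names are examples)."""
    params = urllib.parse.urlencode({
        "action": "CREATEALIAS",
        "name": alias,              # the name clients query
        "collections": collection,  # the collection the alias should point to
    })
    return f"{base_url}/solr/admin/collections?{params}"

# Point the alias "docs" at the freshly reindexed collection "docs_v2".
print(create_alias_url("http://localhost:8983", "docs", "docs_v2"))
```

Issuing a GET on the printed URL once the new collection is fully indexed switches queries over without changing the clients.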

Normally I would expect a web app in between, also for security reasons. You may need to scale this one as well.

You don’t have to answer those questions here, but I recommend answering them yourself during a proof of concept at your premises.
I don’t see the point of creating more than one cluster (except for disaster recovery and cross-data-center replication, if this is needed). Maybe I am overlooking the reason why you thought of multiple clusters.
