Posted to solr-user@lucene.apache.org by Scott Prentice <sp...@leximation.com> on 2018/03/14 00:01:58 UTC

Scoping SolrCloud setup

We're in the process of moving from 12 single-core collections 
(non-cloud Solr) on 3 VMs to a SolrCloud setup. Our collections aren't 
huge, ranging in size from 50K to 150K documents with one at 1.2M docs. 
Our max query frequency is rather low .. probably no more than 
10-20/min. We do update frequently, maybe 10-100 documents every 10 mins.

Our prototype setup is using 3 VMs (4 core, 16GB RAM each), and we've 
got each collection split into 2 shards with 3 replicas (one per VM). 
Also, Zookeeper is running on each VM. I understand that it's best to 
have each ZK server on a separate machine, but hoping this will work for 
now.
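
For concreteness, here's roughly what that layout looks like as a
Collections API call. A minimal sketch in Python; the host, collection,
and configset names are placeholders, not details from our setup:

    import requests

    # Sketch: create one collection as 2 shards x 3 replicas on a 3-node
    # cluster. With 6 replica cores over 3 nodes, each node must accept 2.
    solr = "http://localhost:8983/solr"  # any node in the cluster
    resp = requests.get(solr + "/admin/collections", params={
        "action": "CREATE",
        "name": "docs",                        # placeholder collection name
        "numShards": 2,
        "replicationFactor": 3,
        "maxShardsPerNode": 2,                 # default is 1; 6 cores / 3 nodes = 2
        "collection.configName": "docs_conf",  # placeholder configset
        "wt": "json",
    })
    resp.raise_for_status()
    print(resp.json())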

This all seemed like a good place to start, but after reading lots of 
articles and posts, I'm thinking that maybe our smaller collections 
(under 100K docs) should just be one shard each, and maybe the 1.2M 
collection should be more like 6 shards. How do you decide how many 
shards is right?

Also, our current live system is separated into dev/stage/prod tiers; 
now all of these tiers are together on each of the cloud VMs. This 
bothers some people, thinking that it may make our production 
environment less stable. I know that in an ideal world, we'd have them 
all on separate systems, but with the replication, it seems like we're 
going to make the overall system more stable. Is this a correct 
understanding?

I'm just wondering if anyone has opinions on whether we're going in a 
reasonable direction or not. Are there any articles that discuss these 
initial sizing/scoping issues?

Thanks!
...scott



Re: Scoping SolrCloud setup

Posted by Scott Prentice <sp...@leximation.com>.
Greg...

Thanks. That's very helpful, and is in line with what I've been seeing.

So, to be clear, you're saying that the size of all collections on a 
server should be less than the available RAM. It looks like we've got 
about 13GB of documents in all (and growing), so if we're restricted to 
16GB on each VM, I'm thinking it probably makes sense to split the 
collections over multiple VMs rather than having them all on one. 
Perhaps instead of all indexes replicated on 3 VMs, we should split 
things up over 4 VMs and go down to just 2 replicas. We can add 2 more 
VMs to go up to 3 replicas if that seems necessary at some point.
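
As a back-of-envelope check (a sketch; the 4GB heap figure is an
assumption, not something we've settled on):

    # ~13GB of index, 2 replicas, spread over 4 VMs with 16GB RAM each
    total_index_gb = 13
    replicas = 2
    vms = 4
    ram_per_vm_gb = 16
    heap_gb = 4  # assumed JVM heap per Solr node

    index_per_vm = total_index_gb * replicas / vms  # = 6.5GB per VM
    os_cache_gb = ram_per_vm_gb - heap_gb           # = 12GB for OS cache
    print(index_per_vm, os_cache_gb)

So each VM would hold roughly 6.5GB of index with about 12GB left for 
the OS page cache, which fits the guideline.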

Thanks,
...scott


On 3/13/18 6:15 PM, Greg Roodt wrote:
> A single shard is much simpler conceptually and also cheaper to query. I
> would say that even your 1.2M collection can be a single shard. I'm running
> a single shard setup 4X that size. You can still have replicas of this
> shard for redundancy / availability purposes.
>
> I'm not an expert, but I think one of the deciding factors is if your index
> can fit into RAM (not JVM Heap, but OS cache). What are the sizes of your
> indexes?
>
> On 14 March 2018 at 11:01, Scott Prentice <sp...@leximation.com> wrote:
>
>> We're in the process of moving from 12 single-core collections (non-cloud
>> Solr) on 3 VMs to a SolrCloud setup. Our collections aren't huge, ranging
>> in size from 50K to 150K documents with one at 1.2M docs. Our max query
>> frequency is rather low .. probably no more than 10-20/min. We do update
>> frequently, maybe 10-100 documents every 10 mins.
>>
>> Our prototype setup is using 3 VMs (4 core, 16GB RAM each), and we've got
>> each collection split into 2 shards with 3 replicas (one per VM). Also,
>> Zookeeper is running on each VM. I understand that it's best to have each
>> ZK server on a separate machine, but hoping this will work for now.
>>
>> This all seemed like a good place to start, but after reading lots of
>> articles and posts, I'm thinking that maybe our smaller collections (under
>> 100K docs) should just be one shard each, and maybe the 1.2M collection
>> should be more like 6 shards. How do you decide how many shards is right?
>>
>> Also, our current live system is separated into dev/stage/prod tiers; now
>> all of these tiers are together on each of the cloud VMs. This bothers some
>> people, thinking that it may make our production environment less stable. I
>> know that in an ideal world, we'd have them all on separate systems, but
>> with the replication, it seems like we're going to make the overall system
>> more stable. Is this a correct understanding?
>>
>> I'm just wondering if anyone has opinions on whether we're going in a
>> reasonable direction or not. Are there any articles that discuss these
>> initial sizing/scoping issues?
>>
>> Thanks!
>> ...scott
>>
>>
>>


Re: Scoping SolrCloud setup

Posted by Greg Roodt <gr...@gmail.com>.
A single shard is much simpler conceptually and also cheaper to query. I
would say that even your 1.2M collection can be a single shard. I'm running
a single shard setup 4X that size. You can still have replicas of this
shard for redundancy / availability purposes.

I'm not an expert, but I think one of the deciding factors is if your index
can fit into RAM (not JVM Heap, but OS cache). What are the sizes of your
indexes?
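
If you're not sure, the CoreAdmin STATUS call reports each core's
on-disk index size. A quick Python sketch (the host is a placeholder):

    import requests

    # Print the on-disk index size and doc count for every core on a node.
    r = requests.get("http://localhost:8983/solr/admin/cores",
                     params={"action": "STATUS", "wt": "json"})
    r.raise_for_status()
    for name, core in r.json()["status"].items():
        idx = core["index"]
        print(name, round(idx["sizeInBytes"] / 2**30, 2), "GiB,",
              idx["numDocs"], "docs")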

On 14 March 2018 at 11:01, Scott Prentice <sp...@leximation.com> wrote:

> We're in the process of moving from 12 single-core collections (non-cloud
> Solr) on 3 VMs to a SolrCloud setup. Our collections aren't huge, ranging
> in size from 50K to 150K documents with one at 1.2M docs. Our max query
> frequency is rather low .. probably no more than 10-20/min. We do update
> frequently, maybe 10-100 documents every 10 mins.
>
> Our prototype setup is using 3 VMs (4 core, 16GB RAM each), and we've got
> each collection split into 2 shards with 3 replicas (one per VM). Also,
> Zookeeper is running on each VM. I understand that it's best to have each
> ZK server on a separate machine, but hoping this will work for now.
>
> This all seemed like a good place to start, but after reading lots of
> articles and posts, I'm thinking that maybe our smaller collections (under
> 100K docs) should just be one shard each, and maybe the 1.2M collection
> should be more like 6 shards. How do you decide how many shards is right?
>
> Also, our current live system is separated into dev/stage/prod tiers; now
> all of these tiers are together on each of the cloud VMs. This bothers some
> people, thinking that it may make our production environment less stable. I
> know that in an ideal world, we'd have them all on separate systems, but
> with the replication, it seems like we're going to make the overall system
> more stable. Is this a correct understanding?
>
> I'm just wondering if anyone has opinions on whether we're going in a
> reasonable direction or not. Are there any articles that discuss these
> initial sizing/scoping issues?
>
> Thanks!
> ...scott
>
>
>

Re: Scoping SolrCloud setup

Posted by Scott Prentice <sp...@leximation.com>.
Walter...

Thanks for the additional data points. Clearly we're a long way from 
needing anything too complex.

Cheers!
...scott


On 3/14/18 1:12 PM, Walter Underwood wrote:
> That would be my recommendation for a first setup. One Solr instance per host, one shard per collection. We run 5 million document cores with 8 GB of heap for the JVM. We size the RAM so that all the indexes fit in OS filesystem buffers.
>
> Our big cluster is 32 hosts, 21 million documents in four shards. Each host is a 36 processor Amazon instance. Each host has one 8 GB Solr process (Solr 6.6.2, java 8u121, G1 collector). No faceting, but we get very long queries, average length is 25 terms.
>
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>> On Mar 14, 2018, at 12:50 PM, Scott Prentice <sp...@leximation.com> wrote:
>>
>> Erick...
>>
>> Thanks. Yes. I think we were just going shard-happy without really understanding the purpose. I think we'll start by keeping things simple .. no shards, fewer replicas, maybe a bit more RAM. Then we can assess the performance and make adjustments as needed.
>>
>> Yes, that's the main reason for moving from our current non-cloud Solr setup to SolrCloud .. future flexibility as well as greater stability.
>>
>> Thanks!
>> ...scott
>>
>>
>> On 3/14/18 11:34 AM, Erick Erickson wrote:
>>> Scott:
>>>
>>> Eventually you'll hit the limit of your hardware, regardless of VMs.
>>> I've seen multiple VMs help a lot when you have really beefy hardware,
>>> as in 32 cores, 128G memory and the like. Otherwise it's iffier.
>>>
>>> re: sharding or not. As others wrote, sharding is only useful when a
>>> single collection grows past the limits of your hardware. Until that
>>> point, it's usually a better bet to get better hardware than shard.
>>> I've seen 300M docs fit in a single shard. I've also seen 10M strain
>>> pretty beefy hardware.. but from what you've said multiple shards are
>>> really not something you need to worry about.
>>>
>>> About balancing all this across VMs and/or machines. You have a _lot_
>>> of room to balance things. Let's say you put all your collections on
>>> one physical machine to start (not recommending, just sayin'). 6
>>> months from now you need to move collections 1-10 to another machine
>>> due to growth. You:
>>> 1> spin up a new machine
>>> 2> build out collections 1-10 on the new machine by using the
>>> Collections API ADDREPLICA.
>>> 3> once the new replicas are healthy, DELETEREPLICA on the old hardware.
>>>
>>> No down time. No configuration to deal with, SolrCloud will take care
>>> of it for you.
>>>
>>> Best,
>>> Erick
>>>
>>> On Wed, Mar 14, 2018 at 9:32 AM, Scott Prentice <sp...@leximation.com> wrote:
>>>> Emir...
>>>>
>>>> Thanks for the input. Our larger collections are localized content, so it
>>>> may make sense to shard those so we can target the specific index. I'll need
>>>> to confirm how it's being used, if queries are always within a language or
>>>> if they are cross-language.
>>>>
>>>> Thanks also for the link .. very helpful!
>>>>
>>>> All the best,
>>>> ...scott
>>>>
>>>>
>>>>
>>>>
>>>> On 3/14/18 2:21 AM, Emir Arnautović wrote:
>>>>> Hi Scott,
>>>>> There is no definite answer - it depends on your documents and query
>>>>> patterns. Sharding does come with an overhead, but it also allows Solr to
>>>>> parallelise search. Query latency is usually what tells you whether you
>>>>> need to split a collection into multiple shards or not. If you are OK with
>>>>> the latency, there is no need to split. Another scenario where shards make
>>>>> sense is when routing is used in the majority of queries, since that
>>>>> enables you to query only a subset of documents.
>>>>> Also, there is an indexing aspect where sharding helps - when high
>>>>> indexing throughput is needed, having multiple shards will spread the
>>>>> indexing load across multiple servers.
>>>>> It seems to me that you have no high indexing throughput requirement, so
>>>>> the main criterion should be query latency.
>>>>> Here is another blog post talking about this subject:
>>>>> http://www.od-bits.com/2018/01/solrelasticsearch-capacity-planning.html
>>>>>
>>>>> Thanks,
>>>>> Emir
>>>>> --
>>>>> Monitoring - Log Management - Alerting - Anomaly Detection
>>>>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>>>>
>>>>>
>>>>>
>>>>>> On 14 Mar 2018, at 01:01, Scott Prentice <sp...@leximation.com> wrote:
>>>>>>
>>>>>> We're in the process of moving from 12 single-core collections (non-cloud
>>>>>> Solr) on 3 VMs to a SolrCloud setup. Our collections aren't huge, ranging in
>>>>>> size from 50K to 150K documents with one at 1.2M docs. Our max query
>>>>>> frequency is rather low .. probably no more than 10-20/min. We do update
>>>>>> frequently, maybe 10-100 documents every 10 mins.
>>>>>>
>>>>>> Our prototype setup is using 3 VMs (4 core, 16GB RAM each), and we've got
>>>>>> each collection split into 2 shards with 3 replicas (one per VM). Also,
>>>>>> Zookeeper is running on each VM. I understand that it's best to have each ZK
>>>>>> server on a separate machine, but hoping this will work for now.
>>>>>>
>>>>>> This all seemed like a good place to start, but after reading lots of
>>>>>> articles and posts, I'm thinking that maybe our smaller collections (under
>>>>>> 100K docs) should just be one shard each, and maybe the 1.2M collection
>>>>>> should be more like 6 shards. How do you decide how many shards is right?
>>>>>>
>>>>>> Also, our current live system is separated into dev/stage/prod tiers;
>>>>>> now all of these tiers are together on each of the cloud VMs. This bothers
>>>>>> some people, thinking that it may make our production environment less
>>>>>> stable. I know that in an ideal world, we'd have them all on separate
>>>>>> systems, but with the replication, it seems like we're going to make the
>>>>>> overall system more stable. Is this a correct understanding?
>>>>>>
>>>>>> I'm just wondering if anyone has opinions on whether we're going in a
>>>>>> reasonable direction or not. Are there any articles that discuss these
>>>>>> initial sizing/scoping issues?
>>>>>>
>>>>>> Thanks!
>>>>>> ...scott
>>>>>>
>>>>>>
>


Re: Scoping SolrCloud setup

Posted by Walter Underwood <wu...@wunderwood.org>.
That would be my recommendation for a first setup. One Solr instance per host, one shard per collection. We run 5 million document cores with 8 GB of heap for the JVM. We size the RAM so that all the indexes fit in OS filesystem buffers.

Our big cluster is 32 hosts, 21 million documents in four shards. Each host is a 36 processor Amazon instance. Each host has one 8 GB Solr process (Solr 6.6.2, java 8u121, G1 collector). No faceting, but we get very long queries, average length is 25 terms.

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Mar 14, 2018, at 12:50 PM, Scott Prentice <sp...@leximation.com> wrote:
> 
> Erick...
> 
> Thanks. Yes. I think we were just going shard-happy without really understanding the purpose. I think we'll start by keeping things simple .. no shards, fewer replicas, maybe a bit more RAM. Then we can assess the performance and make adjustments as needed.
> 
> Yes, that's the main reason for moving from our current non-cloud Solr setup to SolrCloud .. future flexibility as well as greater stability.
> 
> Thanks!
> ...scott
> 
> 
> On 3/14/18 11:34 AM, Erick Erickson wrote:
>> Scott:
>> 
>> Eventually you'll hit the limit of your hardware, regardless of VMs.
>> I've seen multiple VMs help a lot when you have really beefy hardware,
>> as in 32 cores, 128G memory and the like. Otherwise it's iffier.
>> 
>> re: sharding or not. As others wrote, sharding is only useful when a
>> single collection grows past the limits of your hardware. Until that
>> point, it's usually a better bet to get better hardware than shard.
>> I've seen 300M docs fit in a single shard. I've also seen 10M strain
>> pretty beefy hardware.. but from what you've said multiple shards are
>> really not something you need to worry about.
>> 
>> About balancing all this across VMs and/or machines. You have a _lot_
>> of room to balance things. Let's say you put all your collections on
>> one physical machine to start (not recommending, just sayin'). 6
>> months from now you need to move collections 1-10 to another machine
>> due to growth. You:
>> 1> spin up a new machine
>> 2> build out collections 1-10 on the new machine by using the
>> Collections API ADDREPLICA.
>> 3> once the new replicas are healthy, DELETEREPLICA on the old hardware.
>> 
>> No down time. No configuration to deal with, SolrCloud will take care
>> of it for you.
>> 
>> Best,
>> Erick
>> 
>> On Wed, Mar 14, 2018 at 9:32 AM, Scott Prentice <sp...@leximation.com> wrote:
>>> Emir...
>>> 
>>> Thanks for the input. Our larger collections are localized content, so it
>>> may make sense to shard those so we can target the specific index. I'll need
>>> to confirm how it's being used, if queries are always within a language or
>>> if they are cross-language.
>>> 
>>> Thanks also for the link .. very helpful!
>>> 
>>> All the best,
>>> ...scott
>>> 
>>> 
>>> 
>>> 
>>> On 3/14/18 2:21 AM, Emir Arnautović wrote:
>>>> Hi Scott,
>>>> There is no definite answer - it depends on your documents and query
>>>> patterns. Sharding does come with an overhead, but it also allows Solr to
>>>> parallelise search. Query latency is usually what tells you whether you
>>>> need to split a collection into multiple shards or not. If you are OK with
>>>> the latency, there is no need to split. Another scenario where shards make
>>>> sense is when routing is used in the majority of queries, since that
>>>> enables you to query only a subset of documents.
>>>> Also, there is an indexing aspect where sharding helps - when high
>>>> indexing throughput is needed, having multiple shards will spread the
>>>> indexing load across multiple servers.
>>>> It seems to me that you have no high indexing throughput requirement, so
>>>> the main criterion should be query latency.
>>>> Here is another blog post talking about this subject:
>>>> http://www.od-bits.com/2018/01/solrelasticsearch-capacity-planning.html
>>>> 
>>>> Thanks,
>>>> Emir
>>>> --
>>>> Monitoring - Log Management - Alerting - Anomaly Detection
>>>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>>> 
>>>> 
>>>> 
>>>>> On 14 Mar 2018, at 01:01, Scott Prentice <sp...@leximation.com> wrote:
>>>>> 
>>>>> We're in the process of moving from 12 single-core collections (non-cloud
>>>>> Solr) on 3 VMs to a SolrCloud setup. Our collections aren't huge, ranging in
>>>>> size from 50K to 150K documents with one at 1.2M docs. Our max query
>>>>> frequency is rather low .. probably no more than 10-20/min. We do update
>>>>> frequently, maybe 10-100 documents every 10 mins.
>>>>> 
>>>>> Our prototype setup is using 3 VMs (4 core, 16GB RAM each), and we've got
>>>>> each collection split into 2 shards with 3 replicas (one per VM). Also,
>>>>> Zookeeper is running on each VM. I understand that it's best to have each ZK
>>>>> server on a separate machine, but hoping this will work for now.
>>>>> 
>>>>> This all seemed like a good place to start, but after reading lots of
>>>>> articles and posts, I'm thinking that maybe our smaller collections (under
>>>>> 100K docs) should just be one shard each, and maybe the 1.2M collection
>>>>> should be more like 6 shards. How do you decide how many shards is right?
>>>>> 
>>>>> Also, our current live system is separated into dev/stage/prod tiers;
>>>>> now all of these tiers are together on each of the cloud VMs. This bothers
>>>>> some people, thinking that it may make our production environment less
>>>>> stable. I know that in an ideal world, we'd have them all on separate
>>>>> systems, but with the replication, it seems like we're going to make the
>>>>> overall system more stable. Is this a correct understanding?
>>>>> 
>>>>> I'm just wondering if anyone has opinions on whether we're going in a
>>>>> reasonable direction or not. Are there any articles that discuss these
>>>>> initial sizing/scoping issues?
>>>>> 
>>>>> Thanks!
>>>>> ...scott
>>>>> 
>>>>> 
> 


Re: Scoping SolrCloud setup

Posted by Scott Prentice <sp...@leximation.com>.
Erick...

Thanks. Yes. I think we were just going shard-happy without really 
understanding the purpose. I think we'll start by keeping things simple 
.. no shards, fewer replicas, maybe a bit more RAM. Then we can assess 
the performance and make adjustments as needed.

Yes, that's the main reason for moving from our current non-cloud Solr 
setup to SolrCloud .. future flexibility as well as greater stability.

Thanks!
...scott


On 3/14/18 11:34 AM, Erick Erickson wrote:
> Scott:
>
> Eventually you'll hit the limit of your hardware, regardless of VMs.
> I've seen multiple VMs help a lot when you have really beefy hardware,
> as in 32 cores, 128G memory and the like. Otherwise it's iffier.
>
> re: sharding or not. As others wrote, sharding is only useful when a
> single collection grows past the limits of your hardware. Until that
> point, it's usually a better bet to get better hardware than shard.
> I've seen 300M docs fit in a single shard. I've also seen 10M strain
> pretty beefy hardware.. but from what you've said multiple shards are
> really not something you need to worry about.
>
> About balancing all this across VMs and/or machines. You have a _lot_
> of room to balance things. Let's say you put all your collections on
> one physical machine to start (not recommending, just sayin'). 6
> months from now you need to move collections 1-10 to another machine
> due to growth. You:
> 1> spin up a new machine
> 2> build out collections 1-10 on the new machine by using the
> Collections API ADDREPLICA.
> 3> once the new replicas are healthy, DELETEREPLICA on the old hardware.
>
> No down time. No configuration to deal with, SolrCloud will take care
> of it for you.
>
> Best,
> Erick
>
> On Wed, Mar 14, 2018 at 9:32 AM, Scott Prentice <sp...@leximation.com> wrote:
>> Emir...
>>
>> Thanks for the input. Our larger collections are localized content, so it
>> may make sense to shard those so we can target the specific index. I'll need
>> to confirm how it's being used, if queries are always within a language or
>> if they are cross-language.
>>
>> Thanks also for the link .. very helpful!
>>
>> All the best,
>> ...scott
>>
>>
>>
>>
>> On 3/14/18 2:21 AM, Emir Arnautović wrote:
>>> Hi Scott,
>>> There is no definite answer - it depends on your documents and query
>>> patterns. Sharding does come with an overhead, but it also allows Solr to
>>> parallelise search. Query latency is usually what tells you whether you
>>> need to split a collection into multiple shards or not. If you are OK with
>>> the latency, there is no need to split. Another scenario where shards make
>>> sense is when routing is used in the majority of queries, since that
>>> enables you to query only a subset of documents.
>>> Also, there is an indexing aspect where sharding helps - when high
>>> indexing throughput is needed, having multiple shards will spread the
>>> indexing load across multiple servers.
>>> It seems to me that you have no high indexing throughput requirement, so
>>> the main criterion should be query latency.
>>> Here is another blog post talking about this subject:
>>> http://www.od-bits.com/2018/01/solrelasticsearch-capacity-planning.html
>>>
>>> Thanks,
>>> Emir
>>> --
>>> Monitoring - Log Management - Alerting - Anomaly Detection
>>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>>
>>>
>>>
>>>> On 14 Mar 2018, at 01:01, Scott Prentice <sp...@leximation.com> wrote:
>>>>
>>>> We're in the process of moving from 12 single-core collections (non-cloud
>>>> Solr) on 3 VMs to a SolrCloud setup. Our collections aren't huge, ranging in
>>>> size from 50K to 150K documents with one at 1.2M docs. Our max query
>>>> frequency is rather low .. probably no more than 10-20/min. We do update
>>>> frequently, maybe 10-100 documents every 10 mins.
>>>>
>>>> Our prototype setup is using 3 VMs (4 core, 16GB RAM each), and we've got
>>>> each collection split into 2 shards with 3 replicas (one per VM). Also,
>>>> Zookeeper is running on each VM. I understand that it's best to have each ZK
>>>> server on a separate machine, but hoping this will work for now.
>>>>
>>>> This all seemed like a good place to start, but after reading lots of
>>>> articles and posts, I'm thinking that maybe our smaller collections (under
>>>> 100K docs) should just be one shard each, and maybe the 1.2M collection
>>>> should be more like 6 shards. How do you decide how many shards is right?
>>>>
>>>> Also, our current live system is separated into dev/stage/prod tiers;
>>>> now all of these tiers are together on each of the cloud VMs. This bothers
>>>> some people, thinking that it may make our production environment less
>>>> stable. I know that in an ideal world, we'd have them all on separate
>>>> systems, but with the replication, it seems like we're going to make the
>>>> overall system more stable. Is this a correct understanding?
>>>>
>>>> I'm just wondering if anyone has opinions on whether we're going in a
>>>> reasonable direction or not. Are there any articles that discuss these
>>>> initial sizing/scoping issues?
>>>>
>>>> Thanks!
>>>> ...scott
>>>>
>>>>


Re: Scoping SolrCloud setup

Posted by Erick Erickson <er...@gmail.com>.
Scott:

Eventually you'll hit the limit of your hardware, regardless of VMs.
I've seen multiple VMs help a lot when you have really beefy hardware,
as in 32 cores, 128G memory and the like. Otherwise it's iffier.

re: sharding or not. As others wrote, sharding is only useful when a
single collection grows past the limits of your hardware. Until that
point, it's usually a better bet to get better hardware than shard.
I've seen 300M docs fit in a single shard. I've also seen 10M strain
pretty beefy hardware.. but from what you've said multiple shards are
really not something you need to worry about.

About balancing all this across VMs and/or machines. You have a _lot_
of room to balance things. Let's say you put all your collections on
one physical machine to start (not recommending, just sayin'). 6
months from now you need to move collections 1-10 to another machine
due to growth. You:
1> spin up a new machine
2> build out collections 1-10 on the new machine by using the
Collections API ADDREPLICA.
3> once the new replicas are healthy, DELETEREPLICA on the old hardware.

No down time. No configuration to deal with, SolrCloud will take care
of it for you.
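
Steps 2 and 3 map onto two Collections API calls. A rough sketch in
Python (collection, shard, node, and replica names are placeholders):

    import requests

    admin = "http://localhost:8983/solr/admin/collections"  # any live node

    # 2> add a replica of a shard on the new machine
    requests.get(admin, params={
        "action": "ADDREPLICA",
        "collection": "collection1",   # placeholder
        "shard": "shard1",
        "node": "newhost:8983_solr",   # node name as shown in live_nodes
        "wt": "json",
    }).raise_for_status()

    # 3> once the new replica is active, remove the old one; find its
    # core_node name via the CLUSTERSTATUS action or the admin UI
    requests.get(admin, params={
        "action": "DELETEREPLICA",
        "collection": "collection1",
        "shard": "shard1",
        "replica": "core_node1",       # placeholder replica name
        "wt": "json",
    }).raise_for_status()

Repeat per shard and per collection being moved.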

Best,
Erick

On Wed, Mar 14, 2018 at 9:32 AM, Scott Prentice <sp...@leximation.com> wrote:
> Emir...
>
> Thanks for the input. Our larger collections are localized content, so it
> may make sense to shard those so we can target the specific index. I'll need
> to confirm how it's being used, if queries are always within a language or
> if they are cross-language.
>
> Thanks also for the link .. very helpful!
>
> All the best,
> ...scott
>
>
>
>
> On 3/14/18 2:21 AM, Emir Arnautović wrote:
>>
>> Hi Scott,
>> There is no definite answer - it depends on your documents and query
>> patterns. Sharding does come with an overhead, but it also allows Solr to
>> parallelise search. Query latency is usually what tells you whether you
>> need to split a collection into multiple shards or not. If you are OK with
>> the latency, there is no need to split. Another scenario where shards make
>> sense is when routing is used in the majority of queries, since that
>> enables you to query only a subset of documents.
>> Also, there is an indexing aspect where sharding helps - when high
>> indexing throughput is needed, having multiple shards will spread the
>> indexing load across multiple servers.
>> It seems to me that you have no high indexing throughput requirement, so
>> the main criterion should be query latency.
>> Here is another blog post talking about this subject:
>> http://www.od-bits.com/2018/01/solrelasticsearch-capacity-planning.html
>>
>> Thanks,
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>
>>
>>
>>> On 14 Mar 2018, at 01:01, Scott Prentice <sp...@leximation.com> wrote:
>>>
>>> We're in the process of moving from 12 single-core collections (non-cloud
>>> Solr) on 3 VMs to a SolrCloud setup. Our collections aren't huge, ranging in
>>> size from 50K to 150K documents with one at 1.2M docs. Our max query
>>> frequency is rather low .. probably no more than 10-20/min. We do update
>>> frequently, maybe 10-100 documents every 10 mins.
>>>
>>> Our prototype setup is using 3 VMs (4 core, 16GB RAM each), and we've got
>>> each collection split into 2 shards with 3 replicas (one per VM). Also,
>>> Zookeeper is running on each VM. I understand that it's best to have each ZK
>>> server on a separate machine, but hoping this will work for now.
>>>
>>> This all seemed like a good place to start, but after reading lots of
>>> articles and posts, I'm thinking that maybe our smaller collections (under
>>> 100K docs) should just be one shard each, and maybe the 1.2M collection
>>> should be more like 6 shards. How do you decide how many shards is right?
>>>
>>> Also, our current live system is separated into dev/stage/prod tiers;
>>> now all of these tiers are together on each of the cloud VMs. This bothers
>>> some people, thinking that it may make our production environment less
>>> stable. I know that in an ideal world, we'd have them all on separate
>>> systems, but with the replication, it seems like we're going to make the
>>> overall system more stable. Is this a correct understanding?
>>>
>>> I'm just wondering if anyone has opinions on whether we're going in a
>>> reasonable direction or not. Are there any articles that discuss these
>>> initial sizing/scoping issues?
>>>
>>> Thanks!
>>> ...scott
>>>
>>>
>>
>

Re: Scoping SolrCloud setup

Posted by Scott Prentice <sp...@leximation.com>.
Emir...

Thanks for the input. Our larger collections are localized content, so 
it may make sense to shard those so we can target the specific index. 
I'll need to confirm how it's being used, if queries are always within a 
language or if they are cross-language.

Thanks also for the link .. very helpful!

All the best,
...scott



On 3/14/18 2:21 AM, Emir Arnautović wrote:
> Hi Scott,
> There is no definite answer - it depends on your documents and query patterns. Sharding does come with an overhead, but it also allows Solr to parallelise search. Query latency is usually what tells you whether you need to split a collection into multiple shards or not. If you are OK with the latency, there is no need to split. Another scenario where shards make sense is when routing is used in the majority of queries, since that enables you to query only a subset of documents.
> Also, there is an indexing aspect where sharding helps - when high indexing throughput is needed, having multiple shards will spread the indexing load across multiple servers.
> It seems to me that you have no high indexing throughput requirement, so the main criterion should be query latency.
> Here is another blog post talking about this subject: http://www.od-bits.com/2018/01/solrelasticsearch-capacity-planning.html
>
> Thanks,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
>> On 14 Mar 2018, at 01:01, Scott Prentice <sp...@leximation.com> wrote:
>>
>> We're in the process of moving from 12 single-core collections (non-cloud Solr) on 3 VMs to a SolrCloud setup. Our collections aren't huge, ranging in size from 50K to 150K documents with one at 1.2M docs. Our max query frequency is rather low .. probably no more than 10-20/min. We do update frequently, maybe 10-100 documents every 10 mins.
>>
>> Our prototype setup is using 3 VMs (4 core, 16GB RAM each), and we've got each collection split into 2 shards with 3 replicas (one per VM). Also, Zookeeper is running on each VM. I understand that it's best to have each ZK server on a separate machine, but hoping this will work for now.
>>
>> This all seemed like a good place to start, but after reading lots of articles and posts, I'm thinking that maybe our smaller collections (under 100K docs) should just be one shard each, and maybe the 1.2M collection should be more like 6 shards. How do you decide how many shards is right?
>>
> Also, our current live system is separated into dev/stage/prod tiers; now all of these tiers are together on each of the cloud VMs. This bothers some people, thinking that it may make our production environment less stable. I know that in an ideal world, we'd have them all on separate systems, but with the replication, it seems like we're going to make the overall system more stable. Is this a correct understanding?
>>
> I'm just wondering if anyone has opinions on whether we're going in a reasonable direction or not. Are there any articles that discuss these initial sizing/scoping issues?
>>
>> Thanks!
>> ...scott
>>
>>
>


Re: Scoping SolrCloud setup

Posted by Emir Arnautović <em...@sematext.com>.
Hi Scott,
There is no definite answer - it depends on your documents and query patterns. Sharding does come with an overhead, but it also allows Solr to parallelise search. Query latency is usually what tells you whether you need to split a collection into multiple shards or not. If you are OK with the latency, there is no need to split. Another scenario where shards make sense is when routing is used in the majority of queries, since that enables you to query only a subset of documents.
Also, there is an indexing aspect where sharding helps - when high indexing throughput is needed, having multiple shards will spread the indexing load across multiple servers.
It seems to me that you have no high indexing throughput requirement, so the main criterion should be query latency.
Here is another blog post talking about this subject: http://www.od-bits.com/2018/01/solrelasticsearch-capacity-planning.html
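
To illustrate the routing case: with the default compositeId router you
put a shard key prefix in the document id, and queries that pass _route_
only touch the shard(s) holding that key. A sketch (collection and field
names are placeholders):

    import requests

    solr = "http://localhost:8983/solr/docs"  # placeholder collection

    # index with a shard key: ids sharing the "de!" prefix co-locate
    requests.post(solr + "/update", params={"commit": "true"},
                  json=[{"id": "de!doc-42", "title_txt": "Beispiel"}]
                  ).raise_for_status()

    # query only the shard(s) that hold the "de!" key
    r = requests.get(solr + "/select", params={
        "q": "title_txt:Beispiel",
        "_route_": "de!",
        "wt": "json",
    })
    print(r.json()["response"]["numFound"])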

Thanks,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 14 Mar 2018, at 01:01, Scott Prentice <sp...@leximation.com> wrote:
> 
> We're in the process of moving from 12 single-core collections (non-cloud Solr) on 3 VMs to a SolrCloud setup. Our collections aren't huge, ranging in size from 50K to 150K documents with one at 1.2M docs. Our max query frequency is rather low .. probably no more than 10-20/min. We do update frequently, maybe 10-100 documents every 10 mins.
> 
> Our prototype setup is using 3 VMs (4 core, 16GB RAM each), and we've got each collection split into 2 shards with 3 replicas (one per VM). Also, Zookeeper is running on each VM. I understand that it's best to have each ZK server on a separate machine, but hoping this will work for now.
> 
> This all seemed like a good place to start, but after reading lots of articles and posts, I'm thinking that maybe our smaller collections (under 100K docs) should just be one shard each, and maybe the 1.2M collection should be more like 6 shards. How do you decide how many shards is right?
> 
> Also, our current live system is separated into dev/stage/prod tiers; now all of these tiers are together on each of the cloud VMs. This bothers some people, thinking that it may make our production environment less stable. I know that in an ideal world, we'd have them all on separate systems, but with the replication, it seems like we're going to make the overall system more stable. Is this a correct understanding?
> 
> I'm just wondering if anyone has opinions on whether we're going in a reasonable direction or not. Are there any articles that discuss these initial sizing/scoping issues?
> 
> Thanks!
> ...scott
> 
>