Posted to java-user@lucene.apache.org by neotorand <ne...@gmail.com> on 2018/04/11 10:16:28 UTC

Decision on Number of shards and collection

Hi Team,
First of all, I take this opportunity to thank you all for creating a
beautiful place where people can explore, learn and debate.

I have been stuck on this decision for a couple of days.

When creating a SolrCloud ecosystem, I need to decide on the number of
shards and collections.
What are the best practices for making these decisions?

I believe heterogeneous data can be indexed into the same collection, and I
can have multiple shards to partition the index. So what is the need for a
second collection? Yes, when the collection size grows I should look at
adding more collections, but what exactly is that size? What KPI drives the
decision to have more collections? Any pointers or links to best practices
would be appreciated.

When should I go for multiple shards?
When the shard size grows, right? What is that size, and how do I benchmark
it?

I am sorry if this question has already been asked, but I have already
searched Quora, Stack Overflow and Lucid.

Regards 
Neo



--
Sent from: http://lucene.472066.n3.nabble.com/Lucene-Java-Users-f532864.html

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Decision on Number of shards and collection

Posted by Adrien Grand <jp...@gmail.com>.

Hi neotorand,

You will likely find better help on the solr-user mailing list. This
mailing list is for questions about Lucene.


Re: Decision on Number of shards and collection

Posted by Denis Bazhenov <do...@gmail.com>.

Hello.

The answer will depend on the context of the system. I'll give you my point of view from the perspective of developing and supporting medium- to large-scale search systems (400M documents, 40 shards, about 20 collections, 30-40+ physical servers).

Basically, I'd recommend:

1. If there are distinct sets of documents that will never be queried together (heterogeneous or not), always split them into different collections. This will reduce search time, index time and so forth. It may be a little more complex from an operational point of view, though.

2. For sharding (splitting a logically homogeneous index into parts) I have no easy answer, and the trade-off runs the opposite way. An inverted index is a very efficient data structure, and the most efficient implementation of a search system (in terms of CPU time spent per query) is a single-index system with no sharding. Sadly, such a system will suffer from low search throughput. When you split the index you increase search throughput, but you also increase the cost of processing a single query. The more hardware you have, the more important efficiency becomes.

This implies that if you have room to increase search throughput using replicas instead of sharding, you should do so. It is the simpler and more efficient way, but only if:

1. the index is small enough to fit in the RAM of a single box;
2. your search queries/algorithms are efficient enough in terms of GC pressure for one box to handle a reasonable number of requests.
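
The rule of thumb above can be sketched as a small planning helper. This is only an illustration under stated assumptions, not a sizing formula from Solr or from this thread: the function names, the `ram_headroom` fraction and the sample numbers are all hypothetical, and `qps_per_copy` is something you would have to measure yourself by load-testing one full copy of the index with realistic queries. The second function builds a Solr Collections API `CREATE` request URL from the result; the host and collection name are placeholders.

```python
import math
from urllib.parse import urlencode


def plan_layout(index_gb, ram_gb_per_box, target_qps, qps_per_copy,
                ram_headroom=0.5):
    """Suggest (num_shards, replication_factor) for one collection.

    Assumptions (hypothetical, not from the thread):
    - ram_headroom is the fraction of a box's RAM the index may occupy;
    - qps_per_copy is the measured throughput of one full copy of the
      index, which also reflects how much GC pressure the queries cause.
    """
    if qps_per_copy <= 0:
        raise ValueError("benchmark first: qps_per_copy must be > 0")

    usable_ram_gb = ram_gb_per_box * ram_headroom

    # Condition 1: if the index fits in the RAM of a single box, do not shard.
    if index_gb <= usable_ram_gb:
        num_shards = 1
    else:
        # Otherwise split only as far as needed for each shard to fit.
        num_shards = math.ceil(index_gb / usable_ram_gb)

    # Condition 2: scale throughput by adding replicas of the whole index.
    replication_factor = max(1, math.ceil(target_qps / qps_per_copy))
    return num_shards, replication_factor


def create_collection_url(base_url, name, num_shards, replication_factor):
    """Build a Collections API CREATE URL (the request itself is not sent)."""
    params = urlencode({
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
    })
    return f"{base_url}/admin/collections?{params}"


if __name__ == "__main__":
    # Hypothetical numbers: a 20 GB index, 64 GB boxes, 900 QPS target,
    # 300 QPS measured on a single copy of the index.
    shards, replicas = plan_layout(20, 64, 900, 300)
    print(create_collection_url("http://localhost:8983/solr", "products",
                                shards, replicas))
```

With these sample numbers the index fits on one box, so the sketch keeps a single shard and reaches the target throughput with replicas only. Treat the output as a starting point for a benchmark, not as a final answer.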


---
Denis Bazhenov <do...@gmail.com>
