Posted to solr-user@lucene.apache.org by neotorand <ne...@gmail.com> on 2018/04/11 10:15:36 UTC

Decision on Number of shards and collection

Hi Team
First of all, I take this opportunity to thank you all for creating a
beautiful place where people can explore, learn and debate.

I have been struggling for a couple of days to decide on this.

When I am creating a SolrCloud ecosystem I need to decide on the number of
shards and collections.
What are the best practices for making these decisions?

I believe heterogeneous data can be indexed into the same collection, and I can
have multiple shards so that the index is partitioned. So what is the need for a
second collection? Yes, when the collection size grows I should look at adding
collections. What exactly is that size? What KPI drives the decision to have
more collections? Any pointers or links to best practices?

When should I go for multiple shards?
Yes, when the shard size grows, right? What is that size and how do I benchmark it?

I am sorry if this question has already been asked; I have googled across
Quora, Stack Overflow and Lucidworks.

Regards
Neo






Re: Decision on Number of shards and collection

Posted by Shawn Heisey <ap...@elyograg.org>.
On 4/13/2018 1:44 AM, neotorand wrote:
> Let's say I have 5 different entities and they each have 10, 20, 30, 40 and 50
> attributes (columns) to be indexed/stored.
> Now if I store them in a single collection, are any empty spaces
> being created?
> Put another way, if I store heterogeneous data items in a single collection,
> is there by any means poor utilization of memory through the creation of empty
> holes?

If a document doesn't have some of the fields your schema is capable of 
addressing, no space or memory is consumed for the missing fields.

> What are the pros and cons of single vs Multiple.

If you have a single collection for different kinds of data, then you do 
get *some* economies of scale in the total index size.  Whether that 
means a significant size reduction or a small size reduction depends on 
your data.  The downside to one collection: If you have 500000 of each 
kind of document and combine five of them into one collection, then 
every query must look through 2.5 million documents instead of 500000 
documents. These are both small numbers for Solr, but the larger index 
is still going to take more time to search.  If there are any possible 
issues with security, you'll need to include one or more fields with 
every document with information about the type of document so that 
you're able to filter results according to the access privileges of the 
user making a query.
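
For instance, a minimal SolrJ sketch of that pattern (the collection name, field
names and the access rule are invented for illustration, and a SolrJ 7.x-style
CloudSolrClient builder is assumed):

    import java.util.Collections;
    import java.util.Optional;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class TypeFilterSketch {
        public static void main(String[] args) throws Exception {
            try (CloudSolrClient client = new CloudSolrClient.Builder(
                    Collections.singletonList("localhost:2181"), Optional.empty()).build()) {
                client.setDefaultCollection("combined");   // hypothetical combined collection

                // Index time: tag every document with its type.
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "order-42");
                doc.addField("doc_type", "order");
                client.add(doc);
                client.commit();

                // Query time: filter to the types this user is allowed to see.
                SolrQuery q = new SolrQuery("customer:acme");
                q.addFilterQuery("doc_type:(order OR invoice)");
                client.query(q);
            }
        }
    }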

With multiple collections, searching on each one is going to be faster 
than searching on a combined collection. Whether or not it's enough of a 
difference to matter will depend on how much data is involved, the 
nature of that data, and the nature of your queries.  But as Erick 
mentioned, you'll have less capability with Solr's join functionality -- 
assuming you even need that functionality.

Thanks,
Shawn


Re: Decision on Number of shards and collection

Posted by Erick Erickson <er...@gmail.com>.
Having documents without fields doesn't matter much.

Solr (well, Lucene actually) is pretty efficient about this. It
handles thousands of different field types, although I have to say
that when you have thousands of fields it's usually time to revisit
the design. It looks like your total field count is quite reasonable.

The biggest bit for single .vs. multiple is if you want to select
documents of one type based on some criteria of another type (think
join in DB terms). It's much easier with a single collection.
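
As a sketch of what that can look like with Solr's join query parser (field
names are invented here, and "product" and "manufacturer" documents are assumed
to live in the same collection):

    import org.apache.solr.client.solrj.SolrQuery;

    public class JoinSketch {
        // Return product documents whose maker_id matches the id of any
        // manufacturer document matching country:Sweden (same collection).
        static SolrQuery productsFromSwedishMakers() {
            SolrQuery q = new SolrQuery("{!join from=id to=maker_id}country:Sweden");
            q.addFilterQuery("doc_type:product");
            return q;
        }
    }

Joins across separate collections exist as well (via the fromIndex parameter),
but they come with placement restrictions, which is part of why a single
collection is easier for this use case.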

That said, if you start to use Solr "just like a database" then you
might also want to revisit your architecture ;)

Best,
Erick

On Fri, Apr 13, 2018 at 12:44 AM, neotorand <ne...@gmail.com> wrote:
> Hi Shawn,
> Thanks for the long explanation.
> Now, the 2 billion limit can be overcome by using shards.
>
> Now coming back to collections: unless we have a logical or business reason
> we should not go for more than one collection.
>
> Let's say I have 5 different entities and they each have 10, 20, 30, 40 and 50
> attributes (columns) to be indexed/stored.
> Now if I store them in a single collection, are any empty spaces
> being created?
> Put another way, if I store heterogeneous data items in a single collection,
> is there by any means poor utilization of memory through the creation of empty
> holes?
>
> What are the pros and cons of single vs. multiple collections?
>
> Thanks team for spending your valuable time to clarify.
>
> Regards
> Neo
>
>
>
>
>

Re: Decision on Number of shards and collection

Posted by neotorand <ne...@gmail.com>.
Hi Shawn,
Thanks for the long explanation.
Now, the 2 billion limit can be overcome by using shards.

Now coming back to collections: unless we have a logical or business reason
we should not go for more than one collection.

Let's say I have 5 different entities and they each have 10, 20, 30, 40 and 50
attributes (columns) to be indexed/stored.
Now if I store them in a single collection, are any empty spaces
being created?
Put another way, if I store heterogeneous data items in a single collection,
is there by any means poor utilization of memory through the creation of empty
holes?

What are the pros and cons of single vs. multiple collections?

Thanks team for spending your valuable time to clarify.

Regards
Neo






Re: Decision on Number of shards and collection

Posted by Shawn Heisey <ap...@elyograg.org>.
On 4/12/2018 4:57 AM, neotorand wrote:
> I read from the link you shared that
> "Shard cannot contain more than 2 billion documents since Lucene is using
> integer for internal IDs."
>
> In which Java class of the Solr implementation repository can this be found?

The 2 billion limit is a *hard* limit from Lucene.  It's not in Solr.
It's pretty much the only hard limit that Lucene actually has - there's 
a workaround for everything else.  Solr can overcome this limit for a 
single index by sharding the index into multiple physical indexes across 
multiple servers, which is more automated in SolrCloud than in 
standalone mode.

The 2 billion limit per individual index can't be raised. Lucene uses an 
"int" datatype to hold the internal ID everywhere it's used.  Java 
numeric types are signed, which means that the maximum number a 32-bit 
data type can hold is 2147483647.  This is the value returned by the 
Java constant Integer.MAX_VALUE.  A little bit is subtracted from that 
value to obtain the maximum it will attempt to use, to be absolutely 
sure it can't go over.

https://issues.apache.org/jira/browse/LUCENE-5843
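
As a small illustration (assuming a recent Lucene version on the classpath),
that enforced ceiling is exposed as a constant sitting just below
Integer.MAX_VALUE:

    import org.apache.lucene.index.IndexWriter;

    public class LuceneDocLimit {
        public static void main(String[] args) {
            // The hard per-index ceiling described above; a small safety margin
            // is subtracted from Integer.MAX_VALUE (see LUCENE-5843).
            System.out.println("Integer.MAX_VALUE    = " + Integer.MAX_VALUE);
            System.out.println("IndexWriter.MAX_DOCS = " + IndexWriter.MAX_DOCS);
        }
    }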

Raising the limit is theoretically possible, but not without *MAJOR* 
surgery to an extremely large amount of Lucene's code. The risk of bugs 
when attempting that change is *VERY* high -- it could literally take 
months to find them all and fix them.

The two most popular search engines using Lucene are Solr and 
elasticsearch. Both of these packages can overcome the 2 billion limit 
with sharding.

Summary: The 2 billion document limit can be frustrating, but since an 
index that large on a single machine is most likely not going to perform 
well and should be split across several machines, there's almost no 
value to raising the limit and risking a large number of software bugs.

Thanks,
Shawn


Re: Decision on Number of shards and collection

Posted by neotorand <ne...@gmail.com>.
Emir
I read from the link you shared that 
"Shard cannot contain more than 2 billion documents since Lucene is using
integer for internal IDs."

In which Java class of the Solr implementation repository can this be found?

Regards
Neo




Re: Decision on Number of shards and collection

Posted by neotorand <ne...@gmail.com>.
Thanks everyone for your beautiful explanations and valuable time.

Thanks Emir for the nice link:
http://www.od-bits.com/2018/01/solrelasticsearch-capacity-planning.html
Thanks Shawn for:
https://lucidworks.com/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

When should we have more collections?

We have a business reason to keep the data in separate collections.
We do not need to query all the data at once.

When should we have more shards?
Define an acceptable latency.
Keep adding documents to a shard until the latency is no longer acceptable. That
will define the shard size (SS).
Get the total size of all data to be indexed (TS).
numshards = TS / SS
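
For example (numbers invented purely for illustration): if testing shows a
single shard stays within the latency target up to about 50 million documents
(SS), and the full data set is expected to reach 400 million documents (TS),
then numshards = 400M / 50M = 8, rounding up if it does not divide evenly.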

One quick question.
@Shawn
If I have data in more than one collection, can I still query it all at once?
I think yes, as I read on the Solr site.
What are the pros and cons of single vs. multiple collections?
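
On the "query them at once" part: in SolrCloud a single request can indeed span
collections via the collection parameter, assuming the schemas are compatible
enough for one merged result set to make sense. A minimal SolrJ sketch, with
invented collection names:

    import org.apache.solr.client.solrj.SolrQuery;

    public class CrossCollectionQuerySketch {
        static SolrQuery searchBothCollections() {
            SolrQuery q = new SolrQuery("title:solr");
            q.set("collection", "products,articles");  // one request searches both collections
            return q;
        }
    }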

I have gone through the estimating memory and storage for Lucene/Solr article from
Lucidworks (https://lucidworks.com/2011/09/14/estimating-memory-and-storage-for-lucenesolr/).

@SOLR4189 I will go through the book and get back to you. Thanks.

Time is too short to explore such a long-lived open source technology.

Regards
Neo




Re: Decision on Number of shards and collection

Posted by SOLR4189 <Kl...@yandex.ru>.
I advise you to read the book Solr in Action. To answer your question you
need to take into account the server resources that you have (CPU, RAM and disk),
the index size, and the average size of a single document.




Re: Decision on Number of shards and collection

Posted by Erick Erickson <er...@gmail.com>.
50M is a ballpark number I use as a place to _start_ getting a handle
on capacity. It's useful solely to answer the "is it bigger than a breadbox
and smaller than a house" question. It's totally meaningless without
testing.

Say I'm talking to a client and we have no data. Some are scared
that their 10M docs will require lots of hardware. Saying "I usually
expect to see 50M docs on a node" gives them
some confidence that it's not going to require a massive hardware
investment and they can go forward with a PoC.

OTOH I have other clients saying "We have 100B documents" and
I have to say "You could be talking 200 nodes" which gives them
incentive to do a PoC to get a hard number.

I do recommend you keep adding (perhaps synthetic) docs to your
node until it tips over. Finding that your installation falls over at, say, 50M
docs means you need to start taking action before you reach that point. OTOH if you
load 150M docs on it and it still functions OK, you can breathe a lot
easier...
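
A minimal SolrJ sketch of that kind of synthetic load test (the field names,
batch size and latency threshold are all invented; the client is assumed to
point at a one-shard test collection):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.UUID;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrInputDocument;

    public class ShardSizingProbe {
        // Keep indexing synthetic batches until a representative query gets too slow.
        static long findTippingPoint(SolrClient client, long maxLatencyMs) throws Exception {
            long indexed = 0;
            while (true) {
                List<SolrInputDocument> batch = new ArrayList<>();
                for (int i = 0; i < 10_000; i++) {
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", UUID.randomUUID().toString());
                    doc.addField("title_t", "synthetic document " + (indexed + i));
                    batch.add(doc);
                }
                client.add(batch);
                client.commit();
                indexed += batch.size();

                QueryResponse rsp = client.query(new SolrQuery("title_t:synthetic"));
                if (rsp.getQTime() > maxLatencyMs) {
                    return indexed;   // roughly the per-shard capacity for this data and query
                }
            }
        }
    }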

Best,
Erick



On Wed, Apr 11, 2018 at 8:55 AM, Abhi Basu <90...@gmail.com> wrote:
>  *The BKM I have read so far (trying to find source) says 50 million
> docs/shard performs well. I have found this in my recent tests as well. But
> of course it depends on index structure, etc.*
>
> On Wed, Apr 11, 2018 at 10:37 AM, Shawn Heisey <ap...@elyograg.org> wrote:
>
>> On 4/11/2018 4:15 AM, neotorand wrote:
>> > I believe heterogeneous data can be indexed into the same collection, and I can
>> > have multiple shards so that the index is partitioned. So what is the need for a
>> > second collection? Yes, when the collection size grows I should look at adding
>> > collections. What exactly is that size? What KPI drives the decision to have
>> > more collections? Any pointers or links to best practices?
>>
>> There are no hard rules.  Many factors affect these decisions.
>>
>> https://lucidworks.com/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>>
>> Creating multiple collections should be done when there is a logical or
>> business reason for keeping different sets of data separate from each
>> other.  If there's never any need for people to query all the data at
>> once, then it might make sense to use separate collections.  Or you
>> might want to put them together just for convenience, and use data in
>> the index to filter the results to only the information that the user is
>> allowed to access.
>>
>> > When should I go for multiple shards?
>> > Yes, when the shard size grows, right? What is that size and how do I benchmark it?
>>
>> Some indexes function really well with 300 million documents or more per
>> shard.  Other indexes struggle with less than a million per shard.  It's
>> impossible to give you any specific number.  It depends on a bunch of
>> factors.
>>
>> If query rate is very high, then you want to keep the shard count low.
>> Using one shard might not be possible due to index size, but it should
>> be as low as you can make it.  You're also going to want to have a lot
>> of replicas to handle the load.
>>
>> If query rate is extremely low, then sharding the index can actually
>> *improve* performance, because there will be idle CPU capacity that can
>> be used for the subqueries.
>>
>> Thanks,
>> Shawn
>>
>>
>
>
> --
> Abhi Basu

Re: Decision on Number of shards and collection

Posted by Abhi Basu <90...@gmail.com>.
 *The BKM I have read so far (trying to find source) says 50 million
docs/shard performs well. I have found this in my recent tests as well. But
of course it depends on index structure, etc.*

On Wed, Apr 11, 2018 at 10:37 AM, Shawn Heisey <ap...@elyograg.org> wrote:

> On 4/11/2018 4:15 AM, neotorand wrote:
> > I believe heterogeneous data can be indexed into the same collection, and I can
> > have multiple shards so that the index is partitioned. So what is the need for a
> > second collection? Yes, when the collection size grows I should look at adding
> > collections. What exactly is that size? What KPI drives the decision to have
> > more collections? Any pointers or links to best practices?
>
> There are no hard rules.  Many factors affect these decisions.
>
> https://lucidworks.com/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>
> Creating multiple collections should be done when there is a logical or
> business reason for keeping different sets of data separate from each
> other.  If there's never any need for people to query all the data at
> once, then it might make sense to use separate collections.  Or you
> might want to put them together just for convenience, and use data in
> the index to filter the results to only the information that the user is
> allowed to access.
>
> > When should I go for multiple shards?
> > Yes, when the shard size grows, right? What is that size and how do I benchmark it?
>
> Some indexes function really well with 300 million documents or more per
> shard.  Other indexes struggle with less than a million per shard.  It's
> impossible to give you any specific number.  It depends on a bunch of
> factors.
>
> If query rate is very high, then you want to keep the shard count low.
> Using one shard might not be possible due to index size, but it should
> be as low as you can make it.  You're also going to want to have a lot
> of replicas to handle the load.
>
> If query rate is extremely low, then sharding the index can actually
> *improve* performance, because there will be idle CPU capacity that can
> be used for the subqueries.
>
> Thanks,
> Shawn
>
>


-- 
Abhi Basu

Re: Decision on Number of shards and collection

Posted by Shawn Heisey <ap...@elyograg.org>.
On 4/11/2018 4:15 AM, neotorand wrote:
> I believe heterogeneous data can be indexed into the same collection, and I can
> have multiple shards so that the index is partitioned. So what is the need for a
> second collection? Yes, when the collection size grows I should look at adding
> collections. What exactly is that size? What KPI drives the decision to have
> more collections? Any pointers or links to best practices?

There are no hard rules.  Many factors affect these decisions.

https://lucidworks.com/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

Creating multiple collections should be done when there is a logical or
business reason for keeping different sets of data separate from each
other.  If there's never any need for people to query all the data at
once, then it might make sense to use separate collections.  Or you
might want to put them together just for convenience, and use data in
the index to filter the results to only the information that the user is
allowed to access.

> When should I go for multiple shards?
> Yes, when the shard size grows, right? What is that size and how do I benchmark it?

Some indexes function really well with 300 million documents or more per
shard.  Other indexes struggle with less than a million per shard.  It's
impossible to give you any specific number.  It depends on a bunch of
factors.

If query rate is very high, then you want to keep the shard count low. 
Using one shard might not be possible due to index size, but it should
be as low as you can make it.  You're also going to want to have a lot
of replicas to handle the load.

If query rate is extremely low, then sharding the index can actually
*improve* performance, because there will be idle CPU capacity that can
be used for the subqueries.
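
As a hedged sketch of how those two knobs are set when a collection is created
(the names and numbers are invented, and the SolrJ CollectionAdminRequest API is
assumed):

    import java.util.Collections;
    import java.util.Optional;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.request.CollectionAdminRequest;

    public class CreateCollectionSketch {
        public static void main(String[] args) throws Exception {
            try (CloudSolrClient client = new CloudSolrClient.Builder(
                    Collections.singletonList("localhost:2181"), Optional.empty()).build()) {
                // High query rate: keep shards few, replicas many (illustrative numbers only).
                CollectionAdminRequest
                    .createCollection("orders", "orders_config",
                            /* numShards */ 2, /* replicationFactor */ 4)
                    .process(client);
            }
        }
    }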

Thanks,
Shawn


Re: Decision on Number of shards and collection

Posted by Emir Arnautović <em...@sematext.com>.
Hi,
Only you can tell what an acceptable query latency is (I can tell you the ideal one: it is 0 :)
Usually you start testing with a single shard, keep adding documents to it, and measure query latency. When you get close to the maximum allowed latency, you have your shard size. Then you estimate the number of documents in the index now and in the future, and divide that number by the shard size to get the number of shards. You then test to see how many shards you can have per node. Finally, you multiply the number of shards by the number of replicas you plan to have and divide by the number of shards per node to estimate the number of nodes.
If you are not happy with the numbers, you change some assumptions, e.g. the node type, and redo the tests.
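
As a tiny back-of-the-envelope illustration of that arithmetic (plain Java, all
numbers invented; the only real inputs are the ones you measure yourself):

    public class CapacityPlan {
        static long ceilDiv(long a, long b) {
            return (a + b - 1) / b;
        }

        static long estimateNodes(long expectedDocs, long docsPerShardAtOkLatency,
                                  int replicas, int shardsPerNode) {
            long shards = ceilDiv(expectedDocs, docsPerShardAtOkLatency);
            long totalShardReplicas = shards * replicas;
            return ceilDiv(totalShardReplicas, shardsPerNode);
        }

        public static void main(String[] args) {
            // e.g. 400M docs, 50M docs/shard measured, 2 replicas, 4 shard cores per node
            System.out.println(estimateNodes(400_000_000L, 50_000_000L, 2, 4)); // prints 4
        }
    }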

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 11 Apr 2018, at 15:30, neotorand <ne...@gmail.com> wrote:
> 
> Hi Emir,
> Thanks a lot for your reply.
> So when I design a Solr ecosystem I should start with some rough guess on
> shards and increase the number of shards to make performance better. What is
> the accepted/ideal response time? There should be a trade-off between
> response time and the number of shards as data keeps growing.
> 
> I agree that we split our index when response time increases. So what could
> that response time threshold or query latency be?
> 
> Thanks again!
> 
> 
> Regards
> priyadarshi
> 
> 
> 
> 
> 


Re: Decision on Number of shards and collection

Posted by neotorand <ne...@gmail.com>.
Hi Emir,
Thanks a lot for your reply.
So when I design a Solr ecosystem I should start with some rough guess on
shards and increase the number of shards to make performance better. What is
the accepted/ideal response time? There should be a trade-off between
response time and the number of shards as data keeps growing.

I agree that we split our index when response time increases. So what could
that response time threshold or query latency be?

Thanks again!


Regards
priyadarshi






Re: Decision on Number of shards and collection

Posted by Emir Arnautović <em...@sematext.com>.
Hi Neo,
Shard size determines query latency, so you split your index when queries become too slow. Distributed search comes with some overhead, so oversharding is not the way to go either. There is no hard rule for what the best numbers are, but here are some thoughts on how to approach this: http://www.od-bits.com/2018/01/solrelasticsearch-capacity-planning.html

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 11 Apr 2018, at 12:15, neotorand <ne...@gmail.com> wrote:
> 
> Hi Team
> First of all, I take this opportunity to thank you all for creating a
> beautiful place where people can explore, learn and debate.
> 
> I have been struggling for a couple of days to decide on this.
> 
> When I am creating a SolrCloud ecosystem I need to decide on the number of
> shards and collections.
> What are the best practices for making these decisions?
> 
> I believe heterogeneous data can be indexed into the same collection, and I can
> have multiple shards so that the index is partitioned. So what is the need for a
> second collection? Yes, when the collection size grows I should look at adding
> collections. What exactly is that size? What KPI drives the decision to have
> more collections? Any pointers or links to best practices?
> 
> When should I go for multiple shards?
> Yes, when the shard size grows, right? What is that size and how do I benchmark it?
> 
> I am sorry if this question has already been asked; I have googled across
> Quora, Stack Overflow and Lucidworks.
> 
> Regards
> Neo
> 
> 
> 
> 
> 