Posted to solr-user@lucene.apache.org by Nishanth S <ni...@gmail.com> on 2015/01/07 23:29:30 UTC

Determining the Number of Solr Shards

Hi All,

I am working on coming up with a Solr architecture layout for my use
case. We are a very write-heavy application with no downtime tolerance,
and we have less strict SLAs on reads than on writes. I am looking at
around 12K TPS, with an average indexed Solr document size in the range of
6 kB. I would like to go with 3 replicas for that extra fault tolerance,
and I am trying to identify the number of shards. The machines are
monstrous, with around 100 GB of RAM and more than 24 cores each. Is there
a way to arrive at the desired number of shards in this case? Any pointers
would be helpful.


Thanks,
Nishanth

Re: Determining the Number of Solr Shards

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
On Thu, 2015-01-08 at 22:55 +0100, Nishanth S wrote:
> Thanks guys for your inputs. I would be looking at around 100 TB of total
> index size with 5,100 million documents [...]

That is a large corpus when coupled with your high indexing & QPS
requirements. Are the queries complex too? Will you be doing non-trivial
faceting?

Your requirements are so high that any guesswork at this point is likely
to be wrong by an order of magnitude. What is very certain is that you
will need serious hardware. Your starting point should not be to try and
estimate the number of shards. Start by building a test setup.

- Toke Eskildsen



RE: Determining the Number of Solr Shards

Posted by Andrew Butkus <an...@c6-intelligence.com>.
We decided to go back down to 20 shards, as we kept having the query-time
spikes. If it were a memory issue, I would expect the same performance
problems with 20 shards, so I think this may be a problem in Solr rather
than in our configuration or amount of RAM.


In any case, we have thought about adding some more servers to our
SolrCloud cluster. Is there an easy way to add servers to the cluster
without having to reshard and re-index? I have looked at the Collections
API and not discovered one yet ...
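
The closest candidate seems to be the Collections API ADDREPLICA action
(available since Solr 4.8), which places a replica of an existing shard on
a newly added node without resharding or reindexing; changing the shard
count itself would still mean SPLITSHARD or a full reindex. A minimal
SolrJ sketch of the call, assuming a recent SolrJ release (the factory
method style below is newer than the setter style in older clients), with
made-up ZooKeeper, collection, and node names:

    // Sketch only: place one more replica of shard1 on a newly added node.
    // ZK address, collection, and node name below are illustrative examples.
    import java.util.Collections;
    import java.util.Optional;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.request.CollectionAdminRequest;

    public class AddReplicaSketch {
      public static void main(String[] args) throws Exception {
        try (CloudSolrClient client = new CloudSolrClient.Builder(
            Collections.singletonList("zk1:2181"), Optional.empty()).build()) {
          CollectionAdminRequest.addReplicaToShard("mycollection", "shard1")
              .setNode("newhost:8983_solr")  // node name as listed in live_nodes
              .process(client);
        }
      }
    }

The same operation is available over plain HTTP via
/admin/collections?action=ADDREPLICA&collection=mycollection&shard=shard1&node=newhost:8983_solr.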

Thanks

Andy


-----Original Message-----
From: Jack Krupansky [mailto:jack.krupansky@gmail.com] 
Sent: 08 January 2015 22:17
To: solr-user@lucene.apache.org
Subject: Re: Determining the Number of Solr Shards

My final advice would be my standard proof-of-concept implementation
advice: test a configuration with 10% (or 5%) of the target data size and
10% (or 5%) of the estimated resource requirements (maybe 25% of the
estimated RAM) and see how well it performs. [...]

Re: Determining the Number of Solr Shards

Posted by Jack Krupansky <ja...@gmail.com>.
My final advice would be my standard proof-of-concept implementation
advice: test a configuration with 10% (or 5%) of the target data size and
10% (or 5%) of the estimated resource requirements (maybe 25% of the
estimated RAM) and see how well it performs.

Take the actual index size and multiply by 10 (or 20 for a 5% load) to get
a closer estimate of total storage required.

If a 10% load fails to perform well with 25% of the total estimated RAM,
then you can be sure that you'll have problems with 10x the data and only
4x the RAM. Increase the RAM for that 10% load until you get acceptable
performance for both indexing and a full range of queries, and then use
10x that RAM for the 100% load. That's the OS memory available for file
caching, not the total system RAM.
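
To make that extrapolation concrete, here is a tiny arithmetic sketch of
the rule; every input value below is an invented placeholder to be
replaced with measurements from the actual 10% test:

    // Illustrative only: extrapolate a 10% proof-of-concept test to full scale.
    public class SizingExtrapolation {
      public static void main(String[] args) {
        double sampleFraction = 0.10;  // fraction of the corpus actually indexed
        double sampleIndexGB  = 1000;  // measured on-disk index size of the sample
        double sampleCacheGB  = 250;   // OS file cache at which the sample performed well
        double scale = 1.0 / sampleFraction;
        // Index size is assumed to grow roughly linearly with document count.
        System.out.printf("Projected total index size: %.0f GB%n", sampleIndexGB * scale);
        // Per the rule above: 10x the RAM that made the 10% load perform well,
        // counted as OS memory for file caching, not total system RAM.
        System.out.printf("Projected OS file cache:    %.0f GB (cluster-wide)%n",
            sampleCacheGB * scale);
      }
    }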

-- Jack Krupansky

On Thu, Jan 8, 2015 at 4:55 PM, Nishanth S <ni...@gmail.com> wrote:

> Thanks guys for your inputs. I would be looking at around 100 TB of total
> index size, with 5,100 million (5.1 billion) documents, for a period of
> 30 days before we purge the indexes. I had estimated it slightly on the
> higher side of things, but that's where I feel we would be.
> [...]

Re: Determining the Number of Solr Shards

Posted by Nishanth S <ni...@gmail.com>.
Thanks guys for your inputs. I would be looking at around 100 TB of total
index size, with 5,100 million (5.1 billion) documents, for a period of
30 days before we purge the indexes. I had estimated it slightly on the
higher side of things, but that's where I feel we would be.

Thanks,
Nishanth

On Wed, Jan 7, 2015 at 7:50 PM, Shawn Heisey <ap...@elyograg.org> wrote:

> On 1/7/2015 7:14 PM, Nishanth S wrote:
> > Thanks Shawn and Walter. Yes, those are 12,000 writes/second. [...]
>
> I don't think indexing 12000 docs per second would be too much for Solr
> to handle, as long as you architect the indexing application properly. [...]

Re: Determining the Number of Solr Shards

Posted by Shawn Heisey <ap...@elyograg.org>.
On 1/7/2015 7:14 PM, Nishanth S wrote:
> Thanks Shawn and Walter. Yes, those are 12,000 writes/second. Reads for
> the moment would be in the 1,000 reads/second range. Guess finding out the
> right number of shards would be my starting point.

I don't think indexing 12000 docs per second would be too much for Solr
to handle, as long as you architect the indexing application properly.
You would likely need to have several indexing threads or processes that
index in parallel.  Solr is fully thread-safe and can handle several
indexing requests at the same time.  If the indexing application is
single-threaded, indexing speed will not reach its full potential.
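
As a rough illustration, the sketch below fans pre-built batches out over
a fixed thread pool; the collection name, pool size, and batch granularity
are all assumptions, and commits are left to the autoCommit settings in
solrconfig.xml rather than issued per batch:

    // Minimal parallel-indexing sketch with SolrJ (names and sizes are assumptions).
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class ParallelIndexer {
      static void indexAll(CloudSolrClient client,
                           List<List<SolrInputDocument>> batches) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(8); // several indexer threads
        for (List<SolrInputDocument> batch : batches) {         // e.g. a few hundred docs each
          pool.submit(() -> client.add("mycollection", batch)); // CloudSolrClient is thread-safe
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
      }
    }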

Be aware that indexing at the same time as querying will reduce the
number of queries per second that you can handle.  In an environment
where both reads and writes are heavy like you have described, more
shards and/or more replicas might be required.

For the query side ... even 1000 queries per second is a fairly heavy
query rate.  You're likely to need at least a few replicas, possibly
several, to handle that.  The type and complexity of the queries you do
will make a big difference as well.  To handle that query level, I would
still recommend only running one shard replica on each server.  If you
have three shards and three replicas, that means 9 Solr servers.

How many documents will you have in total?  You said they are about 6KB
each ... but depending on the fieldType definitions (and the analysis
chain for TextField types), 6KB might be very large or fairly small.

Do you have any idea how large the Solr index will be with all your
documents?  Estimating that will require indexing a significant
percentage of your documents with the actual schema and config that you
will use in production.

If I know how many documents you have, how large the full index will be,
and can see an example of the more complex queries you will do, I can
make *preliminary* guesses about the number of shards you might need.  I
do have to warn you that it will only be a guess.  You'll have to
experiment to see what works best.

Thanks,
Shawn


Re: Determining the Number of Solr Shards

Posted by Erick Erickson <er...@gmail.com>.
1,000 queries/second is not trivial either. My straw-man starting point is
about 50 QPS per replica, and (as the link Shawn provided indicates) only
testing will determine whether that's realistic.

So going for 1,000 queries/second, you're talking about 20 replicas for
each shard.

And we haven't even talked about the number of shards yet.

You're actually quite a ways from being able to predict much about
hardware.

For instance, what is your retention, i.e. how long will you have to keep
the documents? Let's assume that your _average_ writes/second is even
5,000. That's 18M docs/hour, or 400+M docs/day. My (again straw-man)
number for how many docs you can put on a single shard is 100M (again with
the caveat that only testing will tell; it may be 20M, and it may be 200M
or even more).

Let's be generous and, for round numbers, assume you're adding 400M
docs/day and each shard can hold 200M docs (WARNING! These are _very_
optimistic numbers!). Then you're talking 20 x 2 x (number of days of
retention you need) replica cores.

Don't mean to be too much of a downer, but this is not as simple as
throwing a few big machines at the problem and being good to go.
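
Spelling that arithmetic out, with every constant a straw-man guess that
only testing can confirm:

    // Straw-man capacity math; all constants are placeholders, not measurements.
    public class ClusterMath {
      public static void main(String[] args) {
        double writesPerSec  = 5000;   // assumed sustained average ingest
        double docsPerShard  = 200e6;  // optimistic per-shard document capacity
        double targetQPS     = 1000;
        double qpsPerReplica = 50;     // straw-man per-replica query throughput
        int retentionDays    = 30;     // retention period from the thread

        double docsPerDay       = writesPerSec * 86400;      // ~432M docs/day
        double shardsPerDay     = docsPerDay / docsPerShard; // ~2 new shards/day
        double replicasPerShard = targetQPS / qpsPerReplica; // ~20
        double totalCores = shardsPerDay * retentionDays * replicasPerShard;
        System.out.printf("Replica cores needed: %.0f%n", totalCores); // ~1300
      }
    }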

Best,
Erick

On Wed, Jan 7, 2015 at 6:14 PM, Nishanth S <ni...@gmail.com> wrote:
> Thanks Shawn and Walter. Yes, those are 12,000 writes/second. Reads for
> the moment would be in the 1,000 reads/second range. [...]

Re: Determining the Number of Solr Shards

Posted by Jack Krupansky <ja...@gmail.com>.
Anybody on the list have a feel for how many simultaneous queries Solr can
handle in parallel? Will it scale linearly with the number of CPU cores?
Or are there other bottlenecks or locks in Lucene or Solr such that, even
with more CPU cores, the Solr server will be saturated with fewer queries
than the number of CPU cores?

-- Jack Krupansky

On Wed, Jan 7, 2015 at 9:14 PM, Nishanth S <ni...@gmail.com> wrote:

> Thanks Shawn and Walter. Yes, those are 12,000 writes/second. Reads for
> the moment would be in the 1,000 reads/second range. [...]

Re: Determining the Number of Solr Shards

Posted by Nishanth S <ni...@gmail.com>.
Thanks Shawn and Walter. Yes, those are 12,000 writes/second. Reads for
the moment would be in the 1,000 reads/second range. Guess finding out the
right number of shards would be my starting point.

Thanks,
Nishanth


On Wed, Jan 7, 2015 at 6:28 PM, Walter Underwood <wu...@wunderwood.org>
wrote:

> This is described as “write heavy”, so I think that is 12,000
> writes/second, not queries. [...]

Re: Determining the Number of Solr Shards

Posted by Walter Underwood <wu...@wunderwood.org>.
This is described as “write heavy”, so I think that is 12,000 writes/second, not queries.

Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/


On Jan 7, 2015, at 5:16 PM, Shawn Heisey <ap...@elyograg.org> wrote:

> On 1/7/2015 3:29 PM, Nishanth S wrote:
> > I am working on coming up with a Solr architecture layout for my use
> > case. [...]
>
> This is one of those questions that's nearly impossible to answer
> without field trials that have a production load on a production index. [...]


Re: Determining the Number of Solr Shards

Posted by Shawn Heisey <ap...@elyograg.org>.
On 1/7/2015 3:29 PM, Nishanth S wrote:
> I am working on coming up with a Solr architecture layout for my use
> case. We are a very write-heavy application with no downtime tolerance,
> and we have less strict SLAs on reads than on writes. I am looking at
> around 12K TPS, with an average indexed Solr document size in the range
> of 6 kB. I would like to go with 3 replicas for that extra fault
> tolerance, and I am trying to identify the number of shards. The machines
> are monstrous, with around 100 GB of RAM and more than 24 cores each. Is
> there a way to arrive at the desired number of shards in this case? Any
> pointers would be helpful.

This is one of those questions that's nearly impossible to answer
without field trials that have a production load on a production index. 
Minor changes to either config or schema can have a major impact on the
query load Solr will support.

https://lucidworks.com/blog/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

A query load of 12000 queries per second is VERY high.  That is likely
to require a **LOT** of hardware, because you're going to need a lot of
replicas.  Because each server will be handling quite a lot of
simultaneous queries, the best results will come from having only one
replica (solr core) per server.

Generally you'll get better results for a high query load if you don't
shard your index, but depending on how many docs you have, you might
want to shard.  You haven't said how many docs you have.

The key to excellent performance with Solr is to make sure that the
system never hits the disk to read index data -- for 12000 queries per
second, the index must be fully cached in RAM.  If Solr must go to the
actual disk, query performance will drop significantly.
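
One way to ground that rule is to measure the on-disk index size and
compare it with the memory left for the OS page cache after the JVM heap
is taken out. A small sketch (the index path is an example):

    // Sketch: sum the on-disk size of an index directory (path is an example),
    // to compare against the OS memory available for file caching.
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.stream.Stream;

    public class IndexSize {
      public static void main(String[] args) throws IOException {
        Path indexDir = Paths.get("/var/solr/data/mycollection/index");
        try (Stream<Path> files = Files.walk(indexDir)) {
          long bytes = files.filter(Files::isRegularFile)
              .mapToLong(p -> p.toFile().length())
              .sum();
          System.out.printf("Index size: %.1f GB%n", bytes / 1e9);
        }
      }
    }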

Thanks,
Shawn