You are viewing a plain text version of this content. The canonical link for it is here.
Posted to solr-user@lucene.apache.org by Doss <it...@gmail.com> on 2019/01/04 06:26:28 UTC

Regarding Shards - Composite / Implicit , Replica Type - NRT / TLOG

Hi,

We are planning to setup a SOLR cloud with 6 nodes for 3 million records
(expected to grow to 5 million in a year), with 150 fields and over all
index would come around 120GB.

We plan to use NRT with 5 sec soft commit and 1 min hard commit.

Expected query volume would be 5000 select hits per second and 7000 inserts
/ updates per second.

Our records can be classified under 15 categories, but they will not have
even number of records, few categories will have more number of records.

Queries will also come in the same pattern, that is., categories with high
number of records will get high volume of select / updates.

For this situation we are confused in choosing what type of sharding would
help us in better performance in both select and updates?

Composite / implicit - Composite with 15 shards or implicit based on 15
categories.

Our select queries will have minimum 15 filters in fq, with extensive
function queries used in sort.

Updates will have 6 integer fields, 5 string fields and 4 string/integer
fields with multi valued.

If we choose implicit to boost select performance, our updates will be
heavy on few shards (major category shards), will this be a problem?

For our kind of situation which replica Type can we choose? All NRT or NRT
with TLOG ?

Thanks in advance!

Best,
Doss.

Re: Regarding Shards - Composite / Implicit , Replica Type - NRT / TLOG

Posted by Erick Erickson <er...@gmail.com>.
It's usually best to use compositeId routing. That distributes
the load evenly. Otherwise, _you_ have to be responsible
for making sure that the docs are reasonably evenly distributed,
which can be a pain.

Implicit routing is usually best in situations where you index
to a particular shard for a while then move on to another
shard, think news stories where you want to keep them for
30 days then dispose of them. Implicit lets you add/remove
shards on a daily basis. Doesn't sound particularly suitable for
your situation.

But I do have to ask why you're sharding at all? 5M docs is a fairly
small index by modern standards. There's some inevitable overhead
with sharding that you could avoid. Mostly I'm asking if you've
stress-tested with that query and update rate. The 7,000 updates/second
do worry me a bit with a single-shard solution, but if you get adequate
response times under that load, then there's no need to shard. Use all
the hardware to support querying.

Sharding will improve indexing throughput without doubt, Solr scales
roughly linearly with the number of shards. Do use CloudSolrClient
for your updates as it routes docs to the correct leader, avoiding
one extra hop.

Given your soft  commit setting of 5 seconds, I infer that the allowable
time for updates to be searchable is quite small, indicating that NRT
replicas are the way to go. I'll also say that this commit rate is pretty
aggressive given your volume, is it really necessary to be that short?
Your caches are going to be pretty useless since they won't stick around
for very long. Look carefully at the autowarming time, in order to
make any good use of your fitlerCache, you'll have to autowarm it some
and if you do, you need to insure that the autowarm interval is less than
your autocommit time.

Best,
Erick

On Thu, Jan 3, 2019 at 10:34 PM Doss <it...@gmail.com> wrote:
>
> Hi,
>
> We are planning to setup a SOLR cloud with 6 nodes for 3 million records
> (expected to grow to 5 million in a year), with 150 fields and over all
> index would come around 120GB.
>
> We plan to use NRT with 5 sec soft commit and 1 min hard commit.
>
> Expected query volume would be 5000 select hits per second and 7000 inserts
> / updates per second.
>
> Our records can be classified under 15 categories, but they will not have
> even number of records, few categories will have more number of records.
>
> Queries will also come in the same pattern, that is., categories with high
> number of records will get high volume of select / updates.
>
> For this situation we are confused in choosing what type of sharding would
> help us in better performance in both select and updates?
>
> Composite / implicit - Composite with 15 shards or implicit based on 15
> categories.
>
> Our select queries will have minimum 15 filters in fq, with extensive
> function queries used in sort.
>
> Updates will have 6 integer fields, 5 string fields and 4 string/integer
> fields with multi valued.
>
> If we choose implicit to boost select performance, our updates will be
> heavy on few shards (major category shards), will this be a problem?
>
> For our kind of situation which replica Type can we choose? All NRT or NRT
> with TLOG ?
>
> Thanks in advance!
>
> Best,
> Doss.

Re: Regarding Shards - Composite / Implicit , Replica Type - NRT / TLOG

Posted by Shawn Heisey <ap...@elyograg.org>.
On 1/3/2019 11:26 PM, Doss wrote:
> We are planning to setup a SOLR cloud with 6 nodes for 3 million records
> (expected to grow to 5 million in a year), with 150 fields and over all
> index would come around 120GB.
>
> We plan to use NRT with 5 sec soft commit and 1 min hard commit.

Five seconds is likely far too short an interval.  That's something 
you'll have to experiment with.

> Expected query volume would be 5000 select hits per second and 7000 inserts
> / updates per second.

5000 queries per second is an extremely high query rate.  I would guess 
that six nodes is far too few to handle that much of a query load.  It 
might also be plenty ... it's nearly impossible to gauge that with the 
information you've shared so far.  Usually the only way to find out for 
sure is to actually BUILD the system and try it.

7000 documents inserted per second is also ambitious.  It's achievable, 
but is almost certainly going to require parallel threads/processes 
indexing at the same time.  That's going to reduce the query volume you 
can handle.

If you expect 3 million documents to reach 120GB of index size, then 
each of those documents must be fairly large.  Large documents will 
index more slowly, and can also reduce query capacity.

Memory will be your biggest challenge.  If a Solr instance must handle 
120GB of index and achieve a high query volume, then you'll want that 
Solr instance to have about 128GB of memory, so the entire index will 
fit into the operating system disk cache.

> Our records can be classified under 15 categories, but they will not have
> even number of records, few categories will have more number of records.
>
> Queries will also come in the same pattern, that is., categories with high
> number of records will get high volume of select / updates.
>
> For this situation we are confused in choosing what type of sharding would
> help us in better performance in both select and updates?
>
> Composite / implicit - Composite with 15 shards or implicit based on 15
> categories.

15 shards is probably far too many for only a few million documents, 
especially with the extremely high query volume and low host count you 
have projected.  With a high query volume, you want the absolute minimum 
number of shards possible ... one if you can.  Handling several million 
documents in a single shard is usually doable.

> Our select queries will have minimum 15 filters in fq, with extensive
> function queries used in sort.

When a query has multiple filters, they will generally all be run in 
parallel, not sequentially.  This can affect the query volume you can 
handle, it's very difficult to know whether the effect will be helpful 
or harmful.

> For our kind of situation which replica Type can we choose? All NRT or NRT
> with TLOG ?

If you will only have two replicas, they should both be either NRT or 
TLOG.  With more than two replicas, my suggestion would be to make two 
of them TLOG and the rest PULL. One of the TLOG replicas will be elected 
leader, and all other replicas will copy the index from the leader, 
rather than do the independent indexing that NRT replicas do.

Thanks,
Shawn