Posted to solr-user@lucene.apache.org by "Aswath Srinivasan (TMS)" <as...@toyota.com> on 2016/01/22 23:18:49 UTC

Taking Solr to production

Here is the situation:


* 4 virtual machines (64-bit), each with 64 GB RAM and 512 GB storage

* About 2.5 million documents in total to be indexed

* Average document size is 512 KB (PDF and HTML files)

* Expected QPS is 150

* Incremental indexing runs once per day, at around 50,000 documents per day (updates and deletes combined)

That said, I was thinking of taking Solr to production with:


* 2 shards, 1 Leader & 3 Replicas

* 2 Solr instances per VM

* 3 ZooKeeper nodes on the same machines as Solr (3 of the 4 VMs will run an external ZooKeeper)

* Solr version 5.3.1
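
As a side note on the 3-node ZooKeeper ensemble in the list above: ZooKeeper needs a majority of its nodes to be reachable, so the arithmetic below sketches how many node failures an ensemble can absorb. The function names are made up for illustration; only the majority rule comes from ZooKeeper itself.

```python
def zk_quorum(ensemble_size: int) -> int:
    """Smallest majority of a ZooKeeper ensemble of the given size."""
    return ensemble_size // 2 + 1


def zk_tolerated_failures(ensemble_size: int) -> int:
    """Nodes that may be down while a majority is still reachable."""
    return ensemble_size - zk_quorum(ensemble_size)


# Compare the proposed 3-node ensemble with 4 and 5 nodes.
for nodes in (3, 4, 5):
    print(nodes, zk_quorum(nodes), zk_tolerated_failures(nodes))
```

Under this rule, putting ZooKeeper on 3 of the 4 VMs tolerates the loss of one ZooKeeper node, and a fourth node would not raise that number.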

Do you all think this setup will work? Will it serve 150 QPS?

I know that nobody can give a definite answer, and that the only way to find out is to run performance tests and tune from there. But there is another proposal, 4 shards with 1 Leader and 1 Replica each, which I'm not in favor of. So I'm posting here to get some peer opinion!
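
To make the two proposals easier to compare, here is a rough sketch that counts cores for each. It assumes "1 Leader & 3 Replicas" means four copies of each shard and "1 Leader and 1 Replica" means two; the `Layout` class is purely illustrative and not part of Solr.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    shards: int
    copies_per_shard: int  # leader included (assumed reading of the post)

    @property
    def total_cores(self) -> int:
        return self.shards * self.copies_per_shard

    def cores_per_vm(self, vms: int) -> float:
        return self.total_cores / vms


VMS = 4
proposed = Layout(shards=2, copies_per_shard=4)
alternative = Layout(shards=4, copies_per_shard=2)

for name, layout in (("proposed", proposed), ("alternative", alternative)):
    print(name, layout.total_cores, layout.cores_per_vm(VMS))
```

Both layouts come to 8 cores, or 2 per VM. The difference is that a distributed query has to visit one copy of every shard (2 versus 4 sub-requests), while the number of copies able to share the query load for a given shard is 4 versus 2.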

Thank you,
Aswath NS


Re: Taking Solr to production

Posted by Erick Erickson <er...@gmail.com>.
It boils down to whether the response rate when you query a single
shard is "acceptable", plus some overhead for sharding.

So, if you need 100 QPS and all you can get after tuning on a single
shard (which you can test with &distrib=false) is 10 QPS, you need
10 replicas.
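
The replica arithmetic above can be sketched as a small helper. The function name is made up, the 40 QPS figure is a hypothetical measurement, and the calculation ignores the sharding overhead mentioned earlier, so treat the result as a lower bound rather than a sizing rule.

```python
import math


def replicas_needed(target_qps: float, single_replica_qps: float) -> int:
    """Replicas required if each one sustains single_replica_qps after tuning."""
    if single_replica_qps <= 0:
        raise ValueError("single_replica_qps must be positive")
    return math.ceil(target_qps / single_replica_qps)


# The example from the message: 100 QPS needed, 10 QPS per replica.
print(replicas_needed(100, 10))

# The 150 QPS target from this thread with a hypothetical 40 QPS per replica.
print(replicas_needed(150, 40))
```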

But if a single shard can only get you responses back in 10 seconds,
you need more shards.
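
One way to take the single-shard measurement is to query one core directly with distrib=false and time the responses. The sketch below assumes a Solr node at http://localhost:8983/solr and a core named docs_shard1_replica1; both are placeholders, as are the helper names. The fetch function is passed in so the timing logic can be tried without a running server.

```python
import statistics
import time
from urllib.parse import urlencode
from urllib.request import urlopen


def single_core_url(base_url: str, core: str, query: str) -> str:
    """Build a select URL that stays on one core (no distributed fan-out)."""
    params = urlencode({"q": query, "distrib": "false", "wt": "json"})
    return f"{base_url}/{core}/select?{params}"


def http_get(url: str) -> bytes:
    """Fetch a URL and return the raw response body."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def median_latency_seconds(urls, fetch=http_get) -> float:
    """Time each request sequentially and return the median latency."""
    timings = []
    for url in urls:
        start = time.perf_counter()
        fetch(url)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Placeholder host, core name, and query; replace with real values.
url = single_core_url("http://localhost:8983/solr", "docs_shard1_replica1", "title:solr")
print(url)
# Against a live node: median_latency_seconds([url] * 100)
```

A sequential loop like this only gives latency for one client at a time. Measuring QPS under load needs concurrent clients and a realistic query mix.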

And so on....

Best,
Erick




RE: Taking Solr to production

Posted by "Aswath Srinivasan (TMS)" <as...@toyota.com>.
Thanks guys for all the responses.

True. What I wanted to convey is 2 shards with 4 replicas.

>> use more shards if the query latency is too high.

Shouldn't we go for more replicas if query latency is too high? I would have thought more shards are for when you index a large number of documents at a much higher frequency. Do you disagree?

There are no facets, but there are complex queries. My thinking was that 2 shards is a safe bet: it gives enough breathing space for the indexing jobs, and 4 replicas handle the high QPS. Am I thinking about this correctly?

I cannot thank you guys enough!!

Thank you,
Aswath NS



Re: Taking Solr to production

Posted by Jack Krupansky <ja...@gmail.com>.
"1 Leader & 3 Replicas"

SolrCloud does not distinguish leaders from replicas - that's old
master-slave terminology. The leader is just one of the replicas.

So, are you really talking about 2 shards with 4 replicas each or 2 shards
with 2 replicas each?

Putting multiple replica instances on each machine isn't buying you
anything, just making it more complicated to manage.

The number of shards is determined by the amount of data and whether the
target query latency can be achieved - use more shards if the query latency
is too high.

2.5 million (2,500,000) documents is rather small, so unless your queries
are running really slowly, it's not clear you even need sharding, but we
don't know your document and query complexity. Heavy faceting or complex
function queries?

The number of replicas is determined by query load (the number of
simultaneous query requests) as well as high-availability (HA) requirements.




-- Jack Krupansky


Re: Taking Solr to production

Posted by Walter Underwood <wu...@wunderwood.org>.
I agree, sharding may hurt more than it helps. Also, estimate the text size after the documents are processed.

We all love SolrCloud, but this could be a good application for traditional master/slave Solr. That means no ZooKeeper nodes, and it is really easy to add a new query slave: just clone the instance.

We run an index of homework questions that seems similar to yours.

* 7 million documents.
* 50 Gbyte index.
* Request rates of 5000 to 10,000 q/minute per server.
* No facets or highlighting (highlighting soon, we store term vectors).
* Amazon EC2 instances with 16 cores, 30 Gbytes RAM, index is in ephemeral SSD.
* Index updates once per day.
* Master/slave.
* Solr 4.10.4.

During peak traffic, the 95th percentile response time was about three seconds, but that is because the queries are entire homework questions, up to 1000 words, pasted into the query box. Yes, we have very unusual queries. Median response time was much better, about 50 milliseconds.
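
For comparison with the 150 QPS target in the original question, the per-server request rates listed above can be converted to queries per second. This is only a unit conversion of the reported figures; the server count at the end assumes, hypothetically, that the original poster's workload behaves like this one, which the very different queries make far from certain.

```python
import math


def per_minute_to_qps(queries_per_minute: float) -> float:
    """Convert a per-minute request rate to queries per second."""
    return queries_per_minute / 60


low_qps = per_minute_to_qps(5_000)
high_qps = per_minute_to_qps(10_000)
print(round(low_qps, 1), round(high_qps, 1))

# Hypothetical: servers needed for 150 QPS at the lower reported rate.
TARGET_QPS = 150
print(math.ceil(TARGET_QPS / low_qps))
```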

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)




Re: Taking Solr to production

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
Aswath Srinivasan (TMS) <as...@toyota.com> wrote:
> * Totally about 2.5 million documents to be indexed
> * Documents average size is 512 KB - pdfs and htmls

> This being said I was thinking I would take the Solr to production with,
> * 2 shards, 1 Leader & 3 Replicas

> Do you all think this set up will work? Will this serve me 150 QPS?

It certainly helps that you are batch updating. What is missing in this estimation is how large the documents are when indexed, as I guess the ½MB average is for the raw files? If they are your everyday short PDFs with images, meaning not a lot of text, handling 2M+ of them is easy. If they are all full-length books, it is another matter.
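
To put the raw-file guess in numbers, here is a back-of-the-envelope calculation using only the figures from the original question. It assumes the 512 KB average is the raw file size and that 1 KB is 1024 bytes. It says nothing about the indexed size, which is the number that actually matters and has to be measured.

```python
DOCS = 2_500_000
DAILY_DOCS = 50_000
AVG_DOC_BYTES = 512 * 1024  # assumed: 512 KB raw file size, 1 KB = 1024 bytes

raw_bytes = DOCS * AVG_DOC_BYTES
daily_bytes = DAILY_DOCS * AVG_DOC_BYTES

raw_tib = raw_bytes / 1024**4
daily_gib = daily_bytes / 1024**3
daily_fraction = DAILY_DOCS / DOCS

print(f"raw corpus: {raw_tib:.2f} TiB")
print(f"daily batch: {daily_gib:.1f} GiB ({daily_fraction:.0%} of the documents)")
```

The raw files add up to roughly 1.2 TiB, more than the 512 GB of storage on any single VM, so whether the setup is comfortable depends on how much smaller the extracted and indexed text turns out to be.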

Your document count is relatively low and if your index data end up being not-too-big (let's say 100GB), then you ought to consider having just a single shard with 4 replicas: There is a non-trivial overhead going from 1 shard to more than one, especially if you are doing faceting.

- Toke Eskildsen