Posted to solr-user@lucene.apache.org by Brent P <br...@gmail.com> on 2016/08/26 02:51:43 UTC

High load, frequent updates, low latency requirement use case

I'm trying to set up a Solr Cloud cluster to support a system with the
following characteristics:

It will be writing documents at a rate of approximately 500 docs/second,
and running search queries at about the same rate.
The documents are fairly small, with about 10 fields, most of which range
in size from a simple int to a string that holds a UUID. There's a date
field, and then three text fields that typically hold in the range of 350
to 500 chars.
Documents should be available for searching within 30 seconds of being
added.
We need an average search latency of 50 ms or faster.

We've been using DataStax Enterprise with decent results, but trying to
determine if we can get more out of the latest version of Solr Cloud, as we
originally chose DSE ~4 years ago *I believe* because its Cassandra-backed
Solr provided redundancy/high availability features that weren't currently
available with straight Solr (not even sure if Solr Cloud was available
then).

We have 24 fairly beefy servers (96 CPU cores, 256 GB RAM, SSDs) for the
task, and I'm trying to figure out the best way to distribute the documents
into collections, cores, and shards.

If I can categorize a document into one of 8 "types", should I create 8
collections? Is that going to provide better performance than putting them
all into one collection and then using a filter query with the type field
when doing a search?

What are the options/things to consider when deciding on the number of
shards for each collection? As far as I know, I don't choose the number of
Solr cores, that is just determined based on the replication factor (and
shard count?).

Some of the settings I'm using in my solrconfig that seem important:
<lockType>${solr.lock.type:native}</lockType>
<autoCommit>
  <maxTime>${solr.autoCommit.maxTime:30000}</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>${solr.autoSoftCommit.maxTime:1000}</maxTime>
</autoSoftCommit>
<useColdSearcher>true</useColdSearcher>
<maxWarmingSearchers>8</maxWarmingSearchers>

I've got the updateLog/transaction log enabled, as I think I read it's
required for Solr Cloud.
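It's pretty much the stock config, something like:

<updateLog>
  <str name="dir">${solr.ulog.dir:}</str>
</updateLog>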

Are there any settings I should look at that affect performance
significantly, especially outside of the solrconfig.xml for each collection
(like jetty configs, logging properties, etc)?

How much impact do the <lib/> directives in the solrconfig have on
performance? Do they only take effect if I have something configured that
requires them, and therefore if I'm missing one that I need, I'd get an
error if it's not defined?

Any help will be greatly appreciated. Thanks!
-Brent

Re: High load, frequent updates, low latency requirement use case

Posted by Shawn Heisey <ap...@elyograg.org>.
On 8/25/2016 8:51 PM, Brent P wrote:

Replies inline.  Hopefully they'll be easily visible.

> It will be writing documents at a rate of approximately 500 docs/second,
> and running search queries at about the same rate.

500 queries per second is a LOT.  You're going to probably need a lot of
replicas to handle the load.

> The documents are fairly small, with about 10 fields, most of which range
> in size from a simple int to a string that holds a UUID. There's a date
> field, and then three text fields that typically hold in the range of 350
> to 500 chars.
> Documents should be available for searching within 30 seconds of being
> added.
> We need an average search latency of 50 ms or faster.

Bottom line here -- there is no easy answer.

https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

> We've been using DataStax Enterprise with decent results, but trying to
> determine if we can get more out of the latest version of Solr Cloud, as we
> originally chose DSE ~4 years ago *I believe* because its Cassandra-backed
> Solr provided redundancy/high availability features that weren't currently
> available with straight Solr (not even sure if Solr Cloud was available
> then).

SolrCloud became officially available in a stable release in Solr 4.0.0,
October 2012.  It was available in 4.0.0-ALPHA and 4.0.0-BETA some
months before that.  I'm reasonably certain that Solr had cloud before
DSE did.  I don't know which Solr release DSE is incorporating, but it's
likely a 4.x version.

A great many bugs were found and fixed over the next couple of years as
subsequent 4.x releases became available.  Solr 5.x and 6.x continued
the stability evolution.  Version 6.2 is the newest release, just
announced yesterday.

> We have 24 fairly beefy servers (96 CPU cores, 256 GB RAM, SSDs) for the
> task, and I'm trying to figure out the best way to distribute the documents
> into collections, cores, and shards.
>
> If I can categorize a document into one of 8 "types", should I create 8
> collections? Is that going to provide better performance than putting them
> all into one collection and then using a filter query with the type field
> when doing a search?
>
> What are the options/things to consider when deciding on the number of
> shards for each collection? As far as I know, I don't choose the number of
> Solr cores, that is just determined based on the replication factor (and
> shard count?).

The only answer I can give you, as mentioned by the blog post above, is
"It Depends."  How many documents are going to be in the index?  Can you
project how large the index directory would be if you indexed all those
documents into one collection with one shard?  With some rough numbers,
I can make a recommendation -- but you need to understand that it would
only be a recommendation, one that could turn out to be wrong once you
make it to production.

> Some of the settings I'm using in my solrconfig that seem important:
> <lockType>${solr.lock.type:native}</lockType>
> <autoCommit>
>   <maxTime>${solr.autoCommit.maxTime:30000}</maxTime>
>   <openSearcher>false</openSearcher>
> </autoCommit>
> <autoSoftCommit>
>   <maxTime>${solr.autoSoftCommit.maxTime:1000}</maxTime>
> </autoSoftCommit>
> <useColdSearcher>true</useColdSearcher>
> <maxWarmingSearchers>8</maxWarmingSearchers>

Right away I can tell you that one second latency, which you have
configured in autoSoftCommit, is likely to cause BIG problems for you.
That should be increased to the largest value you can stand --
several minutes would be ideal.  Since you said you need 30 second
latency, setting it to 25 seconds might be the way to go.

Repeating something Emir said:  You should set maxWarmingSearchers back
to 2.  If you increased this because of messages you saw in the log, you
should know that increasing maxWarmingSearchers is likely to make things
worse, not better.  The underlying problem -- commits taking too long --
is what you should address.
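
Putting those two suggestions together, the relevant entries in
solrconfig.xml would look something like this:

<autoSoftCommit>
  <maxTime>${solr.autoSoftCommit.maxTime:25000}</maxTime>
</autoSoftCommit>
<maxWarmingSearchers>2</maxWarmingSearchers>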

> I've got the updateLog/transaction log enabled, as I think I read it's
> required for Solr Cloud.
>
> Are there any settings I should look at that affect performance
> significantly, especially outside of the solrconfig.xml for each collection
> (like jetty configs, logging properties, etc)?

Hard to say without a ton more information -- mostly about the index
size, which I already asked about.

> How much impact do the <lib/> directives in the solrconfig have on
> performance? Do they only take effect if I have something configured that
> requires them, and therefore if I'm missing one that I need, I'd get an
> error if it's not defined?

Loading a jar will take a small amount of memory, but if the feature
added by that jar is never used, it is likely to have little visible impact.

My recommendation is to remove all the <lib> directives, and remove
config in your schema and solrconfig.xml that references features you
don't need.  If you find that you DO require a feature that needs one or
more jars, place them into $SOLR_HOME/lib.  They will be loaded once,
and become available to all cores, without any need to configure <lib>
directives.

Thanks,
Shawn


Re: High load, frequent updates, low latency requirement use case

Posted by Erick Erickson <er...@gmail.com>.
Use the SolrJ CloudSolrClient class and use the client.add(doclist) form.
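
A rough sketch of that (the zkHost string, collection name, field names,
and batch size here are all just placeholders):

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BulkIndexer {
    public static void main(String[] args) throws Exception {
        // Connect through ZooKeeper so updates are routed to shard leaders.
        CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181");
        client.setDefaultCollection("mycollection");

        List<SolrInputDocument> batch = new ArrayList<>();
        for (int i = 0; i < 10000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", Integer.toString(i));
            doc.addField("type_s", "typeA");
            doc.addField("body_t", "document body " + i);
            batch.add(doc);
            // Send a few hundred docs per request instead of one at a time.
            if (batch.size() == 250) {
                client.add(batch);  // no explicit commit; autoCommit handles it
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            client.add(batch);  // flush the final partial batch
        }
        client.close();
    }
}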

Best,
Erick

On Thu, Sep 8, 2016 at 8:56 PM, Brent <br...@gmail.com> wrote:
> Emir Arnautovic wrote
>> There should be no problems with ingestion on 24 machines. Assuming one
>> replica (replication factor 2), that is roughly 40 docs/sec/server. Make
>> sure you bulk docs when ingesting.
>
> What is bulking docs, and how do I do it? I'm guessing this is some sort of
> batch loading of documents?
>
> Thanks for the reply.
> -Brent

Re: High load, frequent updates, low latency requirement use case

Posted by Brent <br...@gmail.com>.
Emir Arnautovic wrote
> There should be no problems with ingestion on 24 machines. Assuming one
> replica (replication factor 2), that is roughly 40 docs/sec/server. Make
> sure you bulk docs when ingesting.

What is bulking docs, and how do I do it? I'm guessing this is some sort of
batch loading of documents?

Thanks for the reply.
-Brent




Re: High load, frequent updates, low latency requirement use case

Posted by Emir Arnautovic <em...@sematext.com>.
Hi Brent,
Please see inline comments.

Thanks,
Emir

-- 
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On 26.08.2016 04:51, Brent P wrote:
> I'm trying to set up a Solr Cloud cluster to support a system with the
> following characteristics:
>
> It will be writing documents at a rate of approximately 500 docs/second,
> and running search queries at about the same rate.
> The documents are fairly small, with about 10 fields, most of which range
> in size from a simple int to a string that holds a UUID. There's a date
> field, and then three text fields that typically hold in the range of 350
> to 500 chars.
There should be no problems with ingestion on 24 machines. Assuming one
replica (replication factor 2), that is roughly 40 docs/sec/server. Make
sure you bulk docs when ingesting.
> Documents should be available for searching within 30 seconds of being
> added.
Make sure you don't do explicit commits and use only auto commit.
> We need an average search latency of 50 ms or faster.
What is the number of documents you expect? What type of queries do you 
have? Make sure you use filters wherever possible.
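For example (field names are placeholders), put the invariant part of
the query into a filter query so its result set is cached and reused:

   q=body_t:foo&fq=type_s:typeA

rather than folding it into the main query as
q=body_t:foo AND type_s:typeA.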
>
> We've been using DataStax Enterprise with decent results, but trying to
> determine if we can get more out of the latest version of Solr Cloud, as we
> originally chose DSE ~4 years ago *I believe* because its Cassandra-backed
> Solr provided redundancy/high availability features that weren't currently
> available with straight Solr (not even sure if Solr Cloud was available
> then).
>
> We have 24 fairly beefy servers (96 CPU cores, 256 GB RAM, SSDs) for the
> task, and I'm trying to figure out the best way to distribute the documents
> into collections, cores, and shards.
>
> If I can categorize a document into one of 8 "types", should I create 8
> collections? Is that going to provide better performance than putting them
> all into one collection and then using a filter query with the type field
> when doing a search?
If you don't need to share term frequencies between types and you always
search a single type, I would split into separate collections. Set the
number of shards for each collection according to the number of docs in
it. Alternatively, you could use routing by type, or if a type needs to
be split across more than one shard, composite hash routing
(https://sematext.com/blog/2015/09/29/solrcloud-large-tenants-and-routing/).
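For example, with the default compositeId router the routing prefix goes
in front of the document id (the ids here are just placeholders):

   id = typeA!doc123     all typeA docs hash to the same shard
   id = typeA/2!doc123   composite hash routing: 2 bits of the hash come
                         from the prefix, so typeA docs spread over 1/4
                         of the hash range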
>
> What are the options/things to consider when deciding on the number of
> shards for each collection?
Number of documents and query latency are the main factors in
determining the number of shards. The smaller the shard, the faster each
query, but the more shards queried, the more data there is to merge, so
at some point the benefits of a distributed query are eaten by the
overhead of merging.
>   As far as I know, I don't choose the number of
> Solr cores, that is just determined based on the replication factor (and
> shard count?).
>
> Some of the settings I'm using in my solrconfig that seem important:
> <lockType>${solr.lock.type:native}</lockType>
> <autoCommit>
>    <maxTime>${solr.autoCommit.maxTime:30000}</maxTime>
>    <openSearcher>false</openSearcher>
> </autoCommit>
This is not how soon your changes will be visible (openSearcher=false);
it is how frequently your modifications are flushed to disk.
> <autoSoftCommit>
>    <maxTime>${solr.autoSoftCommit.maxTime:1000}</maxTime>
This is how soon your changes will be visible, and it should be set to
30s (or the highest value you can tolerate, since caches are invalidated
whenever a new searcher is opened).
> </autoSoftCommit>
> <useColdSearcher>true</useColdSearcher>
> <maxWarmingSearchers>8</maxWarmingSearchers>
You can keep these settings, but they hide configuration errors. It is
better to surface those errors and fix the warming configs than to use a
cold searcher and allow a large number of warming searchers.
>
> I've got the updateLog/transaction log enabled, as I think I read it's
> required for Solr Cloud.
>
> Are there any settings I should look at that affect performance
> significantly, especially outside of the solrconfig.xml for each collection
> (like jetty configs, logging properties, etc)?
>
> How much impact do the <lib/> directives in the solrconfig have on
> performance? Do they only take effect if I have something configured that
> requires them, and therefore if I'm missing one that I need, I'd get an
> error if it's not defined?
>
> Any help will be greatly appreciated. Thanks!
> -Brent
>