Posted to solr-user@lucene.apache.org by Sofiya Strochyk <ss...@interlogic.com.ua> on 2018/10/29 08:12:05 UTC

Re: SolrCloud scaling/optimization for high request rate

Hi Walter,

Yes, after some point it gets really slow (before reaching 100% CPU 
usage), so unless G1 or further GC tuning helps, I guess we will have to 
add more replicas or shards.


On 26.10.18 20:57, Walter Underwood wrote:
> The G1 collector should improve 95th percentile performance, because it limits the length of pauses.
>
> With the CMS/ParNew collector, I ran a very large Eden space, 2 GB out of an 8 GB heap. Nearly all of the allocations in Solr have the lifetime of one request, so you don’t want any of those allocations to be promoted to tenured space. Tenured space should be mostly cache evictions and should grow slowly.
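>
> For concreteness, here is roughly what that looks like in solr.in.sh (a sketch of the kind of settings I mean, not a drop-in recommendation; the localhost-style defaults and sizes need to be adjusted to your own heap):
>
>     # CMS/ParNew with a large, fixed-size Eden (similar in spirit to what we run)
>     SOLR_HEAP="8g"
>     GC_TUNE="-XX:NewSize=2g -XX:MaxNewSize=2g \
>       -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
>       -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly"
>
>     # Or, to try G1 with a pause-time target instead:
>     # GC_TUNE="-XX:+UseG1GC -XX:MaxGCPauseMillis=250"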
>
> For our clusters, when we hit 70% CPU, we add more CPUs. If we drive Solr much harder than that, it goes into congestion collapse. That is totally expected. When you use all of a resource, things get slow. Request more than all of a resource and things get very, very slow.
>
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>> On Oct 26, 2018, at 10:21 AM, Sofiya Strochyk <ss...@interlogic.com.ua> wrote:
>>
>> Thanks Erick,
>>
>> 1. We already use Solr 7.5; we upgraded some of our nodes only recently to see if this eliminates the difference in performance (it doesn't, but I'll test and see if the situation with replica syncing/recovery has improved since then).
>> 2. Yes, we only open a searcher once every 30 minutes, so it is not an NRT case. But the guide <https://lucene.apache.org/solr/guide/7_1/shards-and-indexing-data-in-solrcloud.html#combining-replica-types-in-a-cluster> only recommends certain combinations of NRT/TLOG/TLOG+PULL replica types (currently we have all NRT replicas); would you suggest we change leaders to TLOG and the other replicas to PULL (roughly as in the sketch below this list)? And wouldn't that eliminate the redundancy provided by replication, since PULL replicas can't become leaders?
>> 3. Yes, but then it would be reflected in the iowait metric, which is almost always near zero on our servers. Is there anything else Solr could be waiting for, and is there a way to check it? If we are going to need even more servers for faster responses and faceting, then there must be a way to know which resource we should get more of.
>> 5. Yes, docValues are enabled for the fields we sort on (except score, which is an internal field); _version_ is left at the default, I think (type="long" indexed="false" stored="false", and it's also marked as having docValues in the admin UI).
>> 6. QPS and response time seem to be about the same with and without indexing; server load also looks about the same, so I assume indexing doesn't take up a lot of resources (a little strange, but possible if it is limited by the network or something else from point 3).
>>
>> 7. Will try using G1 if nothing else helps... I haven't tested it yet because it is considered unsafe and I'd like to have all other options exhausted first. (And even then it is probably going to be a minor improvement? How much more efficient could it possibly be?)
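>>
>> For point 2, this is roughly the kind of setup I have in mind (just a sketch, untested on our side; "products" is a placeholder collection name and localhost:8983 is the default address):
>>
>>     # Create a collection with one TLOG replica per shard (can become leader,
>>     # does the indexing) and two PULL replicas (query-only, old-style replication):
>>     curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=products&numShards=4&nrtReplicas=0&tlogReplicas=1&pullReplicas=2&maxShardsPerNode=3"
>>
>>     # Prefer PULL replicas for search traffic (newer 7.x releases):
>>     curl "http://localhost:8983/solr/products/select?q=test&shards.preference=replica.type:PULL"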
>>
>> On 26.10.18 19:18, Erick Erickson wrote:
>>> Some ideas:
>>>
>>> 1> What version of Solr? Solr 7.3 completely rewrote Leader Initiated
>>> Recovery, and 7.5 has other improvements for recovery; we're hoping
>>> that the recovery situation is much improved.
>>>
>>> 2> In the 7x code line, there are TLOG and PULL replicas. As of 7.5,
>>> you can set things up so that queries are served by replica type, see:
>>> https://issues.apache.org/jira/browse/SOLR-11982. This might help you
>>> out. It moves all the indexing to the leader and reserves the rest
>>> of the nodes for queries only, using old-style replication. I'm
>>> assuming from your commit rate that latency between when updates
>>> happen and when they become searchable isn't a big concern.
>>>
>>> 3> Just because the CPU isn't at 100% doesn't mean Solr is running flat
>>> out. There are I/O waits while sub-requests are serviced, and the like.
>>>
>>> 4> As for how to add faceting without slowing down querying, there's
>>> no way. Extra work is extra work. Depending on _what_ you're faceting
>>> on, you may be able to do some tricks, but without details it's hard
>>> to say. You need to get the query rate target first though ;)
>>>
>>> 5> OOMs. Hmm, you say you're doing complex sorts; are all the fields
>>> involved in sorting docValues=true? They have to be in order to be used
>>> in function queries of course, but what about any fields that aren't?
>>> What about your _version_ field? (See the Schema API sketch at the end
>>> of this list for a quick way to check and fix a field.)
>>>
>>> 6>  bq. "...indexed 2 times/day, as fast as the SOLR allows..." One
>>> experiment I'd run is to test your QPS rate when there was _no_
>>> indexing going on. That would give you a hint as to whether the
>>> TLOG/PULL configuration would be helpful. There's been talk of
>>> separate thread pools for indexing and querying to give queries a
>>> better shot at the CPU, but that's not in place yet.
>>>
>>> 7> G1GC may also help rather than CMS, but as you're well aware GC
>>> tuning "is more art than science" ;).
>>>
>>> Good luck!
>>> Erick
>>>
>>> On Fri, Oct 26, 2018 at 8:55 AM Sofiya Strochyk <ss...@interlogic.com.ua> wrote:
>>>> Hi everyone,
>>>>
>>>> We have a SolrCloud setup with the following configuration:
>>>>
>>>> 4 nodes (3x128GB RAM Intel Xeon E5-1650v2, 1x64GB RAM Intel Xeon E5-1650v2, 12 cores, with SSDs)
>>>> One collection, 4 shards, each has only a single replica (so 4 replicas in total), using compositeId router
>>>> Total index size is about 150M documents/320GB, so about 40M/80GB per node
>>>> Zookeeper is on a separate server
>>>> Documents consist of about 20 fields (most of them both stored and indexed); average document size is about 2kB
>>>> Queries are mostly 2-3 words in the q field, with 2 fq parameters and a complex sort expression (containing IF functions)
>>>> We don't use faceting for performance reasons but need to add it in the future
>>>> The majority of the documents are reindexed 2 times/day, as fast as Solr allows, in batches of 1000-10000 docs. Some of the documents are also deleted (by id, not by query)
>>>> autoCommit is set to a maxTime of 1 minute with openSearcher=false, and autoSoftCommit maxTime is 30 minutes with openSearcher=true (see the sketch after this list). Commits from clients are ignored.
>>>> Heap size is set to 8GB.
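>>>>
>>>> For reference, the commit settings above correspond roughly to the following, expressed here via the Config API (just a sketch; "ourcollection" is a placeholder and these values normally live in solrconfig.xml):
>>>>
>>>>     curl "http://localhost:8983/solr/ourcollection/config" -H 'Content-type:application/json' -d '{
>>>>       "set-property": {
>>>>         "updateHandler.autoCommit.maxTime": 60000,
>>>>         "updateHandler.autoCommit.openSearcher": false,
>>>>         "updateHandler.autoSoftCommit.maxTime": 1800000
>>>>       }
>>>>     }'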
>>>>
>>>> Target query rate is up to 500 qps, maybe 300, and we need to keep response time at <200ms. But at the moment we only see very good search performance with up to 100 requests per second. Whenever it grows to about 200, average response time abruptly increases to 0.5-1 second. (Also, it seems that the request rate reported by Solr in the admin metrics is 2x higher than the real one, because for every query each shard receives 2 requests: one to obtain IDs and a second one to get data by IDs; so the target rate in Solr metrics would be 1000 qps.)
>>>>
>>>> During high request load, CPU usage increases dramatically on the Solr nodes. It doesn't reach 100% but averages 50-70% on 3 servers and about 93% on 1 server (a random server each time, not the smallest one).
>>>>
>>>> The documentation mentions replication to spread the load between the servers, so we tested replicating to smaller servers (32GB RAM, Intel Core i7-4770). However, the replicas were going out of sync all the time (possibly during commits) and reported errors like "PeerSync Recovery was not successful - trying replication." They would then proceed with full replication, which takes hours, and the leader handles all requests single-handedly during that time. Also, both leaders and replicas started encountering OOM errors (heap space) for no clear reason. Heap dump analysis shows that most of the memory is consumed by the [J type (array of long); my best guess is that this is the "_version_" field, but it's still unclear why it happens. Finally, even though with replication the per-node request rate and CPU usage drop by half, this doesn't seem to improve the mean_ms, stddev_ms or p95_ms numbers (p75_ms is much smaller on nodes with replication, but still not as low as under a load of <100 requests/s).
>>>>
>>>> Garbage collection is much more active during high load as well; full GCs happen almost exclusively during those times. We have tried tuning the GC options as suggested here, but it didn't change things.
>>>>
>>>> My questions are:
>>>>
>>>> How do we increase throughput? Is replication the only solution?
>>>> If yes, then why doesn't it affect response times, considering that the CPU is not 100% used and the index fits into memory?
>>>> How to deal with OOMs and replicas going into recovery?
>>>> Is memory or CPU the main problem? (When searching the internet, I never see CPU as the main bottleneck for Solr, but our case might be different)
>>>> Or do we need smaller shards? Could segment merging be a problem?
>>>> How to add faceting without slowing down search queries too much?
>>>> How to diagnose these problems and narrow them down to the real cause in hardware or setup?
>>>>
>>>> Any help would be much appreciated.
>>>>
>>>> Thanks!
>>>>
>>>> --
>>>> Sofiia Strochyk
>>>>
>>>>
>>>>
>>>> ss@interlogic.com.ua
>>>>
>>>> www.interlogic.com.ua
>>>>
>>>>
>> -- 
>> Sofiia Strochyk
>>
>>
>>
>> ss@interlogic.com.ua
>> www.interlogic.com.ua
>>

-- 
Sofiia Strochyk

ss@interlogic.com.ua
InterLogic
www.interlogic.com.ua
