Posted to solr-user@lucene.apache.org by Sofiya Strochyk <ss...@interlogic.com.ua> on 2018/10/26 15:55:09 UTC
SolrCloud scaling/optimization for high request rate
Hi everyone,
We have a SolrCloud setup with the following configuration:
* 4 nodes (3x128GB RAM Intel Xeon E5-1650v2, 1x64GB RAM Intel Xeon
E5-1650v2, 12 cores, with SSDs)
* One collection, 4 shards, each has only a single replica (so 4
replicas in total), using compositeId router
* Total index size is about 150M documents/320GB, so about 40M/80GB
per node
* Zookeeper is on a separate server
* Documents consist of about 20 fields (most of them both stored
and indexed); average document size is about 2kB
* Queries are mostly 2-3 words in the q field, with 2 fq parameters,
with complex sort expression (containing IF functions)
* We don't use faceting for performance reasons but will need to add it
in the future
* The majority of documents are reindexed 2 times/day, as fast as
Solr allows, in batches of 1000-10000 docs. Some documents are
also deleted (by id, not by query)
* autoCommit is set to maxTime of 1 minute with openSearcher=false and
autoSoftCommit maxTime is 30 minutes with openSearcher=true. Commits
from clients are ignored.
* Heap size is set to 8GB.
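For reference, the commit settings above would look roughly like this in solrconfig.xml. This is a sketch: the actual chain used in the cluster is not shown in the thread, so the processor chain below is an assumption illustrating one common way to ignore client commits (the stock IgnoreCommitOptimizeUpdateProcessorFactory).

```xml
<!-- Sketch of the commit policy described above (solrconfig.xml). -->
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxTime>60000</maxTime>          <!-- hard commit every 1 minute -->
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <maxTime>1800000</maxTime>        <!-- soft commit (new searcher) every 30 minutes -->
  </autoSoftCommit>
</updateHandler>

<!-- One way to ignore commits sent by clients (assumption, not confirmed
     by the thread): short-circuit them with a 200 status. -->
<updateRequestProcessorChain name="ignore-commit-from-client" default="true">
  <processor class="solr.IgnoreCommitOptimizeUpdateProcessorFactory">
    <int name="statusCode">200</int>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.DistributedUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```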
Target query rate is up to 500 qps (at least 300), and we need to keep
response time under 200ms. At the moment we only see good search
performance at up to 100 requests per second; whenever the rate grows to
about 200, average response time abruptly increases to 0.5-1 second.
(Also, the request rate reported by Solr in the admin metrics seems to be
2x the real one, because for every query each shard receives 2 requests:
one to obtain IDs and a second to fetch data by those IDs; so the target
rate in Solr metrics would be 1000 qps.)
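As a quick sketch, the arithmetic behind that 2x observation works out as follows (numbers taken from the setup above; illustration only):

```python
# Rough arithmetic behind the "2x" request rate seen in Solr's metrics.
# Numbers come from the setup described above; illustration only.

CLIENT_QPS = 500   # target external query rate
SHARDS = 4
PHASES = 2         # phase 1: collect IDs + sort values; phase 2: fetch stored fields

# Each client query sends (up to) one request per shard per phase:
shard_requests_per_query = SHARDS * PHASES                     # 8
cluster_internal_qps = CLIENT_QPS * shard_requests_per_query   # 4000 across the cluster
per_shard_qps = CLIENT_QPS * PHASES                            # 1000 seen by each shard

print(shard_requests_per_query, cluster_internal_qps, per_shard_qps)  # 8 4000 1000
```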
During high request load, CPU usage increases dramatically on the Solr
nodes. It doesn't reach 100%, but averages 50-70% on 3 servers and
about 93% on 1 server (a random server each time, not the smallest one).
The documentation mentions replication to spread load between servers.
We tested replicating to smaller servers (32GB RAM, Intel Core i7-4770),
but the replicas kept going out of sync (possibly during commits) and
reported errors like "PeerSync Recovery was not successful - trying
replication." They then proceeded with full replication, which takes
hours, during which the leader handles all requests single-handedly.
Both leaders and replicas also started hitting OOM errors (heap space)
for an unknown reason. Heap dump analysis shows that most of the memory
is consumed by the [J (array of long) type; my best guess is that it is
the "_version_" field, but it's still unclear why this happens. Also,
even though replication halves the request rate and CPU usage per node,
it doesn't seem to affect the mean_ms, stddev_ms or p95_ms numbers
(p75_ms is much smaller on nodes with replication, but still not as low
as under a load of <100 requests/s).
Garbage collection is also much more active during high load; full GC
happens almost exclusively at those times. We have tried tuning GC
options as suggested here
<https://wiki.apache.org/solr/ShawnHeisey#CMS_.28ConcurrentMarkSweep.29_Collector>,
but it didn't change things.
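For context, the CMS settings on that wiki page amount to a GC_TUNE along these lines in solr.in.sh. This is a partial sketch of the wiki's suggested flags, not a recommendation:

```shell
# Partial sketch of the CMS GC settings from the wiki page linked above,
# as they would appear in solr.in.sh.
GC_TUNE="-XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
  -XX:NewRatio=3 -XX:SurvivorRatio=4 \
  -XX:TargetSurvivorRatio=90 -XX:MaxTenuringThreshold=8 \
  -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=50 \
  -XX:+CMSParallelRemarkEnabled -XX:+ParallelRefProcEnabled"
```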
My questions are:
* How do we increase throughput? Is replication the only solution?
* If yes, why doesn't it affect response times, considering that the
CPU is not 100% used and the index fits into memory?
* How do we deal with OOM and replicas going into recovery?
* Is memory or CPU the main problem? (When searching the internet, I
never see CPU named as the main bottleneck for Solr, but our case
might be different.)
* Or do we need smaller shards? Could segment merging be a problem?
* How do we add faceting without search queries slowing down too much?
* How do we diagnose these problems and narrow them down to the real
cause in hardware or setup?
Any help would be much appreciated.
Thanks!
--
Sofiia Strochyk
ss@interlogic.com.ua
InterLogic
www.interlogic.com.ua <https://www.interlogic.com.ua>
Facebook <https://www.facebook.com/InterLogicOfficial> | LinkedIn
<https://www.linkedin.com/company/interlogic>
Re: Re: SolrCloud scaling/optimization for high request rate
Posted by Sofiya Strochyk <ss...@interlogic.com.ua>.
Hi Ere,
Thanks for your advice! I'm aware of the performance problems with deep
paging, but unfortunately that is not the case here: the rows number is
always 24, and subsequent pages are hardly ever requested, from what I
see in the logs.
On 29.10.18 11:19, Ere Maijala wrote:
> Hi Sofiya,
> [...]
> Sofiya Strochyk kirjoitti 26.10.2018 klo 18.55:
>> [...]
Re: SolrCloud scaling/optimization for high request rate
Posted by Ere Maijala <er...@helsinki.fi>.
Hi Sofiya,
You've already received a lot of ideas, but I think this wasn't yet
mentioned: You didn't specify the number of rows your queries fetch or
whether you're using deep paging in the queries. Both can be real
performance killers in a sharded index because a large set of records
has to be fetched from all shards. This consumes a relatively high
amount of memory, and even if the servers are able to handle a certain
number of these queries simultaneously, you'd run into garbage
collection trouble with more queries being served. So just one more
thing to be aware of!
Regards,
Ere
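The cost Ere describes can be sketched numerically. This is an illustration under the standard two-phase distributed paging model (each shard returns its own top start+rows candidates for the coordinator to merge); the page numbers are hypothetical, rows=24 matches the setup in this thread:

```python
# Why deep paging is expensive on a sharded index: to build global page N,
# every shard must return its own top (start + rows) hits, and the
# coordinator merges all of them.

def merged_candidates(shards: int, start: int, rows: int) -> int:
    # Candidates the coordinator has to merge for one request.
    return shards * (start + rows)

print(merged_candidates(4, 0, 24))         # page 1:    96 candidates merged
print(merged_candidates(4, 999 * 24, 24))  # page 1000: 96000 candidates merged
```

The per-request memory and merge work grows linearly with the page depth, which is why cursorMark-style paging is usually preferred for deep result sets.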
Sofiya Strochyk kirjoitti 26.10.2018 klo 18.55:
> [...]
--
Ere Maijala
Kansalliskirjasto / The National Library of Finland
Re: Re: SolrCloud scaling/optimization for high request rate
Posted by Sofiya Strochyk <ss...@interlogic.com.ua>.
Hi Walter,
yes, after some point it gets really slow (before reaching 100% CPU
usage), so unless G1 or further tuning helps, I guess we will have to
add more replicas or shards.
On 26.10.18 20:57, Walter Underwood wrote:
> [...]
>> On Oct 26, 2018, at 10:21 AM, Sofiya Strochyk <ss...@interlogic.com.ua> wrote:
>>
>> Thanks Erick,
>>
>> 1. We already use Solr 7.5, upgraded some of our nodes only recently to see if this eliminates the difference in performance (it doesn't, but I'll test and see if the situation with replicas syncing/recovery has improved since then)
>> 2. Yes, we only open searcher once every 30 minutes so it is not an NRT case. But it is only recommended <https://lucene.apache.org/solr/guide/7_1/shards-and-indexing-data-in-solrcloud.html#combining-replica-types-in-a-cluster> to use NRT/TLOG/TLOG+PULL replica types together (currently we have all NRT replicas), would you suggest we change leaders to TLOG and slaves to PULL? And this would also eliminate the redundancy provided by replication because PULL replicas can't become leaders, right?
>> 3. Yes but then it would be reflected in iowait metric, which is almost always near zero on our servers. Is there anything else Solr could be waiting for, and is there a way to check it? If we are going to need even more servers for faster response and faceting then there must be a way to know which resource we should get more of.
>> 5. Yes, docValues are enabled for the fields we sort on (except score which is an internal field); _version_ is left at default i think (type="long" indexed="false" stored="false", and it's also marked as having DocValues in the admin UI)
>> 6. QPS and response time seem to be about the same with and without indexing; server load also looks about the same so i assume indexing doesn't take up a lot of resources (a little strange, but possible if it is limited by network or some other things from point 3).
>>
>> 7. Will try using G1 if nothing else helps... Haven't tested it yet because it is considered unsafe and i'd like to have all other options exhausted first. (And even then it is probably going to be a minor improvement? How much more efficient could it possibly be?)
>>
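Regarding the TLOG/PULL question in point 2 above: replica types can be set when a collection is created (or added later with ADDREPLICA). A sketch using the Collections API; the collection and config names here are hypothetical. Note that PULL replicas indeed cannot become leaders, so leader redundancy would come from having more than one TLOG replica per shard:

```shell
# Hypothetical example: 4 shards, each with 2 TLOG replicas (leader-eligible)
# and 1 PULL replica (search-only, syncs via old-style replication).
curl "http://localhost:8983/solr/admin/collections?action=CREATE\
&name=mycollection\
&numShards=4\
&tlogReplicas=2\
&pullReplicas=1\
&collection.configName=myconfig"

# Prefer PULL replicas for queries (shards.preference, from SOLR-11982):
curl "http://localhost:8983/solr/mycollection/select?q=foo&shards.preference=replica.type:PULL"
```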
>> On 26.10.18 19:18, Erick Erickson wrote:
>>> Some ideas:
>>>
>>> 1> What version of Solr? Solr 7.3 completely re-wrote Leader Initiated
>>> Recovery and 7.5 has other improvements for recovery, we're hoping
>>> that the recovery situation is much improved.
>>>
>>> 2> In the 7x code line, there are TLOG and PULL replicas. As of 7.5,
>>> you can set up so the queries are served by replica type, see:
>>> https://issues.apache.org/jira/browse/SOLR-11982 <https://issues.apache.org/jira/browse/SOLR-11982>. This might help you
>>> out. This moves all the indexing to the leader and reserves the rest
>>> of the nodes for queries only, using old-style replication. I'm
>>> assuming from your commit rate that latency between when updates
>>> happen and the updates are searchable isn't a big concern.
>>>
>>> 3> Just because the CPU isn't 100% doesn't mean Solr is running flat
>>> out. There's I/O waits while sub-requests are serviced and the like.
>>>
>>> 4> As for how to add faceting without slowing down querying, there's
>>> no way. Extra work is extra work. Depending on _what_ you're faceting
>>> on, you may be able to do some tricks, but without details it's hard
>>> to say. You need to get the query rate target first though ;)
>>>
>>> 5> OOMs Hmm, you say you're doing complex sorts, are all fields
>>> involved in sorts docValues=true? They have to be to be used in
>>> function queries of course, but what about any fields that aren't?
>>> What about your _version_ field?
>>>
>>> 6> bq. "...indexed 2 times/day, as fast as the SOLR allows..." One
>>> experiment I'd run is to test your QPS rate when there was _no_
>>> indexing going on. That would give you a hint as to whether the
>>> TLOG/PULL configuration would be helpful. There's been talk of
>>> separate thread pools for indexing and querying to give queries a
>>> better shot at the CPU, but that's not in place yet.
>>>
>>> 7> G1GC may also help rather than CMS, but as you're well aware GC
>>> tuning "is more art than science" ;).
>>>
>>> Good luck!
>>> Erick
>>>
>>> On Fri, Oct 26, 2018 at 8:55 AM Sofiya Strochyk <ss...@interlogic.com.ua> <ma...@interlogic.com.ua> wrote:
>>>> [...]
Re: SolrCloud scaling/optimization for high request rate
Posted by Walter Underwood <wu...@wunderwood.org>.
The G1 collector should improve 95th percentile performance, because it limits the length of pauses.
With the CMS/ParNew collector, I ran very large Eden spaces, 2 GB out of an 8 GB heap. Nearly all of the allocations in Solr have the lifetime of one request, so you don’t want any of those allocations to be promoted to tenured space. Tenured space should be mostly cache evictions and should grow slowly.
For our clusters, when we hit 70% CPU, we add more CPUs. If we drive Solr much harder than that, it goes into congestion collapse. That is totally expected. When you use all of a resource, things get slow. Request more than all of a resource and things get very, very slow.
wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/ (my blog)
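A sketch of what the CMS/ParNew sizing described above could look like in solr.in.sh. The sizes match the numbers quoted (8 GB heap, 2 GB young generation); the exact flag choice beyond that is an assumption:

```shell
# 8 GB heap with a fixed 2 GB young generation (Eden + survivors),
# per the sizing described above. Sketch only, not a recommendation.
SOLR_HEAP="8g"
GC_TUNE="-XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
  -XX:NewSize=2g -XX:MaxNewSize=2g"
```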
> On Oct 26, 2018, at 10:21 AM, Sofiya Strochyk <ss...@interlogic.com.ua> wrote:
>
> Thanks Erick,
>
> 1. We already use Solr 7.5, upgraded some of our nodes only recently to see if this eliminates the difference in performance (it doesn't, but I'll test and see if the situation with replicas syncing/recovery has improved since then)
> 2. Yes, we only open searcher once every 30 minutes so it is not an NRT case. But it is only recommended <https://lucene.apache.org/solr/guide/7_1/shards-and-indexing-data-in-solrcloud.html#combining-replica-types-in-a-cluster> to use NRT/TLOG/TLOG+PULL replica types together (currently we have all NRT replicas), would you suggest we change leaders to TLOG and slaves to PULL? And this would also eliminate the redundancy provided by replication because PULL replicas can't become leaders, right?
> 3. Yes but then it would be reflected in iowait metric, which is almost always near zero on our servers. Is there anything else Solr could be waiting for, and is there a way to check it? If we are going to need even more servers for faster response and faceting then there must be a way to know which resource we should get more of.
> 5. Yes, docValues are enabled for the fields we sort on (except score which is an internal field); _version_ is left at default i think (type="long" indexed="false" stored="false", and it's also marked as having DocValues in the admin UI)
> 6. QPS and response time seem to be about the same with and without indexing; server load also looks about the same so i assume indexing doesn't take up a lot of resources (a little strange, but possible if it is limited by network or some other things from point 3).
>
> 7. Will try using G1 if nothing else helps... Haven't tested it yet because it is considered unsafe and i'd like to have all other options exhausted first. (And even then it is probably going to be a minor improvement? How much more efficient could it possibly be?)
>
> On 26.10.18 19:18, Erick Erickson wrote:
>> Some ideas:
>>
>> 1> What version of Solr? Solr 7.3 completely re-wrote Leader Initiated
>> Recovery and 7.5 has other improvements for recovery, we're hoping
>> that the recovery situation is much improved.
>>
>> 2> In the 7x code line, there are TLOG and PULL replicas. As of 7.5,
>> you can set up so the queries are served by replica type, see:
>> https://issues.apache.org/jira/browse/SOLR-11982 <https://issues.apache.org/jira/browse/SOLR-11982>. This might help you
>> out. This moves all the indexing to the leader and reserves the rest
>> of the nodes for queries only, using old-style replication. I'm
>> assuming from your commit rate that latency between when updates
>> happen and the updates are searchable isn't a big concern.
>>
>> 3> Just because the CPU isn't 100% doesn't mean Solr is running flat
>> out. There's I/O waits while sub-requests are serviced and the like.
>>
>> 4> As for how to add faceting without slowing down querying, there's
>> no way. Extra work is extra work. Depending on _what_ you're faceting
>> on, you may be able to do some tricks, but without details it's hard
>> to say. You need to get the query rate target first though ;)
>>
>> 5> OOMs Hmm, you say you're doing complex sorts, are all fields
>> involved in sorts docValues=true? They have to be to be used in
>> function queries of course, but what about any fields that aren't?
>> What about your _version_ field?
>>
>> 6> bq. "...indexed 2 times/day, as fast as the SOLR allows..." One
>> experiment I'd run is to test your QPS rate when there was _no_
>> indexing going on. That would give you a hint as to whether the
>> TLOG/PULL configuration would be helpful. There's been talk of
>> separate thread pools for indexing and querying to give queries a
>> better shot at the CPU, but that's not in place yet.
>>
>> 7> G1GC may also help rather than CMS, but as you're well aware GC
>> tuning "is more art than science" ;).
>>
>> Good luck!
>> Erick
>>
>> On Fri, Oct 26, 2018 at 8:55 AM Sofiya Strochyk <ss...@interlogic.com.ua> <ma...@interlogic.com.ua> wrote:
>>> Hi everyone,
>>>
>>> We have a SolrCloud setup with the following configuration:
>>>
>>> 4 nodes (3x128GB RAM Intel Xeon E5-1650v2, 1x64GB RAM Intel Xeon E5-1650v2, 12 cores, with SSDs)
>>> One collection, 4 shards, each has only a single replica (so 4 replicas in total), using compositeId router
>>> Total index size is about 150M documents/320GB, so about 40M/80GB per node
>>> Zookeeper is on a separate server
>>> Documents consist of about 20 fields (most of them are both stored and indexed), average document size is about 2kB
>>> Queries are mostly 2-3 words in the q field, with 2 fq parameters, with complex sort expression (containing IF functions)
>>> We don't use faceting due to performance reasons but need to add it in the future
>>> Majority of the documents are reindexed 2 times/day, as fast as the SOLR allows, in batches of 1000-10000 docs. Some of the documents are also deleted (by id, not by query)
>>> autoCommit is set to maxTime of 1 minute with openSearcher=false and autoSoftCommit maxTime is 30 minutes with openSearcher=true. Commits from clients are ignored.
>>> Heap size is set to 8GB.
>>>
>>> Target query rate is up to 500 qps, maybe 300, and we need to keep response time at <200ms. But at the moment we only see very good search performance with up to 100 requests per second. Whenever it grows to about 200, average response time abruptly increases to 0.5-1 second. (Also it seems that request rate reported by SOLR in admin metrics is 2x higher than the real one, because for every query, every shard receives 2 requests: one to obtain IDs and second one to get data by IDs; so target rate for SOLR metrics would be 1000 qps).
>>>
>>> During high request load, CPU usage increases dramatically on the SOLR nodes. It doesn't reach 100% but averages at 50-70% on 3 servers and about 93% on 1 server (random server each time, not the smallest one).
>>>
>>> The documentation mentions replication to spread the load between the servers. We tested replicating to smaller servers (32GB RAM, Intel Core i7-4770). However, when we tested it, the replicas were going out of sync all the time (possibly during commits) and reported errors like "PeerSync Recovery was not successful - trying replication." Then they proceed with replication which takes hours and the leader handles all requests singlehandedly during that time. Also both leaders and replicas started encountering OOM errors (heap space) for unknown reason. Heap dump analysis shows that most of the memory is consumed by [J (array of long) type, my best guess would be that it is "_version_" field, but it's still unclear why it happens. Also, even though with replication request rate and CPU usage drop 2 times, it doesn't seem to affect mean_ms, stddev_ms or p95_ms numbers (p75_ms is much smaller on nodes with replication, but still not as low as under load of <100 requests/s).
>>>
>>> Garbage collection is much more active during high load as well. Full GC happens almost exclusively during those times. We have tried tuning GC options like suggested here and it didn't change things though.
>>>
>>> My questions are
>>>
>>> How do we increase throughput? Is replication the only solution?
>>> if yes - then why doesn't it affect response times, considering that CPU is not 100% used and index fits into memory?
>>> How to deal with OOM and replicas going into recovery?
>>> Is memory or CPU the main problem? (When searching on the internet, i never see CPU as main bottleneck for SOLR, but our case might be different)
>>> Or do we need smaller shards? Could segments merging be a problem?
>>> How to add faceting without search queries slowing down too much?
>>> How to diagnose these problems and narrow down to the real reason in hardware or setup?
>>>
>>> Any help would be much appreciated.
>>>
>>> Thanks!
>>>
>>> --
>>> Sofiia Strochyk
>>>
>>>
>>>
>>> ss@interlogic.com.ua <ma...@interlogic.com.ua>
>>>
>>> www.interlogic.com.ua <http://www.interlogic.com.ua/>
>>>
>>>
Re: SolrCloud scaling/optimization for high request rate
Posted by Sofiya Strochyk <ss...@interlogic.com.ua>.
Erick,
thanks, I've been pulling my hair out over this for a long time and have
gathered a lot of information :)
Isn't there a maxIndexingThreads setting in solrconfig with a default
value of about 8? It's not clear whether my updates are being executed
in parallel, but I would expect them to use at least a few threads.
In the past, we hosted 2 shards on one of the bigger nodes for some
time; this resulted in high load on that node and slow requests from
those 2 shards (though not much worse than now, with only 1 shard per
node), so the nodes might be too small to handle 2 or more replicas.
Anyway, thanks for your help. I'll try profiling and looking into the
metrics to see if there are any pointers to what is consuming CPU.
On 27.10.18 05:52, Erick Erickson wrote:
> Sofiya:
>
> I haven't said so before, but it's a great pleasure to work with
> someone who's done a lot of homework before pinging the list. The only
> unfortunate bit is that it usually means the simple "Oh, I can fix
> that without thinking about it much" doesn't work ;)
>
> 2. I'll clarify a bit here. Any TLOG replica can become the leader.
> Here's the process for an update:
>> doc comes in to the leader (may be TLOG)
>> doc is forwarded to all TLOG replicas, _but it is not indexed there_.
>> If the leader fails, the other TLOG replicas have enough documents in _their_ tlogs to "catch up" and one is elected
>> You're totally right that PULL replicas cannot become leaders
>> having all TLOG replicas means that the CPU cycles otherwise consumed by indexing are available for query processing.
> The point here is that TLOG replicas don't need to expend CPU cycles
> to index documents, freeing up all those cycles for serving queries.
>
> Now, that said you report that QPS rate doesn't particularly seem to
> be affected by whether you're indexing or not, so that makes using
> TLOG and PULL replicas less likely to solve your problem. I was
> thinking about your statement that you index as fast as possible....
>
>
> 6. This is a little surprising. Here's my guess: You're indexing in
> large batches and the batch is only really occupying a thread or two
> so it's effectively serialized thus not consuming a huge amount of
> resources.
>
> So unless G1 really solves a lot of problems, more replicas are
> indicated. On machines with large amounts of RAM and lots of CPUs, one
> other option is to run multiple JVMs per physical node that's
> sometimes helpful.
>
> One other possibility. In Solr 7.5, you have a ton of metrics
> available. If you hit the admin/metrics end point you'll see 150-200
> available metrics. Apart from running a profiler to see what's
> consuming the most cycles, the metrics can give you a view into what
> Solr is doing and may help you pinpoint what's using the most cycles.
>
> Best,
> Erick
> On Fri, Oct 26, 2018 at 12:23 PM Toke Eskildsen <to...@kb.dk> wrote:
>> David Hastings <ha...@gmail.com> wrote:
>>> Would adding the docValues in the schema, but not reindexing, cause
>>> errors? IE, only apply the doc values after the next reindex, but in the
>>> meantime keep functioning as there were none until then?
>> As soon as you specify in the schema that a field has docValues=true, Solr treats all existing documents as having docValues enabled for that field. As there is no docValue content, DocValues-aware functionality such as sorting and faceting will not work for that field until the documents have been re-indexed.
>>
>> - Toke Eskildsen
Re: SolrCloud scaling/optimization for high request rate
Posted by Sofiya Strochyk <ss...@interlogic.com.ua>.
Hi Shalin,
these are the stats for the caches in use:
*documentCache*
class:org.apache.solr.search.LRUCache
description:LRU Cache(maxSize=128, initialSize=128)
stats:
CACHE.searcher.documentCache.cumulative_evictions:234923643
CACHE.searcher.documentCache.cumulative_hitratio:0
CACHE.searcher.documentCache.cumulative_hits:876732
CACHE.searcher.documentCache.cumulative_inserts:234997370
CACHE.searcher.documentCache.cumulative_lookups:235874102
CACHE.searcher.documentCache.evictions:2123759
CACHE.searcher.documentCache.hitratio:0
CACHE.searcher.documentCache.hits:3209
CACHE.searcher.documentCache.inserts:2124008
CACHE.searcher.documentCache.lookups:2127217
CACHE.searcher.documentCache.size:128
CACHE.searcher.documentCache.warmupTime:0
*filterCache*
class:org.apache.solr.search.FastLRUCache
description:Concurrent LRU Cache(maxSize=128, initialSize=128,
minSize=115, acceptableSize=121, cleanupThread=false, autowarmCount=64,
regenerator=org.apache.solr.search.SolrIndexSearcher$2@418597da)
stats:
CACHE.searcher.filterCache.cumulative_evictions:33848
CACHE.searcher.filterCache.cumulative_hitratio:1
CACHE.searcher.filterCache.cumulative_hits:79684607
CACHE.searcher.filterCache.cumulative_inserts:44931
CACHE.searcher.filterCache.cumulative_lookups:79729534
CACHE.searcher.filterCache.evictions:59
CACHE.searcher.filterCache.hitratio:1
CACHE.searcher.filterCache.hits:708519
CACHE.searcher.filterCache.inserts:118
CACHE.searcher.filterCache.lookups:708637
CACHE.searcher.filterCache.size:123
CACHE.searcher.filterCache.warmupTime:52330
*queryResultCache*
class:org.apache.solr.search.LRUCache
description:LRU Cache(maxSize=256, initialSize=256, autowarmCount=64,
regenerator=org.apache.solr.search.SolrIndexSearcher$3@124adf10)
stats:
CACHE.searcher.queryResultCache.cumulative_evictions:38463897
CACHE.searcher.queryResultCache.cumulative_hitratio:0.01
CACHE.searcher.queryResultCache.cumulative_hits:271649
CACHE.searcher.queryResultCache.cumulative_inserts:38638030
CACHE.searcher.queryResultCache.cumulative_lookups:38909216
CACHE.searcher.queryResultCache.evictions:351561
CACHE.searcher.queryResultCache.hitratio:0.01
CACHE.searcher.queryResultCache.hits:1952
CACHE.searcher.queryResultCache.inserts:353004
CACHE.searcher.queryResultCache.lookups:354874
CACHE.searcher.queryResultCache.size:256
CACHE.searcher.queryResultCache.warmupTime:11724
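Recomputing the cumulative hit ratios from the raw counters above makes the pattern explicit: the documentCache and queryResultCache almost never hit, while the filterCache hits nearly every lookup. All values are taken from the stats dump:

```python
# Cumulative hit ratios derived from the counters posted above.
stats = {
    "documentCache":    {"hits": 876_732,    "lookups": 235_874_102},
    "filterCache":      {"hits": 79_684_607, "lookups": 79_729_534},
    "queryResultCache": {"hits": 271_649,    "lookups": 38_909_216},
}
for name, s in stats.items():
    print(f"{name}: {s['hits'] / s['lookups']:.4f}")
# documentCache: 0.0037, filterCache: 0.9994, queryResultCache: 0.0070
```

With maxSize=128 against 235M cumulative lookups, the documentCache is effectively churning (235M evictions), which is consistent with the near-zero hit ratio.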
On 29.10.18 09:51, Shalin Shekhar Mangar wrote:
> What does your cache statistics look like? What's the hit ratio, size,
> evictions etc?
>
> More comments inline:
>
> On Sat, Oct 27, 2018 at 8:23 AM Erick Erickson <er...@gmail.com>
> wrote:
>
>> Sofiya:
>>
>> I haven't said so before, but it's a great pleasure to work with
>> someone who's done a lot of homework before pinging the list. The only
>> unfortunate bit is that it usually means the simple "Oh, I can fix
>> that without thinking about it much" doesn't work ;)
>>
>> 2. I'll clarify a bit here. Any TLOG replica can become the leader.
>> Here's the process for an update:
>>> doc comes in to the leader (may be TLOG)
>>> doc is forwarded to all TLOG replicas, _but it is not indexed there_.
>>> If the leader fails, the other TLOG replicas have enough documents in
>> _their_ tlogs to "catch up" and one is elected
>>> You're totally right that PULL replicas cannot become leaders
>>> having all TLOG replicas means that the CPU cycles otherwise consumed by
>> indexing are available for query processing.
>>
>> The point here is that TLOG replicas don't need to expend CPU cycles
>> to index documents, freeing up all those cycles for serving queries.
>>
>> Now, that said you report that QPS rate doesn't particularly seem to
>> be affected by whether you're indexing or not, so that makes using
>> TLOG and PULL replicas less likely to solve your problem. I was
>> thinking about your statement that you index as fast as possible....
>>
>>
>> 6. This is a little surprising. Here's my guess: You're indexing in
>> large batches and the batch is only really occupying a thread or two
>> so it's effectively serialized thus not consuming a huge amount of
>> resources.
>>
> The CloudSolrClient parallelizes updates to each shard leader. But in this
> case, there is only 1 shard so all updates are serialized. All indexing
> activity is therefore being performed by a single CPU at a time.
>
>
>> So unless G1 really solves a lot of problems, more replicas are
>> indicated. On machines with large amounts of RAM and lots of CPUs, one
>> other option is to run multiple JVMs per physical node that's
>> sometimes helpful.
>>
>> One other possibility. In Solr 7.5, you have a ton of metrics
>> available. If you hit the admin/metrics end point you'll see 150-200
>> available metrics. Apart from running a profiler to see what's
>> consuming the most cycles, the metrics can give you a view into what
>> Solr is doing and may help you pinpoint what's using the most cycles.
>>
>> Best,
>> Erick
>> On Fri, Oct 26, 2018 at 12:23 PM Toke Eskildsen <to...@kb.dk> wrote:
>>> David Hastings <ha...@gmail.com> wrote:
>>>> Would adding the docValues in the schema, but not reindexing, cause
>>>> errors? IE, only apply the doc values after the next reindex, but in
>> the
>>>> meantime keep functioning as there were none until then?
>>> As soon as you specify in the schema that a field has docValues=true,
>> Solr treats all existing documents as having docValues enabled for that
>> field. As there is no docValue content, DocValues-aware functionality such
>> as sorting and faceting will not work for that field, until the documents
>> have been re-indexed.
>>> - Toke Eskildsen
>
Re: SolrCloud scaling/optimization for high request rate
Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
What does your cache statistics look like? What's the hit ratio, size,
evictions etc?
More comments inline:
On Sat, Oct 27, 2018 at 8:23 AM Erick Erickson <er...@gmail.com>
wrote:
> Sofiya:
>
> I haven't said so before, but it's a great pleasure to work with
> someone who's done a lot of homework before pinging the list. The only
> unfortunate bit is that it usually means the simple "Oh, I can fix
> that without thinking about it much" doesn't work ;)
>
> 2. I'll clarify a bit here. Any TLOG replica can become the leader.
> Here's the process for an update:
> > doc comes in to the leader (may be TLOG)
> > doc is forwarded to all TLOG replicas, _but it is not indexed there_.
> > If the leader fails, the other TLOG replicas have enough documents in
> _their_ tlogs to "catch up" and one is elected
> > You're totally right that PULL replicas cannot become leaders
> > having all TLOG replicas means that the CPU cycles otherwise consumed by
> indexing are available for query processing.
>
> The point here is that TLOG replicas don't need to expend CPU cycles
> to index documents, freeing up all those cycles for serving queries.
>
> Now, that said you report that QPS rate doesn't particularly seem to
> be affected by whether you're indexing or not, so that makes using
> TLOG and PULL replicas less likely to solve your problem. I was
> thinking about your statement that you index as fast as possible....
>
>
> 6. This is a little surprising. Here's my guess: You're indexing in
> large batches and the batch is only really occupying a thread or two
> so it's effectively serialized thus not consuming a huge amount of
> resources.
>
The CloudSolrClient parallelizes updates to each shard leader. But in this
case, there is only 1 shard so all updates are serialized. All indexing
activity is therefore being performed by a single CPU at a time.
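If client-side serialization is indeed the bottleneck, one workaround is to send several batches concurrently from the indexing client. A sketch of the idea, with a stand-in sender (`send_batch` is hypothetical, not a real Solr client API):

```python
# Sketch: parallelize the bulk reindex by posting batches concurrently
# instead of one at a time. `send_batch` is a placeholder for whatever
# call posts a batch of docs to the collection's /update handler.
from concurrent.futures import ThreadPoolExecutor

def send_batch(batch):
    # placeholder for an HTTP POST of `batch` to Solr; returns docs sent
    return len(batch)

def index_parallel(docs, batch_size=1000, workers=4):
    batches = [docs[i:i + batch_size] for i in range(0, len(docs), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(send_batch, batches))

print(index_parallel([{"id": n} for n in range(2500)]))  # 2500
```

With the 1000-10000 doc batches described earlier, a handful of workers is usually enough to keep more than one indexing thread busy on the server side.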
>
> So unless G1 really solves a lot of problems, more replicas are
> indicated. On machines with large amounts of RAM and lots of CPUs, one
> other option is to run multiple JVMs per physical node that's
> sometimes helpful.
>
> One other possibility. In Solr 7.5, you have a ton of metrics
> available. If you hit the admin/metrics end point you'll see 150-200
> available metrics. Apart from running a profiler to see what's
> consuming the most cycles, the metrics can give you a view into what
> Solr is doing and may help you pinpoint what's using the most cycles.
>
> Best,
> Erick
> On Fri, Oct 26, 2018 at 12:23 PM Toke Eskildsen <to...@kb.dk> wrote:
> >
> > David Hastings <ha...@gmail.com> wrote:
> > > Would adding the docValues in the schema, but not reindexing, cause
> > > errors? IE, only apply the doc values after the next reindex, but in
> the
> > > meantime keep functioning as there were none until then?
> >
> > As soon as you specify in the schema that a field has docValues=true,
> Solr treats all existing documents as having docValues enabled for that
> field. As there is no docValue content, DocValues-aware functionality such
> as sorting and faceting will not work for that field, until the documents
> have been re-indexed.
> >
> > - Toke Eskildsen
>
--
Regards,
Shalin Shekhar Mangar.
Re: SolrCloud scaling/optimization for high request rate
Posted by Erick Erickson <er...@gmail.com>.
Sofiya:
I haven't said so before, but it's a great pleasure to work with
someone who's done a lot of homework before pinging the list. The only
unfortunate bit is that it usually means the simple "Oh, I can fix
that without thinking about it much" doesn't work ;)
2. I'll clarify a bit here. Any TLOG replica can become the leader.
Here's the process for an update:
> doc comes in to the leader (may be TLOG)
> doc is forwarded to all TLOG replicas, _but it is not indexed there_.
> If the leader fails, the other TLOG replicas have enough documents in _their_ tlogs to "catch up" and one is elected
> You're totally right that PULL replicas cannot become leaders
> having all TLOG replicas means that the CPU cycles otherwise consumed by indexing are available for query processing.
The point here is that TLOG replicas don't need to expend CPU cycles
to index documents, freeing up all those cycles for serving queries.
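Since Solr 7.0, the replica types can be chosen when a collection is created via the Collections API (`tlogReplicas`/`pullReplicas` parameters). A sketch of such a call; the host, collection name, and counts are illustrative, not a recommendation:

```python
# Sketch of a Collections API CREATE call using TLOG + PULL replicas
# (parameters available since Solr 7.0; names here are hypothetical).
from urllib.parse import urlencode

params = urlencode({
    "action": "CREATE",
    "name": "products",    # hypothetical collection name
    "numShards": 4,
    "tlogReplicas": 1,     # one TLOG replica per shard (leader-eligible)
    "pullReplicas": 2,     # query-only replicas fed by old-style replication
})
url = "http://solr-host:8983/solr/admin/collections?" + params
print(url)
```

Existing collections can be grown the same way with ADDREPLICA and a `type` parameter, rather than recreating the collection.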
Now, that said you report that QPS rate doesn't particularly seem to
be affected by whether you're indexing or not, so that makes using
TLOG and PULL replicas less likely to solve your problem. I was
thinking about your statement that you index as fast as possible....
6. This is a little surprising. Here's my guess: You're indexing in
large batches and the batch is only really occupying a thread or two
so it's effectively serialized thus not consuming a huge amount of
resources.
So unless G1 really solves a lot of problems, more replicas are
indicated. On machines with large amounts of RAM and lots of CPUs, one
other option is to run multiple JVMs per physical node that's
sometimes helpful.
One other possibility. In Solr 7.5, you have a ton of metrics
available. If you hit the admin/metrics end point you'll see 150-200
available metrics. Apart from running a profiler to see what's
consuming the most cycles, the metrics can give you a view into what
Solr is doing and may help you pinpoint what's using the most cycles.
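Since the metrics endpoint returns one large JSON document grouped by registry, a small filter helps when hunting for query-handler timings. A sketch; the response below is a trimmed stand-in, and the exact metric names should be checked against a real admin/metrics response:

```python
# Sketch: pull only QUERY.* metrics out of an admin/metrics-style response.
# `sample` stands in for the JSON returned by /solr/admin/metrics.
sample = {
    "metrics": {
        "solr.core.collection1.shard1.replica_n1": {
            "QUERY./select.requestTimes": {"mean_ms": 180.0, "p95_ms": 450.0},
            "UPDATE./update.requestTimes": {"mean_ms": 30.0},
        }
    }
}

def query_timings(metrics_json):
    out = {}
    for registry, metrics in metrics_json["metrics"].items():
        for name, value in metrics.items():
            if name.startswith("QUERY."):
                out[f"{registry}/{name}"] = value
    return out

print(query_timings(sample))
```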
Best,
Erick
On Fri, Oct 26, 2018 at 12:23 PM Toke Eskildsen <to...@kb.dk> wrote:
>
> David Hastings <ha...@gmail.com> wrote:
> > Would adding the docValues in the schema, but not reindexing, cause
> > errors? IE, only apply the doc values after the next reindex, but in the
> > meantime keep functioning as there were none until then?
>
> As soon as you specify in the schema that a field has docValues=true, Solr treats all existing documents as having docValues enabled for that field. As there is no docValue content, DocValues-aware functionality such as sorting and faceting will not work for that field until the documents have been re-indexed.
>
> - Toke Eskildsen
Re: SolrCloud scaling/optimization for high request rate
Posted by Toke Eskildsen <to...@kb.dk>.
David Hastings <ha...@gmail.com> wrote:
> Would adding the docValues in the schema, but not reindexing, cause
> errors? IE, only apply the doc values after the next reindex, but in the
> meantime keep functioning as there were none until then?
As soon as you specify in the schema that a field has docValues=true, Solr treats all existing documents as having docValues enabled for that field. As there is no docValue content, DocValues-aware functionality such as sorting and faceting will not work for that field until the documents have been re-indexed.
- Toke Eskildsen
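For concreteness, enabling docValues on a sort field is a one-attribute change in the schema (field name and type here are illustrative); per the above, existing documents must be re-indexed before sorting on it works:

```xml
<!-- schema.xml sketch: docValues on a field used in sort expressions.
     The field name is hypothetical. -->
<field name="popularity" type="plong" indexed="true" stored="true"
       docValues="true"/>
```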
Re: SolrCloud scaling/optimization for high request rate
Posted by David Hastings <ha...@gmail.com>.
Would adding the docValues in the schema, but not reindexing, cause
errors? IE, only apply the doc values after the next reindex, but in the
meantime keep functioning as there were none until then?
On Fri, Oct 26, 2018 at 2:15 PM Toke Eskildsen <to...@kb.dk> wrote:
> Sofiya Strochyk <ss...@interlogic.com.ua> wrote:
> > 5. Yes, docValues are enabled for the fields we sort on
> > (except score which is an internal field); [...]
>
> I am currently working on
> https://issues.apache.org/jira/browse/LUCENE-8374
> which speeds up DocValues-operations for indexes with many documents.
>
> What "many" means is hard to say, but as can be seen in the JIRA, Tim
> Underwood sees a nice speed up for faceting with his 80M doc index.
> Hopefully it can also benefit your 40M doc (per shard) index with sorting
> on (I infer) multiple DocValued fields. I'd be happy to assist, should you
> need help with the patch.
>
> - Toke Eskildsen
>
Re: SolrCloud scaling/optimization for high request rate
Posted by Toke Eskildsen <to...@kb.dk>.
Sofiya Strochyk <ss...@interlogic.com.ua> wrote:
> 5. Yes, docValues are enabled for the fields we sort on
> (except score which is an internal field); [...]
I am currently working on
https://issues.apache.org/jira/browse/LUCENE-8374
which speeds up DocValues-operations for indexes with many documents.
What "many" means is hard to say, but as can be seen in the JIRA, Tim Underwood sees a nice speed up for faceting with his 80M doc index. Hopefully it can also benefit your 40M doc (per shard) index with sorting on (I infer) multiple DocValued fields. I'd be happy to assist, should you need help with the patch.
- Toke Eskildsen
Re: SolrCloud scaling/optimization for high request rate
Posted by Sofiya Strochyk <ss...@interlogic.com.ua>.
Thanks Erick,
1. We already use Solr 7.5; we upgraded some of our nodes only recently
to see if this eliminates the difference in performance (it doesn't, but
I'll test whether the situation with replica syncing/recovery has
improved since then).
2. Yes, we only open a searcher once every 30 minutes, so it is not an
NRT case. But it is only recommended
<https://lucene.apache.org/solr/guide/7_1/shards-and-indexing-data-in-solrcloud.html#combining-replica-types-in-a-cluster>
to use the NRT/TLOG/TLOG+PULL replica types in certain combinations
(currently we have all NRT replicas). Would you suggest we change the
leaders to TLOG and the other replicas to PULL? And wouldn't this
eliminate the redundancy provided by replication, since PULL replicas
can't become leaders?
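The commit cadence referenced in point 2 (hard commit every minute without opening a searcher, soft commit every 30 minutes, per the original setup) corresponds to a solrconfig.xml fragment like the following sketch:

```xml
<!-- solrconfig.xml sketch matching the commit policy described in the
     thread; the values mirror the settings posted earlier. -->
<autoCommit>
  <maxTime>60000</maxTime>         <!-- hard commit every 1 minute -->
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>1800000</maxTime>       <!-- new searcher every 30 minutes -->
</autoSoftCommit>
```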
3. Yes, but then it would be reflected in the iowait metric, which is
almost always near zero on our servers. Is there anything else Solr
could be waiting for, and is there a way to check? If we are going to
need even more servers for faster responses and faceting, then there
must be a way to know which resource we should get more of.
5. Yes, docValues are enabled for the fields we sort on (except score,
which is an internal field); _version_ is left at its default, I think
(type="long" indexed="false" stored="false", and it is also marked as
having docValues in the admin UI).
6. QPS and response times seem to be about the same with and without
indexing; server load also looks about the same, so I assume indexing
doesn't take up many resources (a little strange, but possible if it is
limited by the network or by something else from point 3).
7. We will try G1 if nothing else helps... I haven't tested it yet
because it is considered unsafe and I'd like to exhaust all other
options first. (And even then it is probably going to be a minor
improvement; how much more efficient could it possibly be?)
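If G1 is eventually tried, the switch amounts to replacing the CMS flags in solr.in.sh. A sketch; GC_TUNE and SOLR_HEAP are the standard solr.in.sh variables, but the pause target and region size below are assumptions to experiment with, not recommendations:

```shell
# solr.in.sh sketch: switching the JVM from CMS to G1 (Solr 7.x ships
# CMS-based defaults). Tuning values here are starting points only.
SOLR_HEAP="8g"
GC_TUNE="-XX:+UseG1GC \
 -XX:+ParallelRefProcEnabled \
 -XX:MaxGCPauseMillis=250 \
 -XX:G1HeapRegionSize=8m"
echo "$GC_TUNE"
```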
On 26.10.18 19:18, Erick Erickson wrote:
> Some ideas:
>
> 1> What version of Solr? Solr 7.3 completely re-wrote Leader Initiated
> Recovery and 7.5 has other improvements for recovery, we're hoping
> that the recovery situation is much improved.
>
> 2> In the 7x code line, there are TLOG and PULL replicas. As of 7.5,
> you can set up so the queries are served by replica type, see:
> https://issues.apache.org/jira/browse/SOLR-11982. This might help you
> out. This moves all the indexing to the leader and reserves the rest
> of the nodes for queries only, using old-style replication. I'm
> assuming from your commit rate that latency between when updates
> happen and the updates are searchable isn't a big concern.
>
> 3> Just because the CPU isn't 100% doesn't mean Solr is running flat
> out. There's I/O waits while sub-requests are serviced and the like.
>
> 4> As for how to add faceting without slowing down querying, there's
> no way. Extra work is extra work. Depending on _what_ you're faceting
> on, you may be able to do some tricks, but without details it's hard
> to say. You need to get the query rate target first though ;)
>
> 5> OOMs Hmm, you say you're doing complex sorts, are all fields
> involved in sorts docValues=true? They have to be to be used in
> function queries of course, but what about any fields that aren't?
> What about your _version_ field?
>
> 6> bq. "...indexed 2 times/day, as fast as the SOLR allows..." One
> experiment I'd run is to test your QPS rate when there was _no_
> indexing going on. That would give you a hint as to whether the
> TLOG/PULL configuration would be helpful. There's been talk of
> separate thread pools for indexing and querying to give queries a
> better shot at the CPU, but that's not in place yet.
>
> 7> G1GC may also help rather than CMS, but as you're well aware GC
> tuning "is more art than science" ;).
>
> Good luck!
> Erick
>
> On Fri, Oct 26, 2018 at 8:55 AM Sofiya Strochyk <ss...@interlogic.com.ua> wrote:
>> Hi everyone,
>>
>> We have a SolrCloud setup with the following configuration:
>>
>> 4 nodes (3x128GB RAM Intel Xeon E5-1650v2, 1x64GB RAM Intel Xeon E5-1650v2, 12 cores, with SSDs)
>> One collection, 4 shards, each has only a single replica (so 4 replicas in total), using compositeId router
>> Total index size is about 150M documents/320GB, so about 40M/80GB per node
>> Zookeeper is on a separate server
>> Documents consist of about 20 fields (most of them are both stored and indexed), average document size is about 2kB
>> Queries are mostly 2-3 words in the q field, with 2 fq parameters, with complex sort expression (containing IF functions)
>> We don't use faceting due to performance reasons but need to add it in the future
>> Majority of the documents are reindexed 2 times/day, as fast as the SOLR allows, in batches of 1000-10000 docs. Some of the documents are also deleted (by id, not by query)
>> autoCommit is set to maxTime of 1 minute with openSearcher=false and autoSoftCommit maxTime is 30 minutes with openSearcher=true. Commits from clients are ignored.
>> Heap size is set to 8GB.
>>
>> Target query rate is up to 500 qps, maybe 300, and we need to keep response time at <200ms. But at the moment we only see very good search performance with up to 100 requests per second. Whenever it grows to about 200, average response time abruptly increases to 0.5-1 second. (Also it seems that request rate reported by SOLR in admin metrics is 2x higher than the real one, because for every query, every shard receives 2 requests: one to obtain IDs and second one to get data by IDs; so target rate for SOLR metrics would be 1000 qps).
>>
>> During high request load, CPU usage increases dramatically on the SOLR nodes. It doesn't reach 100% but averages at 50-70% on 3 servers and about 93% on 1 server (random server each time, not the smallest one).
>>
>> The documentation mentions replication to spread the load between the servers. We tested replicating to smaller servers (32GB RAM, Intel Core i7-4770). However, when we tested it, the replicas were going out of sync all the time (possibly during commits) and reported errors like "PeerSync Recovery was not successful - trying replication." Then they proceed with replication which takes hours and the leader handles all requests singlehandedly during that time. Also both leaders and replicas started encountering OOM errors (heap space) for unknown reason. Heap dump analysis shows that most of the memory is consumed by [J (array of long) type, my best guess would be that it is "_version_" field, but it's still unclear why it happens. Also, even though with replication request rate and CPU usage drop 2 times, it doesn't seem to affect mean_ms, stddev_ms or p95_ms numbers (p75_ms is much smaller on nodes with replication, but still not as low as under load of <100 requests/s).
>>
>> Garbage collection is much more active during high load as well. Full GC happens almost exclusively during those times. We have tried tuning GC options like suggested here and it didn't change things though.
>>
>> My questions are
>>
>> How do we increase throughput? Is replication the only solution?
>> if yes - then why doesn't it affect response times, considering that CPU is not 100% used and index fits into memory?
>> How to deal with OOM and replicas going into recovery?
>> Is memory or CPU the main problem? (When searching on the internet, i never see CPU as main bottleneck for SOLR, but our case might be different)
>> Or do we need smaller shards? Could segments merging be a problem?
>> How to add faceting without search queries slowing down too much?
>> How to diagnose these problems and narrow down to the real reason in hardware or setup?
>>
>> Any help would be much appreciated.
>>
>> Thanks!
>>
Re: SolrCloud scaling/optimization for high request rate
Posted by Erick Erickson <er...@gmail.com>.
Some ideas:
1> What version of Solr? Solr 7.3 completely re-wrote Leader Initiated
Recovery and 7.5 has other improvements for recovery, we're hoping
that the recovery situation is much improved.
2> In the 7x code line, there are TLOG and PULL replicas. As of 7.5,
you can set up so the queries are served by replica type, see:
https://issues.apache.org/jira/browse/SOLR-11982. This might help you
out. This moves all the indexing to the leader and reserves the rest
of the nodes for queries only, using old-style replication. I'm
assuming from your commit rate that latency between when updates
happen and the updates are searchable isn't a big concern.
3> Just because the CPU isn't 100% doesn't mean Solr is running flat
out. There's I/O waits while sub-requests are serviced and the like.
4> As for how to add faceting without slowing down querying, there's
no way. Extra work is extra work. Depending on _what_ you're faceting
on, you may be able to do some tricks, but without details it's hard
to say. You need to get the query rate target first though ;)
5> OOMs Hmm, you say you're doing complex sorts, are all fields
involved in sorts docValues=true? They have to be to be used in
function queries of course, but what about any fields that aren't?
What about your _version_ field?
6> bq. "...indexed 2 times/day, as fast as the SOLR allows..." One
experiment I'd run is to test your QPS rate when there was _no_
indexing going on. That would give you a hint as to whether the
TLOG/PULL configuration would be helpful. There's been talk of
separate thread pools for indexing and querying to give queries a
better shot at the CPU, but that's not in place yet.
7> G1GC may also help rather than CMS, but as you're well aware GC
tuning "is more art than science" ;).
Good luck!
Erick
On Fri, Oct 26, 2018 at 8:55 AM Sofiya Strochyk <ss...@interlogic.com.ua> wrote:
>
> Hi everyone,
>
> We have a SolrCloud setup with the following configuration:
>
> 4 nodes (3x128GB RAM Intel Xeon E5-1650v2, 1x64GB RAM Intel Xeon E5-1650v2, 12 cores, with SSDs)
> One collection, 4 shards, each has only a single replica (so 4 replicas in total), using compositeId router
> Total index size is about 150M documents/320GB, so about 40M/80GB per node
> Zookeeper is on a separate server
> Documents consist of about 20 fields (most of them are both stored and indexed), average document size is about 2kB
> Queries are mostly 2-3 words in the q field, with 2 fq parameters, with complex sort expression (containing IF functions)
> We don't use faceting due to performance reasons but need to add it in the future
> Majority of the documents are reindexed 2 times/day, as fast as the SOLR allows, in batches of 1000-10000 docs. Some of the documents are also deleted (by id, not by query)
> autoCommit is set to maxTime of 1 minute with openSearcher=false and autoSoftCommit maxTime is 30 minutes with openSearcher=true. Commits from clients are ignored.
> Heap size is set to 8GB.
>
> Target query rate is up to 500 qps, maybe 300, and we need to keep response time at <200ms. But at the moment we only see very good search performance with up to 100 requests per second. Whenever it grows to about 200, average response time abruptly increases to 0.5-1 second. (Also it seems that request rate reported by SOLR in admin metrics is 2x higher than the real one, because for every query, every shard receives 2 requests: one to obtain IDs and second one to get data by IDs; so target rate for SOLR metrics would be 1000 qps).
>
> During high request load, CPU usage increases dramatically on the SOLR nodes. It doesn't reach 100% but averages at 50-70% on 3 servers and about 93% on 1 server (random server each time, not the smallest one).
>
> The documentation mentions replication to spread the load between the servers. We tested replicating to smaller servers (32GB RAM, Intel Core i7-4770). However, when we tested it, the replicas were going out of sync all the time (possibly during commits) and reported errors like "PeerSync Recovery was not successful - trying replication." Then they proceed with replication which takes hours and the leader handles all requests singlehandedly during that time. Also both leaders and replicas started encountering OOM errors (heap space) for unknown reason. Heap dump analysis shows that most of the memory is consumed by [J (array of long) type, my best guess would be that it is "_version_" field, but it's still unclear why it happens. Also, even though with replication request rate and CPU usage drop 2 times, it doesn't seem to affect mean_ms, stddev_ms or p95_ms numbers (p75_ms is much smaller on nodes with replication, but still not as low as under load of <100 requests/s).
>
> Garbage collection is much more active during high load as well. Full GC happens almost exclusively during those times. We have tried tuning GC options like suggested here and it didn't change things though.
>
> My questions are
>
> How do we increase throughput? Is replication the only solution?
> if yes - then why doesn't it affect response times, considering that CPU is not 100% used and index fits into memory?
> How to deal with OOM and replicas going into recovery?
> Is memory or CPU the main problem? (When searching on the internet, i never see CPU as main bottleneck for SOLR, but our case might be different)
> Or do we need smaller shards? Could segments merging be a problem?
> How to add faceting without search queries slowing down too much?
> How to diagnose these problems and narrow down to the real reason in hardware or setup?
>
> Any help would be much appreciated.
>
> Thanks!
>
> --
> Sofiia Strochyk
>
>
>
> ss@interlogic.com.ua
>
> www.interlogic.com.ua
>
>
Re: SolrCloud scaling/optimization for high request rate
Posted by Sofiya Strochyk <ss...@interlogic.com.ua>.
Sure, i can test that, will set to zero now :)
We never tried a small number for the autowarming parameter, but it had
been running with zero (the default value) for a while before being
changed to 64, and startup after a commit was a bit slow then. Overall
there was rather little difference between 0 and 64, so the spike after
a commit may simply reflect the cost of opening a new searcher, which
autowarming can't avoid.
On 29.10.18 17:20, Erick Erickson wrote:
> Speaking of your caches... Either it's a problem with the metrics
> reporting or your warmup times are very, very long: 11 seconds and, er,
> 52 seconds! My guess is that you have your autowarm counts set to a
> very high number and are consuming a lot of CPU time every time a
> commit happens. Which will only happen when indexing. I usually start
> autowarms for these caches at < 20.
>
> Quick note on autowarm: These caches are a map with the key being the
> query and the value being some representation of the docs that satisfy
> it. Autowarming just replays the most recently used N of these.
> documentCache can't be autowarmed, so we can ignore it.
>
> So in your case, the main value of the queryResultCache is to read
> into memory all of the parts of the index to satisfy them, including,
> say, the sort structures (docValues), the index terms and, really,
> whatever is necessary. Ditto for the filterCache.
>
> The queryResultCache was originally intended to support paging, it
> only holds a few doc IDs per query. Memory wise it's pretty
> insignificant. Your hit ratio indicates you're not paging. All that
> said, the autowarm bit is much more important, so I wouldn't disable it
> entirely.
>
> Each filterCache entry is bounded by maxDoc/8 size-wise (plus some
> extra, but that's the number that usually counts). It may be smaller
> for sparse result sets but we can ignore that for now. You usually
> want this as small as possible and still get a decent hit ratio.
>
> The entire purpose of autowarm is so that the _first_ query that's run
> after a commit (hard with openSearcher=true or soft) isn't noticeably
> slower due to having to initially load parts of the index into memory.
> As the autowarm count goes up you pretty quickly hit diminishing
> returns.
>
> Now, all that may not be the actual problem, but here's a quick way to test:
>
> turn your autowarm counts off. What you should see is a correlation
> between when a commit happens and a small spike in response time for
> the first few queries, but otherwise a better query response profile.
> If that's true, try gradually increasing the autowarm count 10 at a
> time. My bet: If this is germane, you'll pretty soon see no difference
> in response times as you increase your autowarm count. I.e. there'll
> be no noticeable difference between 20 and 30 for instance. And your
> autowarm times will be drastically smaller. And most of the CPU you're
> expending to autowarm will be freed up to actually satisfy user
> queries.
>
> If any of that speculation is borne out, you have something that'll
> help. Or you have another blind alley ;)
>
> Best
> Erick
>
> On Mon, Oct 29, 2018 at 8:00 AM Sofiya Strochyk <ss...@interlogic.com.ua> wrote:
>> Hi Deepak and thanks for your reply,
>>
>>
>> On 27.10.18 10:35, Deepak Goel wrote:
>>
>>
>> Last, what is the nature of your requests? Are the queries the same, or are they very random? Random queries would need more tuning than if the queries are the same.
>>
>> The search term (q) is different for each query, and filter query terms (fq) are repeated very often. (so we have very little cache hit ratio for query result cache, and very high hit ratio for filter cache)
>>
>> --
>> Sofiia Strochyk
>>
>>
>>
>> ss@interlogic.com.ua
>>
>> www.interlogic.com.ua
>>
>>
--
Sofiia Strochyk
ss@interlogic.com.ua
InterLogic
www.interlogic.com.ua <https://www.interlogic.com.ua>
Re: SolrCloud scaling/optimization for high request rate
Posted by Erick Erickson <er...@gmail.com>.
Speaking of your caches... Either it's a problem with the metrics
reporting or your warmup times are very, very long: 11 seconds and, er,
52 seconds! My guess is that you have your autowarm counts set to a
very high number and are consuming a lot of CPU time every time a
commit happens. Which will only happen when indexing. I usually start
autowarms for these caches at < 20.
Quick note on autowarm: These caches are a map with the key being the
query and the value being some representation of the docs that satisfy
it. Autowarming just replays the most recently used N of these.
documentCache can't be autowarmed, so we can ignore it.
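A modest autowarm setting of the kind suggested above would look like this in solrconfig.xml (a sketch; class names and sizes follow the stock Solr example configs of the era, not this cluster's actual settings):

```xml
<filterCache class="solr.FastLRUCache"
             size="512" initialSize="512"
             autowarmCount="16"/>   <!-- replay the 16 most recently used filters -->
<queryResultCache class="solr.LRUCache"
                  size="512" initialSize="512"
                  autowarmCount="16"/>
<!-- documentCache has no autowarm; it is simply rebuilt on demand -->
<documentCache class="solr.LRUCache" size="512" initialSize="512"/>
```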
So in your case, the main value of the queryResultCache is to read
into memory all of the parts of the index to satisfy them, including,
say, the sort structures (docValues), the index terms and, really,
whatever is necessary. Ditto for the filterCache.
The queryResultCache was originally intended to support paging, it
only holds a few doc IDs per query. Memory wise it's pretty
insignificant. Your hit ratio indicates you're not paging. All that
said, the autowarm bit is much more important, so I wouldn't disable it
entirely.
Each filterCache entry is bounded by maxDoc/8 size-wise (plus some
extra, but that's the number that usually counts). It may be smaller
for sparse result sets but we can ignore that for now. You usually
want this as small as possible and still get a decent hit ratio.
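The maxDoc/8 bound above translates into concrete numbers for this index (a sketch; the ~40M docs/shard figure comes from the original post, and the cache size of 512 is only an assumed setting):

```python
max_doc = 40_000_000            # docs per shard (from the original post)
bytes_per_entry = max_doc // 8  # one bit per doc in the worst case: 5 MB
assumed_cache_size = 512        # hypothetical filterCache size setting

worst_case_bytes = bytes_per_entry * assumed_cache_size
print(bytes_per_entry)              # 5000000
print(worst_case_bytes / 2**30)     # ~2.38 GiB -- significant against an 8 GB heap
```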
The entire purpose of autowarm is so that the _first_ query that's run
after a commit (hard with openSearcher=true or soft) isn't noticeably
slower due to having to initially load parts of the index into memory.
As the autowarm count goes up you pretty quickly hit diminishing
returns.
Now, all that may not be the actual problem, but here's a quick way to test:
turn your autowarm counts off. What you should see is a correlation
between when a commit happens and a small spike in response time for
the first few queries, but otherwise a better query response profile.
If that's true, try gradually increasing the autowarm count 10 at a
time. My bet: If this is germane, you'll pretty soon see no difference
in response times as you increase your autowarm count. I.e. there'll
be no noticeable difference between 20 and 30 for instance. And your
autowarm times will be drastically smaller. And most of the CPU you're
expending to autowarm will be freed up to actually satisfy user
queries.
If any of that speculation is borne out, you have something that'll
help. Or you have another blind alley ;)
Best
Erick
On Mon, Oct 29, 2018 at 8:00 AM Sofiya Strochyk <ss...@interlogic.com.ua> wrote:
>
> Hi Deepak and thanks for your reply,
>
>
> On 27.10.18 10:35, Deepak Goel wrote:
>
>
> Last, what is the nature of your requests? Are the queries the same, or are they very random? Random queries would need more tuning than if the queries are the same.
>
> The search term (q) is different for each query, and filter query terms (fq) are repeated very often. (so we have very little cache hit ratio for query result cache, and very high hit ratio for filter cache)
>
> --
> Sofiia Strochyk
>
>
>
> ss@interlogic.com.ua
>
> www.interlogic.com.ua
>
>
Re: SolrCloud scaling/optimization for high request rate
Posted by Sofiya Strochyk <ss...@interlogic.com.ua>.
Sure, here is IO for bigger machine:
https://upload.cc/i1/2018/10/30/tQovyM.png
for smaller machine:
https://upload.cc/i1/2018/10/30/cP8DxU.png
CPU utilization including iowait:
https://upload.cc/i1/2018/10/30/eSs1YT.png
iowait only:
https://upload.cc/i1/2018/10/30/CHgx41.png
On 30.10.18 15:55, Deepak Goel wrote:
> Please see inline...
>
>
> Deepak
> "The greatness of a nation can be judged by the way its animals are
> treated. Please consider stopping the cruelty by becoming a Vegan"
>
> +91 73500 12833
> deicool@gmail.com <ma...@gmail.com>
>
> Facebook: https://www.facebook.com/deicool
> LinkedIn: www.linkedin.com/in/deicool <http://www.linkedin.com/in/deicool>
>
> "Plant a Tree, Go Green"
>
> Make In India : http://www.makeinindia.com/home
>
>
> On Tue, Oct 30, 2018 at 5:21 PM Sofiya Strochyk <ss@interlogic.com.ua
> <ma...@interlogic.com.ua>> wrote:
>
> My swappiness is set to 10, swap is almost not used (used space is
> on scale of a few MB) and there is no swap IO.
>
> There is disk IO like this, though:
>
> https://upload.cc/i1/2018/10/30/43lGfj.png
> https://upload.cc/i1/2018/10/30/T3u9oY.png
>
> ******
> The time for the data is too short. Can you provide for larger timeframes?
> ******
>
>
> However CPU iowait is still zero, so not sure if the disk io is
> introducing any kind of delay...
>
> ******
> Can you provide graphs for cpu iowait too? (For large timeframes)
> ******
>
> On 30.10.18 10:21, Deepak Goel wrote:
>> Yes. Swapping from disk to memory & vice versa
>>
>>
>> Deepak
>>
>>
>> On Mon, Oct 29, 2018 at 11:24 PM Sofiya Strochyk
>> <ss@interlogic.com.ua <ma...@interlogic.com.ua>> wrote:
>>
>> Could you please clarify what is memory disk layer? Do you
>> mean swapping from memory to disk, reading from disk to
>> memory, or something else?
>>
>>
>> On 29.10.18 17:20, Deepak Goel wrote:
>>> I would then suspect performance is choking in the memory-disk
>>> layer. Can you please check the performance?
>>>
>>> On Mon, 29 Oct 2018, 20:30 Sofiya Strochyk,
>>> <ss@interlogic.com.ua <ma...@interlogic.com.ua>> wrote:
>>>
>>> Hi Deepak and thanks for your reply,
>>>
>>>
>>> On 27.10.18 10:35, Deepak Goel wrote:
>>>>
>>>> Last, what is the nature of your requests? Are the
>>>> queries the same, or are they very random? Random
>>>> queries would need more tuning than if the queries are
>>>> the same.
>>> The search term (q) is different for each query, and
>>> filter query terms (fq) are repeated very often. (so we
>>> have very little cache hit ratio for query result cache,
>>> and very high hit ratio for filter cache)
>>>
>>> --
>>> *Sofiia Strochyk
>>> *
>>>
>>>
>>> ss@interlogic.com.ua <ma...@interlogic.com.ua>
>>> InterLogic
>>> www.interlogic.com.ua <https://www.interlogic.com.ua>
>>>
>>>
>>
>> --
>> *Sofiia Strochyk
>> *
>>
>>
>> ss@interlogic.com.ua <ma...@interlogic.com.ua>
>> InterLogic
>> www.interlogic.com.ua <https://www.interlogic.com.ua>
>>
>>
>
> --
> *Sofiia Strochyk
> *
>
>
> ss@interlogic.com.ua <ma...@interlogic.com.ua>
> InterLogic
> www.interlogic.com.ua <https://www.interlogic.com.ua>
>
>
--
Sofiia Strochyk
ss@interlogic.com.ua
InterLogic
www.interlogic.com.ua <https://www.interlogic.com.ua>
Re: SolrCloud scaling/optimization for high request rate
Posted by Deepak Goel <de...@gmail.com>.
Please see inline...
Deepak
deicool@gmail.com
On Tue, Oct 30, 2018 at 5:21 PM Sofiya Strochyk <ss...@interlogic.com.ua>
wrote:
> My swappiness is set to 10, swap is almost not used (used space is on
> scale of a few MB) and there is no swap IO.
>
> There is disk IO like this, though:
>
> https://upload.cc/i1/2018/10/30/43lGfj.png
> https://upload.cc/i1/2018/10/30/T3u9oY.png
>
******
The time for the data is too short. Can you provide for larger timeframes?
******
>
> However CPU iowait is still zero, so not sure if the disk io is
> introducing any kind of delay...
>
> ******
Can you provide graphs for cpu iowait too? (For large timeframes)
******
> On 30.10.18 10:21, Deepak Goel wrote:
>
> Yes. Swapping from disk to memory & vice versa
>
>
> Deepak
> deicool@gmail.com
>
>
> On Mon, Oct 29, 2018 at 11:24 PM Sofiya Strochyk <ss...@interlogic.com.ua>
> wrote:
>
>> Could you please clarify what is memory disk layer? Do you mean swapping
>> from memory to disk, reading from disk to memory, or something else?
>>
>> On 29.10.18 17:20, Deepak Goel wrote:
>>
>> I would then suspect performance is choking in the memory-disk layer. Can
>> you please check the performance?
>>
>> On Mon, 29 Oct 2018, 20:30 Sofiya Strochyk, <ss...@interlogic.com.ua> wrote:
>>
>>> Hi Deepak and thanks for your reply,
>>>
>>> On 27.10.18 10:35, Deepak Goel wrote:
>>>
>>>
>>> Last, what is the nature of your requests? Are the queries the same, or
>>> are they very random? Random queries would need more tuning than if the
>>> queries are the same.
>>>
>>> The search term (q) is different for each query, and filter query terms
>>> (fq) are repeated very often. (so we have very little cache hit ratio for
>>> query result cache, and very high hit ratio for filter cache)
>>>
>>> --
>>>
>>> *Sofiia Strochyk *
>>>
>>>
>>> ss@interlogic.com.ua
>>> [image: InterLogic]
>>> www.interlogic.com.ua
>>>
>>>
>>
>> --
>>
>> *Sofiia Strochyk *
>>
>>
>> ss@interlogic.com.ua
>> [image: InterLogic]
>> www.interlogic.com.ua
>>
>>
>
> --
>
> *Sofiia Strochyk *
>
>
> ss@interlogic.com.ua
> [image: InterLogic]
> www.interlogic.com.ua
>
>
Re: SolrCloud scaling/optimization for high request rate
Posted by Sofiya Strochyk <ss...@interlogic.com.ua>.
My swappiness is set to 10, swap is almost not used (used space is on
scale of a few MB) and there is no swap IO.
There is disk IO like this, though:
https://upload.cc/i1/2018/10/30/43lGfj.png
https://upload.cc/i1/2018/10/30/T3u9oY.png
However CPU iowait is still zero, so not sure if the disk io is
introducing any kind of delay...
On 30.10.18 10:21, Deepak Goel wrote:
> Yes. Swapping from disk to memory & vice versa
>
>
> Deepak
> deicool@gmail.com
>
>
> On Mon, Oct 29, 2018 at 11:24 PM Sofiya Strochyk <ss@interlogic.com.ua
> <ma...@interlogic.com.ua>> wrote:
>
> Could you please clarify what is memory disk layer? Do you mean
> swapping from memory to disk, reading from disk to memory, or
> something else?
>
>
> On 29.10.18 17:20, Deepak Goel wrote:
>> I would then suspect performance is choking in the memory-disk layer.
>> Can you please check the performance?
>>
>> On Mon, 29 Oct 2018, 20:30 Sofiya Strochyk, <ss@interlogic.com.ua
>> <ma...@interlogic.com.ua>> wrote:
>>
>> Hi Deepak and thanks for your reply,
>>
>>
>> On 27.10.18 10:35, Deepak Goel wrote:
>>>
>>> Last, what is the nature of your requests? Are the queries
>>> the same, or are they very random? Random queries would need
>>> more tuning than if the queries are the same.
>> The search term (q) is different for each query, and filter
>> query terms (fq) are repeated very often. (so we have very
>> little cache hit ratio for query result cache, and very high
>> hit ratio for filter cache)
>>
>> --
>> *Sofiia Strochyk
>> *
>>
>>
>> ss@interlogic.com.ua <ma...@interlogic.com.ua>
>> InterLogic
>> www.interlogic.com.ua <https://www.interlogic.com.ua>
>>
>>
>
> --
> *Sofiia Strochyk
> *
>
>
> ss@interlogic.com.ua <ma...@interlogic.com.ua>
> InterLogic
> www.interlogic.com.ua <https://www.interlogic.com.ua>
>
>
--
Sofiia Strochyk
ss@interlogic.com.ua
InterLogic
www.interlogic.com.ua <https://www.interlogic.com.ua>
Re: SolrCloud scaling/optimization for high request rate
Posted by Deepak Goel <de...@gmail.com>.
Yes. Swapping from disk to memory & vice versa
Deepak
deicool@gmail.com
On Mon, Oct 29, 2018 at 11:24 PM Sofiya Strochyk <ss...@interlogic.com.ua>
wrote:
> Could you please clarify what is memory disk layer? Do you mean swapping
> from memory to disk, reading from disk to memory, or something else?
>
> On 29.10.18 17:20, Deepak Goel wrote:
>
> I would then suspect performance is choking in the memory-disk layer. Can
> you please check the performance?
>
> On Mon, 29 Oct 2018, 20:30 Sofiya Strochyk, <ss...@interlogic.com.ua> wrote:
>
>> Hi Deepak and thanks for your reply,
>>
>> On 27.10.18 10:35, Deepak Goel wrote:
>>
>>
>> Last, what is the nature of your requests? Are the queries the same, or
>> are they very random? Random queries would need more tuning than if the
>> queries are the same.
>>
>> The search term (q) is different for each query, and filter query terms
>> (fq) are repeated very often. (so we have very little cache hit ratio for
>> query result cache, and very high hit ratio for filter cache)
>>
>> --
>>
>> *Sofiia Strochyk *
>>
>>
>> ss@interlogic.com.ua
>> [image: InterLogic]
>> www.interlogic.com.ua
>>
>>
>
> --
>
> *Sofiia Strochyk *
>
>
> ss@interlogic.com.ua
> [image: InterLogic]
> www.interlogic.com.ua
>
>
Re: SolrCloud scaling/optimization for high request rate
Posted by Shawn Heisey <ap...@elyograg.org>.
On 10/29/2018 8:56 PM, Erick Erickson wrote:
> The interval between when a commit happens and all the autowarm
> queries are finished is 52 seconds for the filterCache. I've never
> seen warming take that long unless something's very unusual. I'd
> actually be very surprised if you're really only firing 64 autowarm
> queries and it's taking almost 52 seconds.
I wouldn't be surprised. The servers I used to manage had a filterCache
autowarmCount of *four*. Commits sometimes still took up to 15 seconds
to happen, and almost all of that time was filterCache warming. Before
I reduced autowarmCount, it could take a VERY long time on occasion for
commits. Some users had such large filters that I was forced to
increase the max allowed header size for Solr's container to 32K, so
that the occasional query with a 20K URL size could be handled.
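For reference, raising the container's header limit as described is typically done in Solr's server/etc/jetty.xml (a sketch; the element names follow stock Jetty configs of the era and may differ by version):

```xml
<New id="httpConfig" class="org.eclipse.jetty.server.HttpConfiguration">
  <!-- default is 8192; 32768 accommodates ~20K-character filter queries in the URL -->
  <Set name="requestHeaderSize">32768</Set>
</New>
```

An alternative that avoids the header limit entirely is sending large queries as POST bodies rather than in the URL.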
Thanks,
Shawn
Re: SolrCloud scaling/optimization for high request rate
Posted by Erick Erickson <er...@gmail.com>.
Sofiya:
The interval between when a commit happens and all the autowarm
queries are finished is 52 seconds for the filterCache. I've never seen
warming take that long unless something's very unusual. I'd actually be
very surprised if you're really only firing 64 autowarm queries and
it's taking almost 52 seconds.
However, if it really is taking that long for that few queries, then
most of what I think I know about this subject is probably wrong
anyway. I guess it could just be CPU starvation in that case. At any
rate, it's an anomaly that should get an explanation.
I suppose the other approach would be if you can get the autowarm time
when there's not much querying going on. If you get autowarm times of,
say, 40 seconds when there's no querying going on then it's not the
number of autowarm queries, but "something about your queries" that's
slowing things down, at least that gives you a place to start looking.
And I should have mentioned that when I think about excessive autowarm
counts, I've seen them in the thousands; 64 isn't really excessive,
even if diminishing returns start around 10-20...
Good Luck!
Erick
On Mon, Oct 29, 2018 at 10:54 AM Sofiya Strochyk <ss...@interlogic.com.ua> wrote:
>
> Could you please clarify what is memory disk layer? Do you mean swapping from memory to disk, reading from disk to memory, or something else?
>
>
> On 29.10.18 17:20, Deepak Goel wrote:
>
> I would then suspect performance is choking in the memory-disk layer. Can you please check the performance?
>
> On Mon, 29 Oct 2018, 20:30 Sofiya Strochyk, <ss...@interlogic.com.ua> wrote:
>>
>> Hi Deepak and thanks for your reply,
>>
>>
>> On 27.10.18 10:35, Deepak Goel wrote:
>>
>>
>> Last, what is the nature of your requests? Are the queries the same, or are they very random? Random queries would need more tuning than if the queries are the same.
>>
>> The search term (q) is different for each query, and filter query terms (fq) are repeated very often. (so we have very little cache hit ratio for query result cache, and very high hit ratio for filter cache)
>>
>> --
>> Sofiia Strochyk
>>
>>
>>
>> ss@interlogic.com.ua
>>
>> www.interlogic.com.ua
>>
>>
>
>
> --
> Sofiia Strochyk
>
>
>
> ss@interlogic.com.ua
>
> www.interlogic.com.ua
>
>
Re: SolrCloud scaling/optimization for high request rate
Posted by Sofiya Strochyk <ss...@interlogic.com.ua>.
Could you please clarify what is memory disk layer? Do you mean swapping
from memory to disk, reading from disk to memory, or something else?
On 29.10.18 17:20, Deepak Goel wrote:
> I would then suspect performance is choking in the memory-disk layer.
> Can you please check the performance?
>
> On Mon, 29 Oct 2018, 20:30 Sofiya Strochyk, <ss@interlogic.com.ua
> <ma...@interlogic.com.ua>> wrote:
>
> Hi Deepak and thanks for your reply,
>
>
> On 27.10.18 10:35, Deepak Goel wrote:
>>
>> Last, what is the nature of your requests? Are the queries the
>> same, or are they very random? Random queries would need more
>> tuning than if the queries are the same.
> The search term (q) is different for each query, and filter query
> terms (fq) are repeated very often. (so we have very little cache
> hit ratio for query result cache, and very high hit ratio for
> filter cache)
>
> --
> *Sofiia Strochyk
> *
>
>
> ss@interlogic.com.ua <ma...@interlogic.com.ua>
> InterLogic
> www.interlogic.com.ua <https://www.interlogic.com.ua>
>
>
--
Sofiia Strochyk
ss@interlogic.com.ua
InterLogic
www.interlogic.com.ua <https://www.interlogic.com.ua>
Re: SolrCloud scaling/optimization for high request rate
Posted by Deepak Goel <de...@gmail.com>.
I would then suspect performance is choking in the memory-disk layer. Can
you please check the performance?
On Mon, 29 Oct 2018, 20:30 Sofiya Strochyk, <ss...@interlogic.com.ua> wrote:
> Hi Deepak and thanks for your reply,
>
> On 27.10.18 10:35, Deepak Goel wrote:
>
>
> Last, what is the nature of your requests? Are the queries the same, or
> are they very random? Random queries would need more tuning than if the
> queries are the same.
>
> The search term (q) is different for each query, and filter query terms
> (fq) are repeated very often. (so we have very little cache hit ratio for
> query result cache, and very high hit ratio for filter cache)
>
> --
>
> *Sofiia Strochyk *
>
>
> ss@interlogic.com.ua
> [image: InterLogic]
> www.interlogic.com.ua
>
>
Re: SolrCloud scaling/optimization for high request rate
Posted by Sofiya Strochyk <ss...@interlogic.com.ua>.
Hi Deepak and thanks for your reply,
On 27.10.18 10:35, Deepak Goel wrote:
>
> Last, what is the nature of your requests? Are the queries the same, or
> are they very random? Random queries would need more tuning than if
> the queries are the same.
The search term (q) is different for each query, and filter query terms
(fq) are repeated very often. (so we have very little cache hit ratio
for query result cache, and very high hit ratio for filter cache)
--
Sofiia Strochyk
ss@interlogic.com.ua
InterLogic
www.interlogic.com.ua <https://www.interlogic.com.ua>
Re: SolrCloud scaling/optimization for high request rate
Posted by Deepak Goel <de...@gmail.com>.
On Fri, Oct 26, 2018 at 9:25 PM Sofiya Strochyk <ss...@interlogic.com.ua>
wrote:
> Hi everyone,
>
> We have a SolrCloud setup with the following configuration:
>
> - 4 nodes (3x128GB RAM Intel Xeon E5-1650v2, 1x64GB RAM Intel Xeon
> E5-1650v2, 12 cores, with SSDs)
> - One collection, 4 shards, each has only a single replica (so 4
> replicas in total), using compositeId router
> - Total index size is about 150M documents/320GB, so about 40M/80GB
> per node
> - Zookeeper is on a separate server
> - Documents consist of about 20 fields (most of them are both stored
> and indexed), average document size is about 2kB
> - Queries are mostly 2-3 words in the q field, with 2 fq parameters,
> with complex sort expression (containing IF functions)
> - We don't use faceting due to performance reasons but need to add it
> in the future
> - Majority of the documents are reindexed 2 times/day, as fast as the
> SOLR allows, in batches of 1000-10000 docs. Some of the documents are also
> deleted (by id, not by query)
> - autoCommit is set to maxTime of 1 minute with openSearcher=false and
> autoSoftCommit maxTime is 30 minutes with openSearcher=true. Commits from
> clients are ignored.
> - Heap size is set to 8GB.
>
> Target query rate is up to 500 qps, maybe 300, and we need to keep
> response time at <200ms. But at the moment we only see very good search
> performance with up to 100 requests per second. Whenever it grows to about
> 200, average response time abruptly increases to 0.5-1 second. (Also it
> seems that request rate reported by SOLR in admin metrics is 2x higher than
> the real one, because for every query, every shard receives 2 requests: one
> to obtain IDs and second one to get data by IDs; so target rate for SOLR
> metrics would be 1000 qps).
>
> During high request load, CPU usage increases dramatically on the SOLR
> nodes. It doesn't reach 100% but averages at 50-70% on 3 servers and about
> 93% on 1 server (random server each time, not the smallest one).
>
> The documentation mentions replication to spread the load between the
> servers. We tested replicating to smaller servers (32GB RAM, Intel Core
> i7-4770). However, when we tested it, the replicas were going out of sync
> all the time (possibly during commits) and reported errors like "PeerSync
> Recovery was not successful - trying replication." Then they proceed with
> replication which takes hours and the leader handles all requests
> singlehandedly during that time. Also both leaders and replicas started
> encountering OOM errors (heap space) for unknown reason. Heap dump analysis
> shows that most of the memory is consumed by [J (array of long) type, my
> best guess would be that it is "_version_" field, but it's still unclear
> why it happens. Also, even though with replication request rate and CPU
> usage drop 2 times, it doesn't seem to affect mean_ms, stddev_ms or p95_ms
> numbers (p75_ms is much smaller on nodes with replication, but still not as
> low as under load of <100 requests/s).
>
> Garbage collection is much more active during high load as well. Full GC
> happens almost exclusively during those times. We have tried tuning GC
> options like suggested here
> <https://wiki.apache.org/solr/ShawnHeisey#CMS_.28ConcurrentMarkSweep.29_Collector>
> and it didn't change things though.
>
> My questions are
>
> - How do we increase throughput? Is replication the only solution?
>
> 1. Increase the CPU speed
2. Increase the heap size (& tune the GC)
3. Replication
4. Have one more node on the hardware server (if cpu is not reaching 100%)
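Items 1-2 above are usually applied through bin/solr.in.sh (a sketch; the heap value is illustrative and the GC_TUNE flags shown are the CMS-style options common for Solr at the time, not a recommendation for this cluster):

```shell
# bin/solr.in.sh
SOLR_HEAP="16g"                 # illustrative; size against GC logs, not guesswork
GC_TUNE="-XX:+UseConcMarkSweepGC \
         -XX:+UseParNewGC \
         -XX:CMSInitiatingOccupancyFraction=70 \
         -XX:+UseCMSInitiatingOccupancyOnly"
```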
>
> -
> - if yes - then why doesn't it affect response times, considering that
> CPU is not 100% used and index fits into memory?
> - How to deal with OOM and replicas going into recovery?
>
> 1. Increase the heap size
2. Memory debug to check for memory leaks (rare)
>
> - Is memory or CPU the main problem? (When searching on the internet,
> i never see CPU as main bottleneck for SOLR, but our case might be
> different)
>
> 1. Could be both
>
> -
> - Or do we need smaller shards? Could segments merging be a problem?
> - How to add faceting without search queries slowing down too much?
> - How to diagnose these problems and narrow down to the real reason in
> hardware or setup?
>
> 1. I would first tune all the software (OS, JVM, Solr) & benchmark the
current hardware setup
2. Then I would play around with the hardware to check performance benefits
>
>
> Any help would be much appreciated.
>
An increase in response time to 1 sec when you bump up the load indicates
queuing happening in your setup. (Since CPU is not 100% utilised, it most
likely indicates a memory, disk, network or software problem.)
Last, what is the nature of your request. Are the queries the same? Or they
are very random? Random queries would need more tuning than if the queries
the same.
> Thanks!
> --
>
> *Sofiia Strochyk *
>
>
> ss@interlogic.com.ua
> [image: InterLogic]
> www.interlogic.com.ua
>
> [image: Facebook icon] <https://www.facebook.com/InterLogicOfficial> [image:
> LinkedIn icon] <https://www.linkedin.com/company/interlogic>
>
Deepak
"The greatness of a nation can be judged by the way its animals are
treated. Please consider stopping the cruelty by becoming a Vegan"
+91 73500 12833
deicool@gmail.com
Facebook: https://www.facebook.com/deicool
LinkedIn: www.linkedin.com/in/deicool
"Plant a Tree, Go Green"
Make In India : http://www.makeinindia.com/home
Re: SolrCloud scaling/optimization for high request rate
Posted by Toke Eskildsen <to...@kb.dk>.
On Mon, 2018-10-29 at 10:55 +0200, Sofiya Strochyk wrote:
> I think we could try that, but most likely it turns out that at some
> point we are receiving 300 requests per second, and are able to
> reasonably handle 150 per second, which means everything else is
> going to be kept in the growing queue and increase response times
> even further..
Just as there should always be an upper limit on concurrent
connections, so should there be a limit on the queue. If your system is
overwhelmed (and you don't have some fancy auto-add-hardware), there
are only two possibilities: Crash the system or discard requests. It is
rarely the case that crashing the system is the acceptable action.
Queueing works to avoid congestion (improving throughput) and to avoid
crashes due to overwhelming. If the queue runs full and needs to
discard requests, turning off the queue would just mean that the system
is overwhelmed instead.
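A minimal sketch of that idea in Python (illustrative names only; in a real deployment this logic lives in the load balancer or servlet container): a fixed-capacity queue that rejects requests immediately once full, instead of letting an unbounded backlog inflate response times.

```python
import queue

class BoundedRequestQueue:
    """Fixed-capacity request queue: shed load instead of growing forever."""
    def __init__(self, capacity: int):
        self._pending = queue.Queue(maxsize=capacity)
        self.rejected = 0

    def offer(self, request) -> bool:
        try:
            self._pending.put_nowait(request)  # a worker will pick it up
            return True
        except queue.Full:
            self.rejected += 1  # caller gets an immediate error (e.g. HTTP 503)
            return False
```

The rejected caller can retry later or be served a degraded response; either way the backend never builds a runaway backlog.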
> Also, if one node has 12 cores that would mean it can process 12
> concurrent searches? And since every request is sent to all shards to
> check if there are results, does this also mean the whole cluster can
> handle 12 concurrent requests on average?
It is typically the case that threads go idle while waiting for data
from memory or storage. This means that you get more performance out of
running more concurrent jobs than the number of CPUs.
How much one should over-provision is very hard to generalize, which is
why I suggest measuring (which of course also takes resources, this
time in the form of work hours). My rough suggestion of a factor 10 for
your system is guesswork erring on the side of a high number.
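One way to turn the guesswork into numbers is Little's Law (sustained throughput = concurrency / average latency); a tiny Python helper, assuming you have measured average latency at a given concurrency:

```python
def max_throughput_qps(concurrency: int, avg_latency_s: float) -> float:
    """Little's Law: lambda = L / W.
    With `concurrency` requests in flight and an average latency of
    `avg_latency_s` seconds, this is the best sustained request rate."""
    return concurrency / avg_latency_s

# e.g. 12 truly parallel searches at 200 ms each sustain ~60 qps, so
# reaching 500 qps at 200 ms needs ~100 requests in flight (spread over
# replicas), or a lower per-request latency.
```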
- Toke Eskildsen, Royal Danish Library
Re: SolrCloud scaling/optimization for high request rate
Posted by Sofiya Strochyk <ss...@interlogic.com.ua>.
I think we could try that, but most likely it turns out that at some
point we are receiving 300 requests per second, and are able to
reasonably handle 150 per second, which means everything else is going
to be kept in the growing queue and increase response times even further.
Also, if one node has 12 cores that would mean it can process 12
concurrent searches? And since every request is sent to all shards to
check if there are results, does this also mean the whole cluster can
handle 12 concurrent requests on average?
On 27.10.18 09:00, Toke Eskildsen wrote:
> Sofiya Strochyk <ss...@interlogic.com.ua> wrote:
>> Target query rate is up to 500 qps, maybe 300, and we need
>> to keep response time at <200ms. But at the moment we only
>> see very good search performance with up to 100 requests
>> per second. Whenever it grows to about 200, average response
>> time abruptly increases to 0.5-1 second.
> Keep in mind that upping the number of concurrent searches in Solr does not raise throughput, if the system is already saturated. On the contrary, this will lower throughput due to thread- and memory-congestion.
>
> As your machines have 12 cores (including HyperThreading) and IO does not seem to be an issue, 500 or even just 200 concurrent searches seem likely to result in lower throughput than (really guessing here) 100 concurrent searches. As Walther points out, the end result is collapse, but slowdown happens before that.
>
> Consider putting a proxy in front with a max amount of concurrent connections and a sensible queue. Preferably after a bit of testing to locate where the highest throughput is. It won't make you hit your overall goal, but it can move you closer to it.
>
> - Toke Eskildsen
Re: SolrCloud scaling/optimization for high request rate
Posted by Toke Eskildsen <to...@kb.dk>.
Sofiya Strochyk <ss...@interlogic.com.ua> wrote:
> Target query rate is up to 500 qps, maybe 300, and we need
> to keep response time at <200ms. But at the moment we only
> see very good search performance with up to 100 requests
> per second. Whenever it grows to about 200, average response
> time abruptly increases to 0.5-1 second.
Keep in mind that upping the number of concurrent searches in Solr does not raise throughput, if the system is already saturated. On the contrary, this will lower throughput due to thread- and memory-congestion.
As your machines have 12 cores (including HyperThreading) and IO does not seem to be an issue, 500 or even just 200 concurrent searches seem likely to result in lower throughput than (really guessing here) 100 concurrent searches. As Walther points out, the end result is collapse, but slowdown happens before that.
Consider putting a proxy in front with a max amount of concurrent connections and a sensible queue. Preferably after a bit of testing to locate where the highest throughput is. It won't make you hit your overall goal, but it can move you closer to it.
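A sketch of the admission-control part in Python (names invented; in practice this would be a setting in nginx or haproxy rather than application code): a hard cap on in-flight searches, with immediate rejection beyond it.

```python
import threading

class ConcurrencyLimiter:
    """Cap the number of in-flight searches; reject the overflow at once
    so the backend never sees more concurrency than it can handle."""
    def __init__(self, max_concurrent: int):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def try_enter(self) -> bool:
        # Non-blocking: returns False instead of queueing when saturated.
        return self._slots.acquire(blocking=False)

    def leave(self) -> None:
        self._slots.release()
```

The right value for max_concurrent is exactly what the testing above is meant to find.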
- Toke Eskildsen
Re: SolrCloud scaling/optimization for high request rate
Posted by Sofiya Strochyk <ss...@interlogic.com.ua>.
Thanks to everyone for the suggestions. We managed to get the
performance to a bearable level by splitting the index into ~20 separate
collections (one collection per country) and spreading them between
existing servers as evenly as possible. The largest country is also
split into 2 shards. This means that
1. QPS is lower for each instance since it only receives requests to the
corresponding country.
2. Index size is smaller for each instance as it only contains documents
for the corresponding country.
3. If one instance fails then most of the other instances keep running
(possibly except the ones colocated with the failed one)
We didn't make any changes to the main query, but have added a few
fields to facet on. This had a small negative impact on performance but
overall kept working nicely.
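The routing side of such a split is trivial on the client; a hypothetical sketch (the collection naming scheme and base URL here are invented, not from the actual setup):

```python
SOLR_BASE = "http://solr.example.com:8983/solr"  # hypothetical address

def select_url(country_code: str) -> str:
    """Route a search to its per-country collection. Solr itself handles
    the 2-shard split inside the largest country's collection."""
    collection = f"products_{country_code.lower()}"  # hypothetical naming
    return f"{SOLR_BASE}/{collection}/select"
```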
On 14.11.18 12:18, Toke Eskildsen wrote:
> On Mon, 2018-11-12 at 14:19 +0200, Sofiya Strochyk wrote:
>> I'll check if the filter queries or the main query tokenizers/filters
>> might have anything to do with this, but I'm afraid query
>> optimization can only get us so far.
> Why do you think that? As you tried eliminating sorting and retrieval
> previously, the queries are all that's left. There are multiple
> performance traps when querying and a lot of them can be bypassed by
> changing the index or querying in a different way.
>
>> Since we will have to add facets later, the queries will only become
>> heavier, and there has to be a way to scale this setup and deal with
>> both higher load and more complex queries.
> There is of course a way. It is more a question of what you are willing
> to pay.
>
> If you have money, just buy more hardware: We know (with very high
> probability) that it will work as your problem is search throughput,
> which can be solved by adding more replicas on extra machines.
>
> If you have more engineering hours, you can use them on some of the
> things discussed previously:
>
> * Pinpoint query bottlenecks
> * Use fewer/more shards
> * Apply https://issues.apache.org/jira/browse/LUCENE-8374
> * Experiment with different amounts of concurrent requests to see what
> gives the optimum throughput. This also tells you how much extra
> hardware you need, if you decide you need to expand.
>
>
> - Toke Eskildsen, Royal Danish Library
>
>
Re: SolrCloud scaling/optimization for high request rate
Posted by Toke Eskildsen <to...@kb.dk>.
On Mon, 2018-11-12 at 14:19 +0200, Sofiya Strochyk wrote:
> I'll check if the filter queries or the main query tokenizers/filters
> might have anything to do with this, but I'm afraid query
> optimization can only get us so far.
Why do you think that? As you tried eliminating sorting and retrieval
previously, the queries are all that's left. There are multiple
performance traps when querying and a lot of them can be bypassed by
changing the index or querying in a different way.
> Since we will have to add facets later, the queries will only become
> heavier, and there has to be a way to scale this setup and deal with
> both higher load and more complex queries.
There is of course a way. It is more a question of what you are willing
to pay.
If you have money, just buy more hardware: We know (with very high
probability) that it will work as your problem is search throughput,
which can be solved by adding more replicas on extra machines.
If you have more engineering hours, you can use them on some of the
things discussed previously:
* Pinpoint query bottlenecks
* Use fewer/more shards
* Apply https://issues.apache.org/jira/browse/LUCENE-8374
* Experiment with different amounts of concurrent requests to see what
gives the optimum throughput. This also tells you how much extra
hardware you need, if you decide you need to expand.
- Toke Eskildsen, Royal Danish Library
Re: SolrCloud scaling/optimization for high request rate
Posted by Sofiya Strochyk <ss...@interlogic.com.ua>.
Thanks for your suggestions. I'll check if the filter queries or the
main query tokenizers/filters might have anything to do with this, but
I'm afraid query optimization can only get us so far. Since we will have
to add facets later, the queries will only become heavier, and there has
to be a way to scale this setup and deal with both higher load and more
complex queries.
On 08.11.18 10:53, Toke Eskildsen wrote:
> On Tue, 2018-11-06 at 16:38 +0200, Sofiya Strochyk wrote:
>> I have tested all of the suggested changes; none of these seem to make
>> a noticeable difference (usually response time and other metrics
>> fluctuate over time, and the changes caused by different parameters
>> are smaller than the fluctuations). What this probably means is that
>> the heaviest task is retrieving IDs by query and not fields by ID.
> Barring anything overlooked, I agree on the query thing.
>
> Were I to sit at the machine, I would try removing part of the query
> until performance were satisfactory. Hopefully that would unearth very
> few problematic parts, such as regexp, function or prefix-wildcard
> queries. There might be ways to replace or tune those.
>
> - Toke Eskildsen, Royal Danish Library
>
>
Re: SolrCloud scaling/optimization for high request rate
Posted by Toke Eskildsen <to...@kb.dk>.
On Tue, 2018-11-06 at 16:38 +0200, Sofiya Strochyk wrote:
> I have tested all of the suggested changes; none of these seem to make
> a noticeable difference (usually response time and other metrics
> fluctuate over time, and the changes caused by different parameters
> are smaller than the fluctuations). What this probably means is that
> the heaviest task is retrieving IDs by query and not fields by ID.
Barring anything overlooked, I agree on the query thing.
Were I to sit at the machine, I would try removing part of the query
until performance were satisfactory. Hopefully that would unearth very
few problematic parts, such as regexp, function or prefix-wildcard
queries. There might be ways to replace or tune those.
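That elimination process is easy to script; a rough Python sketch (the query runner and the clause list are placeholders): time the query with each clause removed in turn, and a clause whose removal gives a large speedup is the likely bottleneck.

```python
import time

def median_time(run_query, params, samples=20):
    """Median wall-clock seconds of run_query(params) over `samples` runs."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        run_query(params)
        timings.append(time.perf_counter() - start)
    return sorted(timings)[samples // 2]

def clause_impact(run_query, base_params, clauses):
    """Re-time the query with each clause dropped in turn; a big speedup
    when a clause is gone points at that clause as the expensive one."""
    results = {}
    for clause in clauses:
        remaining = [c for c in clauses if c != clause]
        params = dict(base_params, q=" AND ".join(remaining) or "*:*")
        results[clause] = median_time(run_query, params)
    return results
```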
- Toke Eskildsen, Royal Danish Library
Re: **SPAM** Re: SolrCloud scaling/optimization for high request rate
Posted by Ere Maijala <er...@helsinki.fi>.
From what I've gathered and what's been my experience docValues should
be enabled, but if you can't think of anything else, I'd try turning
them off to see if it makes any difference. As far as I can recall
turning them off will increase usage of Solr's own caches and that
caused noticeable slowdown for us, but your mileage may vary.
--Ere
Sofiya Strochyk kirjoitti 12.11.2018 klo 14.23:
> Thanks for the suggestion Ere. It looks like they are actually enabled;
> in schema file the field is only marked as stored (field name="_id"
> type="string" multiValued="false" indexed="true" required="true"
> stored="true") but the admin UI shows DocValues as enabled, so I guess
> this is by default. Is the solution to add "docValues=false" in the schema?
>
>
> On 12.11.18 10:43, Ere Maijala wrote:
>> Sofiya,
>>
>> Do you have docValues enabled for the id field? Apparently that can
>> make a significant difference. I'm failing to find the relevant
>> references right now, but just something worth checking out.
>>
>> Regards,
>> Ere
>>
>> Sofiya Strochyk kirjoitti 6.11.2018 klo 16.38:
>>> Hi Toke,
>>>
>>> sorry for the late reply. The query i wrote here is edited to hide
>>> production details, but I can post additional info if this helps.
>>>
>>> I have tested all of the suggested changes; none of these seem to make
>>> a noticeable difference (usually response time and other metrics
>>> fluctuate over time, and the changes caused by different parameters
>>> are smaller than the fluctuations). What this probably means is that
>>> the heaviest task is retrieving IDs by query and not fields by ID.
>>> I've also checked QTime logged for these types of operations, and it
>>> is much higher for "get IDs by query" than for "get fields by IDs
>>> list". What could be done about this?
>>>
>>> On 05.11.18 14:43, Toke Eskildsen wrote:
>>>> So far no answer from Sofiya. That's fair enough: My suggestions might
>>>> have seemed random. Let me try to qualify them a bit.
>>>>
>>>>
>>>> What we have to work with is the redacted query
>>>> q=<q expression>&fl=<full list of fields>&start=0&sort=<sort
>>>> expression>&fq=<fq expression>&rows=24&version=2.2&wt=json
>>>> and an earlier mention that sorting was complex.
>>>>
>>>> My suggestions were to try
>>>>
>>>> 1) Only request simple sorting by score
>>>>
>>>> If this improves performance substantially, we could try and see if
>>>> sorting could be made more efficient: Reducing complexity, pre-
>>>> calculating numbers etc.
>>>>
>>>> 2) Reduce rows to 0
>>>> 3) Increase rows to 100
>>>>
>>>> This measures one aspect of retrieval. If there is a big performance
>>>> difference between these two, we can further probe if the problem is
>>>> the number or size of fields - perhaps there is a ton of stored text,
>>>> perhaps there is a bunch of DocValued fields?
>>>>
>>>> 4) Set fl=id only
>>>>
>>>> This is a variant of 2+3 to do a quick check if it is the resolving of
>>>> specific field values that is the problem. If using fl=id speeds up
>>>> substantially, the next step would be to add fields gradually until
>>>> (hopefully) there is a sharp performance decrease.
>>>>
>>>> - Toke Eskildsen, Royal Danish Library
>>>>
>>>>
>>>
>>
>
--
Ere Maijala
Kansalliskirjasto / The National Library of Finland
Re: **SPAM** Re: SolrCloud scaling/optimization for high request rate
Posted by Sofiya Strochyk <ss...@interlogic.com.ua>.
Thanks for the suggestion Ere. It looks like they are actually enabled;
in the schema file the field is only marked as stored (<field name="_id"
type="string" multiValued="false" indexed="true" required="true"
stored="true"/>) but the admin UI shows DocValues as enabled, so I guess
this is the default. Is the solution to add docValues="false" in the schema?
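For reference, disabling them explicitly is a one-attribute change; a sketch only (attributes copied from the snippet above), and note that changing docValues requires a full reindex to take effect:

```xml
<!-- sketch: _id field with docValues explicitly disabled -->
<field name="_id" type="string" multiValued="false" indexed="true"
       required="true" stored="true" docValues="false"/>
```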
On 12.11.18 10:43, Ere Maijala wrote:
> Sofiya,
>
> Do you have docValues enabled for the id field? Apparently that can
> make a significant difference. I'm failing to find the relevant
> references right now, but just something worth checking out.
>
> Regards,
> Ere
>
> Sofiya Strochyk kirjoitti 6.11.2018 klo 16.38:
>> Hi Toke,
>>
>> sorry for the late reply. The query i wrote here is edited to hide
>> production details, but I can post additional info if this helps.
>>
>> I have tested all of the suggested changes; none of these seem to make
>> a noticeable difference (usually response time and other metrics
>> fluctuate over time, and the changes caused by different parameters
>> are smaller than the fluctuations). What this probably means is that
>> the heaviest task is retrieving IDs by query and not fields by ID.
>> I've also checked QTime logged for these types of operations, and it
>> is much higher for "get IDs by query" than for "get fields by IDs
>> list". What could be done about this?
>>
>> On 05.11.18 14:43, Toke Eskildsen wrote:
>>> So far no answer from Sofiya. That's fair enough: My suggestions might
>>> have seemed random. Let me try to qualify them a bit.
>>>
>>>
>>> What we have to work with is the redacted query
>>> q=<q expression>&fl=<full list of fields>&start=0&sort=<sort
>>> expression>&fq=<fq expression>&rows=24&version=2.2&wt=json
>>> and an earlier mention that sorting was complex.
>>>
>>> My suggestions were to try
>>>
>>> 1) Only request simple sorting by score
>>>
>>> If this improves performance substantially, we could try and see if
>>> sorting could be made more efficient: Reducing complexity, pre-
>>> calculating numbers etc.
>>>
>>> 2) Reduce rows to 0
>>> 3) Increase rows to 100
>>>
>>> This measures one aspect of retrieval. If there is a big performance
>>> difference between these two, we can further probe if the problem is
>>> the number or size of fields - perhaps there is a ton of stored text,
>>> perhaps there is a bunch of DocValued fields?
>>>
>>> 4) Set fl=id only
>>>
>>> This is a variant of 2+3 to do a quick check if it is the resolving of
>>> specific field values that is the problem. If using fl=id speeds up
>>> substantially, the next step would be to add fields gradually until
>>> (hopefully) there is a sharp performance decrease.
>>>
>>> - Toke Eskildsen, Royal Danish Library
>>>
>>>
>>
>
Re: SolrCloud scaling/optimization for high request rate
Posted by Ere Maijala <er...@helsinki.fi>.
Sofiya,
Do you have docValues enabled for the id field? Apparently that can make
a significant difference. I'm failing to find the relevant references
right now, but just something worth checking out.
Regards,
Ere
Sofiya Strochyk kirjoitti 6.11.2018 klo 16.38:
> Hi Toke,
>
> sorry for the late reply. The query i wrote here is edited to hide
> production details, but I can post additional info if this helps.
>
> I have tested all of the suggested changes; none of these seem to make a
> noticeable difference (usually response time and other metrics fluctuate
> over time, and the changes caused by different parameters are smaller
> than the fluctuations). What this probably means is that the heaviest
> task is retrieving IDs by query and not fields by ID. I've also checked
> QTime logged for these types of operations, and it is much higher for
> "get IDs by query" than for "get fields by IDs list". What could be done
> about this?
>
> On 05.11.18 14:43, Toke Eskildsen wrote:
>> So far no answer from Sofiya. That's fair enough: My suggestions might
>> have seemed random. Let me try to qualify them a bit.
>>
>>
>> What we have to work with is the redacted query
>> q=<q expression>&fl=<full list of fields>&start=0&sort=<sort
>> expression>&fq=<fq expression>&rows=24&version=2.2&wt=json
>> and an earlier mention that sorting was complex.
>>
>> My suggestions were to try
>>
>> 1) Only request simple sorting by score
>>
>> If this improves performance substantially, we could try and see if
>> sorting could be made more efficient: Reducing complexity, pre-
>> calculating numbers etc.
>>
>> 2) Reduce rows to 0
>> 3) Increase rows to 100
>>
>> This measures one aspect of retrieval. If there is a big performance
>> difference between these two, we can further probe if the problem is
>> the number or size of fields - perhaps there is a ton of stored text,
>> perhaps there is a bunch of DocValued fields?
>>
>> 4) Set fl=id only
>>
>> This is a variant of 2+3 to do a quick check if it is the resolving of
>> specific field values that is the problem. If using fl=id speeds up
>> substantially, the next step would be to add fields gradually until
>> (hopefully) there is a sharp performance decrease.
>>
>> - Toke Eskildsen, Royal Danish Library
>>
>>
>
Re: SolrCloud scaling/optimization for high request rate
Posted by Sofiya Strochyk <ss...@interlogic.com.ua>.
Hi Toke,
sorry for the late reply. The query i wrote here is edited to hide
production details, but I can post additional info if this helps.
I have tested all of the suggested changes; none of these seem to make a
noticeable difference (usually response time and other metrics fluctuate
over time, and the changes caused by different parameters are smaller
than the fluctuations). What this probably means is that the heaviest
task is retrieving IDs by query and not fields by ID. I've also checked
QTime logged for these types of operations, and it is much higher for
"get IDs by query" than for "get fields by IDs list". What could be done
about this?
On 05.11.18 14:43, Toke Eskildsen wrote:
> So far no answer from Sofiya. That's fair enough: My suggestions might
> have seemed random. Let me try to qualify them a bit.
>
>
> What we have to work with is the redacted query
> q=<q expression>&fl=<full list of fields>&start=0&sort=<sort
> expression>&fq=<fq expression>&rows=24&version=2.2&wt=json
> and an earlier mention that sorting was complex.
>
> My suggestions were to try
>
> 1) Only request simple sorting by score
>
> If this improves performance substantially, we could try and see if
> sorting could be made more efficient: Reducing complexity, pre-
> calculating numbers etc.
>
> 2) Reduce rows to 0
> 3) Increase rows to 100
>
> This measures one aspect of retrieval. If there is a big performance
> difference between these two, we can further probe if the problem is
> the number or size of fields - perhaps there is a ton of stored text,
> perhaps there is a bunch of DocValued fields?
>
> 4) Set fl=id only
>
> This is a variant of 2+3 to do a quick check if it is the resolving of
> specific field values that is the problem. If using fl=id speeds up
> substantially, the next step would be to add fields gradually until
> (hopefully) there is a sharp performance decrease.
>
> - Toke Eskildsen, Royal Danish Library
>
>
Re: Re: SolrCloud scaling/optimization for high request rate
Posted by Toke Eskildsen <to...@kb.dk>.
So far no answer from Sofiya. That's fair enough: My suggestions might
have seemed random. Let me try to qualify them a bit.
What we have to work with is the redacted query
q=<q expression>&fl=<full list of fields>&start=0&sort=<sort
expression>&fq=<fq expression>&rows=24&version=2.2&wt=json
and an earlier mention that sorting was complex.
My suggestions were to try
1) Only request simple sorting by score
If this improves performance substantially, we could try and see if
sorting could be made more efficient: Reducing complexity, pre-
calculating numbers etc.
2) Reduce rows to 0
3) Increase rows to 100
This measures one aspect of retrieval. If there is a big performance
difference between these two, we can further probe if the problem is
the number or size of fields - perhaps there is a ton of stored text,
perhaps there is a bunch of DocValued fields?
4) Set fl=id only
This is a variant of 2+3 to do a quick check if it is the resolving of
specific field values that is the problem. If using fl=id speeds up
substantially, the next step would be to add fields gradually until
(hopefully) there is a sharp performance decrease.
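These four runs are just parameter overrides on the redacted base query; a small Python sketch of the variant set (placeholders kept from the thread):

```python
BASE = {  # redacted base query from the thread
    "q": "<q expression>", "fq": "<fq expression>",
    "sort": "<sort expression>", "fl": "<full list of fields>",
    "start": "0", "rows": "24", "wt": "json",
}

def diagnostic_variants(base):
    """The four probe runs: sort-only, no rows, many rows, id-only fl."""
    return [
        {**base, "sort": "score desc"},  # 1) simple sort by score
        {**base, "rows": "0"},           # 2) measure pure matching
        {**base, "rows": "100"},         # 3) stress document retrieval
        {**base, "fl": "id"},            # 4) skip field resolution
    ]
```

Comparing QTime across these runs against the baseline separates sorting cost from retrieval cost.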
- Toke Eskildsen, Royal Danish Library
Re: Re: SolrCloud scaling/optimization for high request rate
Posted by Toke Eskildsen <to...@kb.dk>.
On Wed, 2018-10-31 at 13:42 +0200, Sofiya Strochyk wrote:
> q=<q expression>&fl=<full list of fields>&start=0&sort=<sort
> expression>&fq=<fq expression>&rows=24&version=2.2&wt=json
Not much to see here, perhaps because you are not allowed to share it?
Maybe we can try and isolate the cause? Could you try different runs,
where you change different components and tell us roughly how that
affects performance?
1) Only request simple sorting by score
2) Reduce rows to 0
3) Increase rows to 100
4) Set fl=id only
- Toke Eskildsen, Royal Danish Library
Re: Re: SolrCloud scaling/optimization for high request rate
Posted by Sofiya Strochyk <ss...@interlogic.com.ua>.
> The logfiles on your servers should be verbose enough to indicate what
> machines are handling which parts of the request.
Yes, generally i see the following entries in logs:
1. df=_text_&distrib=false&fl=_id&fl=score&shards.purpose=4&start=0&fsv=true&sort=<sort
expression>&fq=<fq expression>&shard.url=<shard IP and
path>&rows=24&version=2&q=<q
expression>&NOW=1540984948280&isShard=true&wt=javabin
2. df=_text_&distrib=false&fl=<full list of
fields>&shards.purpose=64&start=0&fq=<fq
expression>&shard.url=<shard IP and path>&rows=24&version=2&q=<q
expression>&NOW=1540984948280&ids=<list of IDs>&isShard=true&wt=javabin
3. q=<q expression>&fl=<full list of fields>&start=0&sort=<sort
expression>&fq=<fq expression>&rows=24&version=2.2&wt=json
Request type #3 (full request) is seen only 1 time across all shards,
and I suppose it is the original/aggregated request. The shard is
different every time, so this means load balancing is working.
Request #1 (get IDs by query) is always present for one replica of each
shard.
Request #2 (get fields by IDs) is, however, sometimes missing even
though request #1 has a non-zero number of hits for that shard. But I
don't know whether this indicates a problem or it is working as expected.
> Only one attachment made it to the list. I'm surprised that ANY of
> them made it -- usually they don't. Generally you need to use a file
> sharing website and provide links. Dropbox is one site that works
> well. Gist might also work.
>
> The GC log that made it through (solr_gc.log.7.1) is only two minutes
> long. Nothing useful can be learned from a log that short. It is
> also missing the information at the top about the JVM that created it,
> so I'm wondering if you edited the file so it was shorter before
> including it.
>
> Thanks,
> Shawn
You are right, sorry, I didn't know this :)
(there is a 1MB limitation on attachments which is why i trimmed the log)
Here are the full GC logs: 1
<https://www.dropbox.com/s/zrv6221vbj9pwnp/solr_gc.log.7?dl=0> 2
<https://www.dropbox.com/s/gdiusi0a9w95a4w/solr_gc.log.8?dl=0>
and images: 1 <https://upload.cc/i1/2018/10/31/EMvelJ.png> 2
<https://upload.cc/i1/2018/10/31/xiNwDW.png>3
<https://upload.cc/i1/2018/10/31/fLH09Y.png>
Re: SolrCloud scaling/optimization for high request rate
Posted by Shawn Heisey <ap...@elyograg.org>.
On 10/29/2018 7:24 AM, Sofiya Strochyk wrote:
> Actually the smallest server doesn't look bad in terms of performance,
> it has been consistently better than the other ones (without
> replication) which seems a bit strange (it should be about the same or
> slightly worse, right?). I guess the memory being smaller than index
> doesn't cause problems due to the fact that we use SSDs.
SSD, while fast, is nowhere near as fast as main memory. As I said, the
memory numbers might cause performance problems, or they might not.
Glad you're in the latter category.
> What if we are sending requests to machine which is part of the
> cluster but doesn't host any shards? Does it handle the initial
> request and merging of the results, or this has to be handled by one
> of the shards anyway?
> Also i was thinking "more shards -> each shard searches smaller set of
> documents -> search is faster". Or is the overhead for merging results
> bigger than overhead from searching larger set of documents?
If every shard is on its own machine, many shards might not be a
performance bottleneck with a high query rate. The more shards you
have, the more work the machine doing the aggregation must do to produce results.
SolrCloud complicates the situation further. It normally does load
balancing of all requests that come in across the cloud. So the machine
handling the request might not be the machine where you SENT the request.
>> Very likely the one with a higher load is the one that is aggregating
>> shard requests for a full result.
> Is there a way to confirm this? Maybe the aggregating shard is going
> to have additional requests in its solr.log?
The logfiles on your servers should be verbose enough to indicate what
machines are handling which parts of the request.
>> Most Solr performance issues are memory related. With an extreme
>> query rate, CPU can also be a bottleneck, but memory will almost
>> always be the bottleneck you run into first.
> This is the advice i've seen often, but how exactly can we run out of
> memory if total RAM is 128, heap is 8 and index size is 80. Especially
> since node with 64G runs just as fine if not better.
Even when memory is insufficient, "running out" of memory generally
doesn't happen unless the heap is too small. Java will work within the
limits imposed by the system if it can. For OS disk cache, the OS tries
to be as smart as it can about which data stays in the cache and which
data is discarded.
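For the numbers in this thread (8GB heap, roughly 80GB of index per node), the rough disk-cache arithmetic looks like this:

```python
# Rough OS disk-cache budget per node: RAM minus the Java heap.
# Numbers taken from earlier in this thread: 8GB heap, ~80GB index/node.
HEAP_GB = 8
INDEX_GB = 80

def cache_budget(ram_gb, heap_gb=HEAP_GB):
    """Memory left for the OS page cache, assuming nothing else
    significant runs on the box besides Solr."""
    return ram_gb - heap_gb

for ram in (128, 64, 32):
    free = cache_budget(ram)
    pct = 100 * min(free, INDEX_GB) / INDEX_GB
    print(f"{ram}GB node: ~{free}GB for cache, covers ~{pct:.0f}% of the index")
```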
>> A lot of useful information can be obtained from the GC logs that
>> Solr's built-in scripting creates. Can you share these logs?
>>
>> The screenshots described here can also be very useful for
>> troubleshooting:
>>
>> https://wiki.apache.org/solr/SolrPerformanceProblems#Asking_for_help_on_a_memory.2Fperformance_issue
> I have attached some GC logs and screenshots, hope these are helpful
> (can only attach small files)
Only one attachment made it to the list. I'm surprised that ANY of them
made it -- usually they don't. Generally you need to use a file sharing
website and provide links. Dropbox is one site that works well. Gist
might also work.
The GC log that made it through (solr_gc.log.7.1) is only two minutes
long. Nothing useful can be learned from a log that short. It is also
missing the information at the top about the JVM that created it, so I'm
wondering if you edited the file so it was shorter before including it.
Thanks,
Shawn
Re: Re: SolrCloud scaling/optimization for high request rate
Posted by Sofiya Strochyk <ss...@interlogic.com.ua>.
Hi Shawn,
On 27.10.18 09:28, Shawn Heisey wrote:
>
> With 80GB of index data and one node that only has 64GB of memory, the
> full index won't fit into memory on that one server. With
> approximately 56GB of memory (assuming there's nothing besides Solr
> running on these servers and the size of all Java heaps on the system
> is 8GB) to cache 80GB of index data, performance might be good. Or it
> might be terrible. It's impossible to predict effectively.
>
Actually the smallest server doesn't look bad in terms of performance;
it has been consistently better than the other ones (without
replication), which seems a bit strange (it should be about the same or
slightly worse, right?). I guess the memory being smaller than the index
doesn't cause problems due to the fact that we use SSDs.
> I'm not sure that an 8GB heap is large enough. Especially given what
> you said later about experiencing OOM and seeing a lot of full GCs.
Increasing the heap is not a problem, but it doesn't seem to make any
difference.
> If you can reduce the number of shards, the amount of work involved
> for a single request will decrease, which MIGHT increase the queries
> per second your hardware can handle. With four shards, one query
> typically is actually 9 requests.
>
> Unless your clients are all Java-based, to avoid a single point of
> failure, you need a load balancer as well. (The Java client can talk
> to the entire SolrCloud cluster and wouldn't need a load balancer)
What if we are sending requests to a machine which is part of the
cluster but doesn't host any shards? Does it handle the initial request
and merging of the results, or does this have to be handled by one of
the shards anyway?
Also I was thinking "more shards -> each shard searches a smaller set of
documents -> search is faster". Or is the overhead for merging results
bigger than the overhead from searching a larger set of documents?
> Very likely the one with a higher load is the one that is aggregating
> shard requests for a full result.
Is there a way to confirm this? Maybe the aggregating shard is going to
have additional requests in its solr.log?
> Most Solr performance issues are memory related. With an extreme
> query rate, CPU can also be a bottleneck, but memory will almost
> always be the bottleneck you run into first.
This is the advice I've seen often, but how exactly can we run out of
memory if total RAM is 128GB, heap is 8GB and index size is 80GB?
Especially since the node with 64GB runs just as well, if not better.
> A lot of useful information can be obtained from the GC logs that
> Solr's built-in scripting creates. Can you share these logs?
>
> The screenshots described here can also be very useful for
> troubleshooting:
>
> https://wiki.apache.org/solr/SolrPerformanceProblems#Asking_for_help_on_a_memory.2Fperformance_issue
>
I have attached some GC logs and screenshots, hope these are helpful
(can only attach small files)
--
Sofiia Strochyk
ss@interlogic.com.ua
InterLogic
www.interlogic.com.ua
Re: SolrCloud scaling/optimization for high request rate
Posted by Shawn Heisey <ap...@elyograg.org>.
On 10/26/2018 9:55 AM, Sofiya Strochyk wrote:
>
> We have a SolrCloud setup with the following configuration:
>
I'm late to this party. You've gotten some good replies already. I
hope I can add something useful.
> * 4 nodes (3x128GB RAM Intel Xeon E5-1650v2, 1x64GB RAM Intel Xeon
> E5-1650v2, 12 cores, with SSDs)
> * One collection, 4 shards, each has only a single replica (so 4
> replicas in total), using compositeId router
> * Total index size is about 150M documents/320GB, so about 40M/80GB
> per node
>
With 80GB of index data and one node that only has 64GB of memory, the
full index won't fit into memory on that one server. With approximately
56GB of memory (assuming there's nothing besides Solr running on these
servers and the size of all Java heaps on the system is 8GB) to cache
80GB of index data, performance might be good. Or it might be
terrible. It's impossible to predict effectively.
> * Heap size is set to 8GB.
>
I'm not sure that an 8GB heap is large enough. Especially given what
you said later about experiencing OOM and seeing a lot of full GCs.
If properly tuned, the G1 collector is overall more efficient than CMS,
but CMS can be quite good. If GC is not working well with CMS, chances
are that switching to G1 will not help. The root problem is likely to
be something that a different collector can't fix -- like the heap being
too small.
I wrote the page you referenced for GC tuning. I have *never* had a
single problem using G1 with Solr.
> Target query rate is up to 500 qps, maybe 300, and we need to keep
> response time at <200ms. But at the moment we only see very good
> search performance with up to 100 requests per second. Whenever it
> grows to about 200, average response time abruptly increases to 0.5-1
> second. (Also it seems that request rate reported by SOLR in admin
> metrics is 2x higher than the real one, because for every query, every
> shard receives 2 requests: one to obtain IDs and second one to get
> data by IDs; so target rate for SOLR metrics would be 1000 qps).
>
Getting 100 requests per second on a single replica is quite good,
especially with a sharded index. I never could get performance like
that. To handle hundreds of requests per second, you need several replicas.
If you can reduce the number of shards, the amount of work involved for
a single request will decrease, which MIGHT increase the queries per
second your hardware can handle. With four shards, one query typically
is actually 9 requests.
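That "9 requests" figure can be counted directly, assuming the usual two distributed phases (first fetch ids/scores, then fetch stored fields) plus the initial inbound request:

```python
def requests_per_query(shards, phases=2):
    """One inbound request to the aggregating node, plus `phases`
    sub-requests to each shard: phase 1 collects ids and scores,
    phase 2 retrieves the documents for the winning ids."""
    return 1 + phases * shards

print(requests_per_query(4))  # -> 9, the figure for a 4-shard collection
print(requests_per_query(2))  # -> 5 with two shards
```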
Unless your clients are all Java-based, to avoid a single point of
failure, you need a load balancer as well. (The Java client can talk to
the entire SolrCloud cluster and wouldn't need a load balancer)
What you are seeing where there is a sharp drop in performance from a
relatively modest load increase is VERY common. This is the way that
almost all software systems behave when faced with extreme loads.
Search for "knee" on this page:
https://www.oreilly.com/library/view/the-art-of/9780596155858/ch04.html
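The knee shape can be illustrated with a simple M/M/1 queueing model. The numbers below are purely illustrative, not Solr measurements: response time stays almost flat until utilization approaches 1, then shoots up abruptly.

```python
def mm1_response_time(arrival_rate, service_rate):
    """Mean response time in an M/M/1 queue: T = 1 / (mu - lambda).
    Only defined while arrival_rate < service_rate."""
    assert arrival_rate < service_rate
    return 1.0 / (service_rate - arrival_rate)

service = 250.0  # hypothetical qps a single node could service
for load in (100, 200, 240, 249):
    t_ms = 1000 * mm1_response_time(load, service)
    print(f"{load:>3} qps -> mean response ~{t_ms:.0f} ms")
```

Note how going from 200 to 240 qps (a 20% load increase) multiplies the modeled response time by five, which is the "knee" behavior described above.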
> During high request load, CPU usage increases dramatically on the SOLR
> nodes. It doesn't reach 100% but averages at 50-70% on 3 servers and
> about 93% on 1 server (random server each time, not the smallest one).
>
Very likely the one with a higher load is the one that is aggregating
shard requests for a full result.
> The documentation mentions replication to spread the load between the
> servers. We tested replicating to smaller servers (32GB RAM, Intel
> Core i7-4770). However, when we tested it, the replicas were going out
> of sync all the time (possibly during commits) and reported errors
> like "PeerSync Recovery was not successful - trying replication." Then
> they proceed with replication which takes hours and the leader handles
> all requests singlehandedly during that time. Also both leaders and
> replicas started encountering OOM errors (heap space) for unknown reason.
>
With only 32GB of memory, assuming 8GB is allocated to the heap, there's
only 24GB to cache the 80GB of index data. That's not enough, and
performance would be MUCH worse than your 64GB or 128GB machines.
I would suspect extreme GC pauses and/or general performance issues from
not enough cache memory to be the root cause of the sync and recovery
problems.
> Heap dump analysis shows that most of the memory is consumed by [J
> (array of long) type, my best guess would be that it is "_version_"
> field, but it's still unclear why it happens.
>
I'm not familiar enough with how Lucene allocates memory internally to
have any hope of telling you exactly what that memory structure is.
> Also, even though with replication request rate and CPU usage drop 2
> times, it doesn't seem to affect mean_ms, stddev_ms or p95_ms numbers
> (p75_ms is much smaller on nodes with replication, but still not as
> low as under load of <100 requests/s).
>
> Garbage collection is much more active during high load as well. Full
> GC happens almost exclusively during those times. We have tried tuning
> GC options like suggested here
> <https://wiki.apache.org/solr/ShawnHeisey#CMS_.28ConcurrentMarkSweep.29_Collector>
> and it didn't change things though.
>
Symptoms like that generally mean that your heap is too small and needs
to be increased.
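If the heap does turn out to be too small, raising it is a one-line change in solr.in.sh. The 16g value here is only an illustration, not a recommendation; the right size should come out of GC log analysis:

```shell
# In solr.in.sh (e.g. /etc/default/solr.in.sh on a service install),
# raise the JVM heap. 16g is an illustrative value only.
SOLR_HEAP="16g"
```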
> * How do we increase throughput? Is replication the only solution?
>
Ensuring there's enough memory for caching is the first step. But that
can only take you so far. Dealing with the very high query rate you've
got will require multiple replicas.
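A rough capacity estimate from the numbers in this thread (about 100 qps is comfortable per copy of the index, 500 qps is the target) gives a lower bound on replica count:

```python
import math

def replicas_needed(target_qps, qps_per_replica):
    """Crude linear-scaling estimate. Real throughput rarely scales
    perfectly linearly, so treat the result as a lower bound."""
    return math.ceil(target_qps / qps_per_replica)

print(replicas_needed(500, 100))  # -> 5 copies of each shard for the 500 qps target
print(replicas_needed(300, 100))  # -> 3 for the lower 300 qps estimate
```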
> * if yes - then why doesn't it affect response times, considering
> that CPU is not 100% used and index fits into memory?
>
Hard to say without an in-depth look. See the end of my reply.
> * How to deal with OOM and replicas going into recovery?
>
There are precisely two ways to deal with OOM. One is to increase the
size of the resource that's depleted. The other is to change things so
that the program doesn't require as much of that resource. The second
option is frequently not possible.
> * Is memory or CPU the main problem? (When searching on the
> internet, I never see CPU as the main bottleneck for SOLR, but our
> case might be different)
>
Most Solr performance issues are memory related. With an extreme query
rate, CPU can also be a bottleneck, but memory will almost always be the
bottleneck you run into first.
> * Or do we need smaller shards? Could segments merging be a problem?
>
Smaller shards really won't make much difference in segment merging,
unless the size reduction is *EXTREME* -- switching to a VERY large
number of shards.
If you increase the numbers in your merge policy, then merging will
happen less frequently. The config that I chose to use was 35 for
maxMergeAtOnce and segmentsPerTier, with 105 for
maxMergeAtOnceExplicit. The disadvantage to this is that your indexes
will have a LOT more files in them, so it's much easier to run into an
open file limit in the OS.
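With those numbers, a solrconfig.xml fragment might look roughly like this (TieredMergePolicyFactory is the default merge policy factory in modern Solr; the values are the ones quoted above, and whether they suit your index is something to test):

```xml
<indexConfig>
  <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
    <int name="maxMergeAtOnce">35</int>
    <int name="segmentsPerTier">35</int>
    <int name="maxMergeAtOnceExplicit">105</int>
  </mergePolicyFactory>
</indexConfig>
```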
> * How to add faceting without search queries slowing down too much?
>
As Erick said ... this isn't possible. To handle the query load you've
mentioned *with* facets will require even more replicas. Facets require
more heap memory, more CPU resources, and are likely to access more of
the index data -- which means having plenty of cache memory is even more
important.
> * How to diagnose these problems and narrow down to the real reason
> in hardware or setup?
>
A lot of useful information can be obtained from the GC logs that Solr's
built-in scripting creates. Can you share these logs?
The screenshots described here can also be very useful for troubleshooting:
https://wiki.apache.org/solr/SolrPerformanceProblems#Asking_for_help_on_a_memory.2Fperformance_issue
Thanks,
Shawn