Posted to solr-user@lucene.apache.org by Gael Jourdan-Weil <ga...@kelkoogroup.com> on 2019/02/19 12:56:25 UTC

RE: Re: Delayed/waiting requests

Quick update just in case someone comes across this thread someday: we did lower the autowarm count, but it didn't have any effect on the performance issues we are seeing.

We are still investigating...

Regards,
Gaël

________________________________
From: Gael Jourdan-Weil
Sent: Tuesday, January 15, 2019 18:33
To: solr-user@lucene.apache.org
Subject: RE: Re: Delayed/waiting requests


@Erick:


We will try to lower the autowarm and run some tests to compare.

If I get your point, having a big cache might cause more trouble than help if the cache hit ratio is not high enough, because the cache is constantly evicting/inserting entries?



@Jeremy:


Index size: ~20G and ~14M documents

Server memory available: 256G, of which ~30G used and ~100G system cache

Server CPU count: 32, ~10% usage

JVM memory settings: -Xms12G -Xmx12G


We have 3 servers and 3 clusters of 3 Solr instances.

That is, each server hosts 1 Solr instance for each cluster.

And, indeed, each cluster only has 1 shard with replication factor 3.


Among all these Solr instances, the pauses are observed on only a single cluster, but on every server, at different times (sometimes on all servers at the same time, but I would say that's very rare).

We do observe the traffic is evenly balanced across the 3 servers, around 30-40 queries per second sent to each server.



Regards,

Gaël


________________________________
From: Branham, Jeremy (Experis) <jb...@allstate.com>
Sent: Tuesday, January 15, 2019 17:59:56
To: solr-user@lucene.apache.org
Subject: Re: Re: Delayed/waiting requests

Hi Gael –

Could you share this information?
Size of the index
Server memory available
Server CPU count
JVM memory settings

You mentioned a cloud configuration of 3 replicas.
Does that mean you have 1 shard with a replication factor of 3?
Do the pauses occur on all 3 servers?
Is the traffic evenly balanced across those servers?


Jeremy Branham
jbrcm@allstate.com


On 1/15/19, 9:50 AM, "Erick Erickson" <er...@gmail.com> wrote:

    Well, it was a nice theory anyway.

    "Other collections with the same settings"
    doesn't really mean much unless those other collections are very similar,
    especially in terms of numbers of docs.

    You should only see a new searcher opening when you do a
    hard-commit-with-opensearcher-true or soft commit.

    So what happens when you just try lowering the autowarm
    count? I'm assuming you're free to test in some non-prod
    system.

    Focusing on the hit ratio is something of a red herring. Remember
    that each entry in your filterCache is roughly maxDoc/8 bytes + a
    little overhead; the increase in GC pressure has to be balanced
    against getting the hits from the cache.
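
    For a rough sense of scale (hypothetical numbers, purely to illustrate
    the formula above): with 10M documents, a full-bitset entry is about
    10,000,000 / 8 = 1.25 MB, so a filterCache holding 12K such entries
    could occupy on the order of 15 GB in the worst case. Small result
    sets are stored more compactly, but a huge cache still eats a lot
    of heap.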

    Now, all that said, if there's no correlation, then you need to put
    a profiler on the system when you see this kind of thing and
    find out where the hotspots are; otherwise it's guesswork and
    I'm out of ideas.

    Best,
    Erick

    On Tue, Jan 15, 2019 at 12:06 AM Gael Jourdan-Weil
    <ga...@kelkoogroup.com> wrote:
    >
    > Hi Erick,
    >
    >
    > Thank you for your detailed answer, I better understand autowarming.
    >
    >
    > We have an autowarming time of ~10s for the filterCache (the queryResultCache is not used at all, hit ratio = 0.02).
    >
    > We increased the size of the filterCache from 6k to 12k (with the autowarming size set to the same value) to get a better hit ratio, which is _only_ around 0.85/0.90.
    >
    >
    > The thing I don't understand is that I should see "Opening new searcher" in the logs every time a new searcher is opened, and thus an autowarming happens, right?
    >
    > But I don't see "Opening new searcher" very often, and I don't see it being correlated with the response time peaks.
    >
    >
    > Also, I didn't mention it earlier, but we have other SolrCloud clusters with similar settings and load (~10s filterCache autowarming, 10k entries) and we don't observe the same behavior.
    >
    >
    > Regards,
    >
    > ________________________________
    > From: Erick Erickson <er...@gmail.com>
    > Sent: Monday, January 14, 2019 17:44:38
    > To: solr-user
    > Subject: Re: Delayed/waiting requests
    >
    > Gael:
    >
    > bq. Nevertheless, our filterCache is set to autowarm 12k entries which
    > is also the maxSize
    >
    > That is far, far, far too many. Let's assume you actually have 12K
    > entries in the filterCache.
    > Every time you open a new searcher, 12K queries are executed _before_
    > the searcher
    > accepts any new requests. While being able to re-use a filterCache
    > entry is useful, one of
    > the primary purposes is to pre-load index data from disk into memory
    > which can be
    > the event that takes the most time.
    >
    > The queryResultCache has a similar function. I often find that this
    > cache doesn't have a
    > very high hit ratio, but again executing a _few_ of these queries
    > warms the index from
    > disk.
    >
    > I think of both caches as a map, where the key is the "thing" (the fq
    > clause in the case of the filterCache, the whole query in the case of
    > the queryResultCache). Autowarming replays the most recently executed
    > N of these entries, essentially just as though they were submitted by
    > a user.
    >
    > Hypothesis: You're massively over-warming, and when that kicks in you're seeing
    > increased CPU and GC pressure leading to the anomalies you're seeing. Further,
    > you have such excessive autowarming going on that it's hard to see the
    > associated messages in the log.
    >
    > Here's what I'd recommend: set your autowarm counts to something on the
    > order of 16. If the culprit is just excessive autowarming, I'd expect your
    > spikes to be much less severe. It _might_ be that your users see some
    > increased (very temporary) variance in response time. You can tell that
    > autowarming configuration is "more art than science"; I can't give you any
    > recommendation other than "start small and increase until you're happy",
    > unfortunately.
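    >
    > For illustration only, the relevant piece of solrconfig.xml would look
    > something like this (the size value is just a placeholder; the point is
    > the small autowarmCount):
    >
    >     <filterCache class="solr.FastLRUCache"
    >                  size="12288"
    >                  initialSize="512"
    >                  autowarmCount="16"/>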
    >
    > I usually do this with some kind of load tester in a dev lab of course ;).
    >
    > Finally, if you use the metrics data (see:
    > https://lucene.apache.org/solr/guide/7_1/metrics-reporting.html)
    > you can see the autowarm times. Don't get too lost in the page to
    > start; just hit the "http://localhost:8983/solr/admin/metrics" endpoint
    > and look for "warmupTime", then refine on how to get _only_
    > the warmup stats ;).
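    >
    > For example, a quick first look (host/port as appropriate for your setup):
    >
    >     curl "http://localhost:8983/solr/admin/metrics?wt=json&indent=on" | grep warmupTime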
    >
    > Best,
    > Erick
    >
    > On Mon, Jan 14, 2019 at 5:08 AM Gael Jourdan-Weil
    > <ga...@kelkoogroup.com> wrote:
    > >
    > > I had a look at the GC logs this morning, but I'm not sure how to interpret them.
    > >
    > >
    > > Over a period of 54 minutes, there is:
    > >
    > > - Number of pauses: 2739
    > >
    > > - Accumulated pauses: 93s => that is 2.86% of the time
    > >
    > > - Average pause duration: 0.03s
    > >
    > > - Average pause interval: 1.18s
    > >
    > > - Accumulated full GC: 0
    > >
    > > I'm not sure if this is a lot or not. What do you think?
    > >
    > >
    > > Looking more closely at the GC logs with GCViewer, I can see that the high response time peaks happen at the same times as GC pauses that take about twice as long as average (around 0.06s).
    > >
    > >
    > > Also, we are indeed indexing at the same time, but we have autowarming set.
    > >
    > > I don't see any Searcher opened at the time we experience slowness.
    > >
    > > Nevertheless, our filterCache is set to autowarm 12k entries, which is also the maxSize.
    > >
    > > Could this have any downside?
    > >
    > >
    > > Thanks,
    > >
    > > Gaël
    > >
    > >
    > > ________________________________
    > > From: Erick Erickson <er...@gmail.com>
    > > Sent: Friday, January 11, 2019 17:21
    > > To: solr-user
    > > Subject: Re: Delayed/waiting requests
    > >
    > > Jimi's comment is one of the very common culprits.
    > >
    > > Autowarming is another. Are you indexing at the same
    > > time? If so, it could well be that you aren't autowarming and
    > > the spikes are caused by using a new IndexSearcher
    > > that has to read much of the index off disk when commits
    > > happen. The "smoking gun" here would be if the spikes
    > > correlate to your commits (soft or hard-with-opensearcher-true).
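    > >
    > > For reference, these are the solrconfig.xml settings involved (the values
    > > below are illustrative, not a recommendation):
    > >
    > >     <autoCommit>
    > >       <maxTime>60000</maxTime>
    > >       <openSearcher>false</openSearcher>
    > >     </autoCommit>
    > >     <autoSoftCommit>
    > >       <maxTime>30000</maxTime>
    > >     </autoSoftCommit>
    > >
    > > A soft commit, or a hard commit with openSearcher=true, is what opens a
    > > new searcher and therefore triggers autowarming.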
    > >
    > > Best,
    > > Erick
    > >
    > > On Fri, Jan 11, 2019 at 1:23 AM Gael Jourdan-Weil
    > > <ga...@kelkoogroup.com> wrote:
    > > >
    > > > Interesting indeed, we did not see anything with VisualVM, but having a look at the GC logs could give us more info, especially on the pauses.
    > > >
    > > > I will collect data over the weekend and look at it.
    > > >
    > > >
    > > > Thanks
    > > >
    > > > ________________________________
    > > > From: Hullegård, Jimi <ji...@svensktnaringsliv.se>
    > > > Sent: Friday, January 11, 2019 03:46:02
    > > > To: solr-user@lucene.apache.org
    > > > Subject: Re: Delayed/waiting requests
    > > >
    > > > Could be caused by garbage collection in the jvm.
    > > >
    > > > https://wiki.apache.org/solr/SolrPerformanceProblems
    > > >
    > > > Go down to the section called “GC pause problems”
    > > >
    > > > /Jimi
    > > >
    > > > Sent from my iPhone
    > > >
    > > > On 11 Jan 2019, at 05:05, Gael Jourdan-Weil <ga...@kelkoogroup.com> wrote:
    > > >
    > > > Hello,
    > > >
    > > > We are experiencing some performance issues on a simple SolrCloud cluster of 3 replicas (1 core), but what we found during our analysis seems a bit odd, so we thought the community might have relevant ideas on this.
    > > >
    > > > Load: between 30 and 40 queries per second, constant over the period of analysis
    > > >
    > > > Symptoms: high response times over short periods of time, but quite frequently.
    > > > We are talking about request response times going from 50ms to 5000ms or even worse for less than 5 seconds, and then going back to normal.
    > > >
    > > > What we found out: just before the response time increases, requests seem to be delayed.
    > > > That is, for 2-3 seconds requests pile up, no response is sent, and then all requests are resolved and the responses are all returned to the clients at the same time.
    > > > Very much as if there were a lock happening somewhere. But we found no "lock" time at either the JVM or the system level.
    > > >
    > > > Can anyone think of something in the way Solr works that could explain this?
    > > > Or have ideas to track down the root cause?
    > > >
    > > > Solr version is 7.2.1.
    > > >
    > > > Thanks for reading,
    > > >
    > > > Gaël Jourdan-Weil
    > > >
    > > > Svenskt Näringsliv processes your personal data in accordance with the GDPR. Here you can read more about our processing and your rights: Integritetspolicy<https://www.svensktnaringsliv.se/dataskydd/integritet-och-behandling-av-personuppgifter_697219.html?utm_source=sn-email&utm_medium=email>