Posted to dev@lucene.apache.org by Alex Jablonski <aj...@thoughtworks.com> on 2019/11/20 19:36:22 UTC

Solr Prometheus Exporter failing with "Connection pool shut down" on large cluster

Hi there!

My colleague and I have run into an issue that seems to appear when running
the Solr Prometheus exporter in SolrCloud mode against a large (> 100 node)
cluster. The symptoms we're observing are "connection pool shut down"
exceptions in the logs and the inability to collect metrics from more than
100 nodes in the cluster.

We think we've traced down the issue to
lucene-solr/solr/contrib/prometheus-exporter/src/java/org/apache/solr/prometheus/scraper/SolrCloudScraper.java
. In that class, hostClientCache exists as a cache of HttpSolrClients
(currently having fixed size 100) that, on evicting a client from the
cache, closes the client's connection. The hostClientCache is used in
createHttpSolrClients to return a map of base URLs to HttpSolrClients.

Given, say, 300 base URLs, createHttpSolrClients will happily add those
base URLs to the cache, and the "get" method on the cache will happily
return the new additions to the cache. But on adding the 101st
HttpSolrClient to the cache, the first HttpSolrClient gets evicted and
closed. This repeats itself until the only open clients we have are to base
URLs 201 through 300; clients for the first 200 base URLs will be returned,
but will already have been closed. When we later use the result of
createHttpSolrClients to collect metrics, expecting valid and open
HttpSolrClients, we fail to connect when using any of those clients that
have already been closed, leading to the "Connection pool shut down"
exception and not collecting metrics from those nodes.
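
For anyone who wants to see the mechanism in isolation, here is a minimal,
self-contained sketch of the pattern we're describing (a Guava cache with a
fixed maximum size and a close-on-removal listener); the class names and
numbers are ours for illustration, not the exporter's actual code:

import java.util.HashMap;
import java.util.Map;

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;
import com.google.common.cache.RemovalListener;

public class EvictionSketch {

  // Stand-in for HttpSolrClient so the sketch has no Solr dependency.
  static class FakeClient {
    volatile boolean closed = false;
    void close() { closed = true; }
  }

  public static void main(String[] args) throws Exception {
    // Evicted entries get closed, mirroring the cache in SolrCloudScraper.
    RemovalListener<String, FakeClient> closeOnRemoval =
        notification -> notification.getValue().close();

    LoadingCache<String, FakeClient> hostClientCache = CacheBuilder.newBuilder()
        .maximumSize(100)
        .removalListener(closeOnRemoval)
        .build(new CacheLoader<String, FakeClient>() {
          @Override
          public FakeClient load(String baseUrl) {
            return new FakeClient();
          }
        });

    // Simulate createHttpSolrClients() building a map for 300 base URLs.
    Map<String, FakeClient> clients = new HashMap<>();
    for (int i = 1; i <= 300; i++) {
      String baseUrl = "http://solr-" + i + ":8983/solr";
      clients.put(baseUrl, hostClientCache.get(baseUrl));
    }

    // Roughly 200 of the 300 clients in the returned map have already been
    // closed; any later request through them is what surfaces as
    // "Connection pool shut down" in the real exporter.
    long closed = clients.values().stream().filter(c -> c.closed).count();
    System.out.println("closed clients in returned map: " + closed);
  }
}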

Our idea for a fix was to change the existing cache to use `expireAfterAccess`
with a timeout that's a multiple of the scrape interval (twice the scrape
interval?) instead of a fixed maximum size; a rough sketch follows the
questions below. We wanted to confirm a few things:

1. Has this issue been reported before, and if so, is there another fix in
progress already?
2. Does this approach seem desirable?
3. If so, are there any opinions on what the cache timeout should be
besides just double the scrape interval?
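
To make the proposal concrete, the sketch below shows the shape of the change
we have in mind (the names and the client construction are illustrative; the
actual wiring inside SolrCloudScraper differs, and the exact timeout is
exactly what we're asking about in question 3):

import java.io.IOException;
import java.util.concurrent.TimeUnit;

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;
import com.google.common.cache.RemovalListener;

import org.apache.solr.client.solrj.impl.HttpSolrClient;

class HostClientCacheSketch {

  static LoadingCache<String, HttpSolrClient> build(long scrapeIntervalSeconds) {
    // Still close clients when they leave the cache, but only after they have
    // gone unused for a while, rather than because a size cap was hit.
    RemovalListener<String, HttpSolrClient> closeOnRemoval = notification -> {
      try {
        notification.getValue().close();
      } catch (IOException e) {
        // Ignored: the client is being discarded anyway.
      }
    };

    return CacheBuilder.newBuilder()
        // was: .maximumSize(100), which evicted (and closed) live clients
        // on clusters with more than ~100 nodes
        .expireAfterAccess(2 * scrapeIntervalSeconds, TimeUnit.SECONDS)
        .removalListener(closeOnRemoval)
        .build(new CacheLoader<String, HttpSolrClient>() {
          @Override
          public HttpSolrClient load(String baseUrl) {
            return new HttpSolrClient.Builder(baseUrl).build();
          }
        });
  }
}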

We'll also open a PR shortly with the changes we're proposing and link
here. Please let me know if any of the above is unclear or incorrect.

Thanks!
Alex Jablonski

Re: Solr Prometheus Exporter failing with "Connection pool shut down" on large cluster

Posted by Richard Goodman <ri...@brandwatch.com>.
Sorry, I meant to say, Hey Alex

It's because this time around I did reply all, and Serj's name was
highlighted in my email!

On Fri, 22 Nov 2019 at 17:17, Richard Goodman <ri...@brandwatch.com>
wrote:

> Hey Serj,
>
> So I've applied your patch, deployed the latest build of the
> solr-prometheus exporter, and left it running for a little bit. It's had
> some interesting results.
>
> As I mentioned, I split up my config from 1 monolithic config to separate
> configs, and have an instance running for each one *(they are split up by
> groups defined in the /solr/admin/metrics endpoint)*, for more
> information, I have split them up via the following:
>
>    - Core
>    - Node
>    - JVM
>    - Jetty
>
> And I was getting the connection pool shut down, for every single type.
> However, this is not the case any more.
>
> *Core*
> Whilst I am no longer getting Connection Pool Shutdown, I am now getting
> the likes of;
> ERROR - 2019-11-22 16:44:17.522;
> org.apache.solr.prometheus.scraper.SolrScraper; failed to request:
> /admin/metrics Timeout occured while waiting response from server at:
> http://127.0.0.1:8085/solr
>
> Because of this, I am now no longer collecting any core metrics at all.
> I'm yet to figure out in the code where this error is reported, and seeing
> if anything can help with this.
>
> *Node*
> I did get a NPE that I've never seen before, however, this appeared to be
> a 1 off, and now this is also reporting 0 connection pool shut down issues,
> which I was getting previously. I'll attach a dump of the node NPE below as
> like an "appendix"
>
> *JVM*
> Completely seems to resolve the issue, have no longer been getting it.
>
> *Jetty*
> Unfortunately here, I am still getting connection pool shut down errors,
> again I will upload a dump of the error in the "appendix"
>
> Whilst this hasn't appeared to completely fix the issue, it's definitely a
> step in the right direction. As I said, for Node and JVM I am no longer
> getting this, and have seen an improve in the amount of metrics being
> collected. But it looks like there is still some work we can look at and do
>
> --------
> *Appendix*
>
> *Node NPE Error*
> ERROR - 2019-11-22 16:46:38.594;
> org.apache.solr.prometheus.scraper.SolrScraper; failed to request:
> /admin/metrics Timeout occured while waiting response from server at:
> http://127.0.0.1:8083/solr
> WARN  - 2019-11-22 16:46:59.105; org.apache.solr.prometheus.scraper.Async;
> Error occurred during metrics collection
> java.util.concurrent.ExecutionException: java.lang.NullPointerException
>         at
> java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
> ~[?:1.8.0_141]
>         at
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
> ~[?:1.8.0_141]
>         at
> org.apache.solr.prometheus.scraper.Async.lambda$null$1(Async.java:45)
> ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT
> e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
>         at
> java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
> ~[?:1.8.0_141]
>         at
> java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
> ~[?:1.8.0_141]
>         at
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1374)
> ~[?:1.8.0_141]
>         at
> java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
> ~[?:1.8.0_141]
>         at
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
> ~[?:1.8.0_141]
>         at
> java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
> ~[?:1.8.0_141]
>         at
> java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
> ~[?:1.8.0_141]
>         at
> java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> ~[?:1.8.0_141]
>         at
> java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
> ~[?:1.8.0_141]
>         at
> org.apache.solr.prometheus.scraper.Async.lambda$waitForAllSuccessfulResponses$3(Async.java:43)
> ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT
> e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
>         at
> java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
> ~[?:1.8.0_141]
>         at
> java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
> ~[?:1.8.0_141]
>         at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
> ~[?:1.8.0_141]
>         at
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1595)
> ~[?:1.8.0_141]
>         at
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
> ~[solr-solrj-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT
> e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:11]
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> [?:1.8.0_141]
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> [?:1.8.0_141]
>         at java.lang.Thread.run(Thread.java:748) [?:1.8.0_141]
> Caused by: java.lang.NullPointerException
>         at
> org.apache.solr.prometheus.scraper.SolrScraper.request(SolrScraper.java:112)
> ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT
> e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
>         at
> org.apache.solr.prometheus.scraper.SolrCloudScraper.lambda$metricsForAllHosts$6(SolrCloudScraper.java:121)
> ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT
> e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
>         at
> org.apache.solr.prometheus.scraper.SolrScraper.lambda$null$0(SolrScraper.java:81)
> ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT
> e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
>         at
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
> ~[?:1.8.0_141]
>         ... 4 more
>
>
> *Jetty Connection Pool Shut down*
> WARN  - 2019-11-22 16:41:57.505; org.apache.solr.prometheus.scraper.Async;
> Error occurred during metrics collection
> java.util.concurrent.ExecutionException: java.lang.IllegalStateException:
> Connection pool shut down
>         at
> java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
> ~[?:1.8.0_141]
>         at
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
> ~[?:1.8.0_141]
>         at
> org.apache.solr.prometheus.scraper.Async.lambda$null$1(Async.java:45)
> ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT
> e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
>         at
> java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
> ~[?:1.8.0_141]
>         at
> java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
> ~[?:1.8.0_141]
>         at
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1374)
> ~[?:1.8.0_141]
>         at
> java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
> ~[?:1.8.0_141]
>         at
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
> ~[?:1.8.0_141]
>         at
> java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
> ~[?:1.8.0_141]
>         at
> java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
> ~[?:1.8.0_141]
>         at
> java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> ~[?:1.8.0_141]
>         at
> java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
> ~[?:1.8.0_141]
>         at
> org.apache.solr.prometheus.scraper.Async.lambda$waitForAllSuccessfulResponses$3(Async.java:43)
> ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT
> e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
>         at
> java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
> ~[?:1.8.0_141]
>         at
> java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
> ~[?:1.8.0_141]
>         at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
> ~[?:1.8.0_141]
>         at
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1595)
> ~[?:1.8.0_141]
>         at
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
> ~[solr-solrj-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT
> e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:11]
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> [?:1.8.0_141]
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> [?:1.8.0_141]
>         at java.lang.Thread.run(Thread.java:748) [?:1.8.0_141]
> Caused by: java.lang.IllegalStateException: Connection pool shut down
>         at org.apache.http.util.Asserts.check(Asserts.java:34)
> ~[httpcore-4.4.10.jar:4.4.10]
>         at
> org.apache.http.pool.AbstractConnPool.lease(AbstractConnPool.java:191)
> ~[httpcore-4.4.10.jar:4.4.10]
>         at
> org.apache.http.impl.conn.PoolingHttpClientConnectionManager.requestConnection(PoolingHttpClientConnectionManager.java:267)
> ~[httpclient-4.5.6.jar:4.5.6]
>         at
> org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:176)
> ~[httpclient-4.5.6.jar:4.5.6]
>         at
> org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)
> ~[httpclient-4.5.6.jar:4.5.6]
>         at
> org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
> ~[httpclient-4.5.6.jar:4.5.6]
>         at
> org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
> ~[httpclient-4.5.6.jar:4.5.6]
>         at
> org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
> ~[httpclient-4.5.6.jar:4.5.6]
>         at
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
> ~[httpclient-4.5.6.jar:4.5.6]
>         at
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
> ~[httpclient-4.5.6.jar:4.5.6]
>         at
> org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:542)
> ~[solr-solrj-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT
> e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:11]
>         at
> org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:255)
> ~[solr-solrj-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT
> e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:11]
>         at
> org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:244)
> ~[solr-solrj-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT
> e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:11]
>         at
> org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1260)
> ~[solr-solrj-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT
> e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:11]
>         at
> org.apache.solr.prometheus.scraper.SolrScraper.request(SolrScraper.java:102)
> ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT
> e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
>         at
> org.apache.solr.prometheus.scraper.SolrCloudScraper.lambda$metricsForAllHosts$6(SolrCloudScraper.java:121)
> ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT
> e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
>         at
> org.apache.solr.prometheus.scraper.SolrScraper.lambda$null$0(SolrScraper.java:81)
> ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT
> e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
>         at
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
> ~[?:1.8.0_141]
>         ... 4 more
> INFO  - 2019-11-22 16:41:57.657;
> org.apache.solr.prometheus.collector.SchedulerMetricsCollector; Completed
> metrics collection
>
> On Fri, 22 Nov 2019 at 16:42, Alex Jablonski <aj...@thoughtworks.com>
> wrote:
>
>> Hey Richard,
>>
>> I'd definitely love to hear whether this improves things for you.
>> According to Guava's documentation, the cache can start evicting items when
>> it gets close to the limit (
>> https://github.com/google/guava/wiki/CachesExplained#size-based-eviction),
>> not just when it reaches it, so if this does end up helping you out, that
>> could be the reason. I haven't dug into the implementation of "close to the
>> maximum" that Guava's cache uses, or whether that would happen in the
>> course of building up the maps to ping all of the nodes (which is where the
>> issue for us seemed to be), but it's at least a possible avenue to explore
>> further.
>>
>> In any case, thanks for trying this out!
>>
>> On Fri, Nov 22, 2019 at 10:16 AM Richard Goodman <ri...@brandwatch.com>
>> wrote:
>>
>>> Hi Alex,
>>>
>>> This makes me really happy to see an email about this. I've been working
>>> on a little while about setting up the prometheus exporter for our
>>> clusters. Spent good amount of time setting up config, and started getting
>>> some really decent graphs in grafana on metrics we've never been able to
>>> collect before.
>>>
>>> For our stage environment, this worked like a charm, so shortly rolled
>>> it out to our live environment. This is when I started to get into trouble.
>>>
>>> I too was getting the exact problem you was facing, I then decided to
>>> split out all of my config so I had one config dedicated to JVM metric
>>> collection, one dedicated to Node level metrics etc., etc., I was still
>>> getting loads of errors coming through, which confused me.
>>>
>>> Our clusters are typically 96 nodes, so from your report, not sure how I
>>> would be getting this issue. One theory I had was the timeouts happening on
>>> the core admin API *(our indexes are range between 5gb-20gb in size
>>> each)*, and our clusters will typically be around 10s of TB in size.
>>> Because of this, when we have any replica state change, we notice
>>> significant delays in /solr/admin/cores , sometimes taking a few minutes to
>>> return.
>>>
>>> Because of this, I think there is a strong connection to the core admin
>>> being a problem here, the reason for this is we have 1 unique cluster where
>>> its typically storing 30days worth of data within its collections, new day
>>> comes along, we create a collection for that day, and any collections older
>>> than 30 days get dropped. Documents within this cluster typically don't
>>> change either, so there's never really any state change, and causes the
>>> cluster to be significantly reliable for us, where our other main group of
>>> clusters go through significant amount of change a day.
>>>
>>> I'm currently applying your patch into our build, and will deploy this
>>> and keep you updated to see if this helps. At the moment, I'm looking if
>>> there is a way to add a default to indexInfo=false to the core admin api,
>>> that could help us here *(because using that makes the response time
>>> insanely fast as per usual, however, does remove some statistics)*.
>>>
>>> With that though, its very experimental, and not sure if it's the best
>>> approach, but you have to start somewhere right?
>>>
>>> I'd be keen to look into this issue with you, as it's been a problem for
>>> us also.
>>>
>>> I'll reply again with any results I find from applying your patch.
>>>
>>> Cheers,
>>>
>>> On Wed, 20 Nov 2019 at 20:34, Alex Jablonski <
>>> ajablonski@thoughtworks.com> wrote:
>>>
>>>> Pull request is here: https://github.com/apache/lucene-solr/pull/1022/
>>>>
>>>> Thanks!
>>>> Alex Jablonski
>>>>
>>>> On Wed, Nov 20, 2019 at 1:36 PM Alex Jablonski <
>>>> ajablonski@thoughtworks.com> wrote:
>>>>
>>>>> Hi there!
>>>>>
>>>>> My colleague and I have run into an issue that seems to appear when
>>>>> running the Solr Prometheus exporter in SolrCloud mode against a large (>
>>>>> 100 node) cluster. The symptoms we're observing are "connection pool shut
>>>>> down" exceptions in the logs and the inability to collect metrics from more
>>>>> than 100 nodes in the cluster.
>>>>>
>>>>> We think we've traced down the issue to
>>>>> lucene-solr/solr/contrib/prometheus-exporter/src/java/org/apache/solr/prometheus/scraper/SolrCloudScraper.java
>>>>> . In that class, hostClientCache exists as a cache of HttpSolrClients
>>>>> (currently having fixed size 100) that, on evicting a client from the
>>>>> cache, closes the client's connection. The hostClientCache is used in
>>>>> createHttpSolrClients to return a map of base URLs to HttpSolrClients.
>>>>>
>>>>> Given, say, 300 base URLs, createHttpSolrClients will happily add
>>>>> those base URLs to the cache, and the "get" method on the cache will
>>>>> happily return the new additions to the cache. But on adding the 101st
>>>>> HttpSolrClient to the cache, the first HttpSolrClient gets evicted and
>>>>> closed. This repeats itself until the only open clients we have are to base
>>>>> URLs 201 through 300; clients for the first 200 base URLs will be returned,
>>>>> but will already have been closed. When we later use the result of
>>>>> createHttpSolrClients to collect metrics, expecting valid and open
>>>>> HttpSolrClients, we fail to connect when using any of those clients that
>>>>> have already been closed, leading to the "Connection pool shut down"
>>>>> exception and not collecting metrics from those nodes.
>>>>>
>>>>> Our idea for a fix was to change the existing cache to, instead of
>>>>> having a fixed maximum size, use `expireAfterAccess` with a timeout that's
>>>>> a multiple of the scrape interval (twice the scrape interval?). We wanted
>>>>> to confirm a few things:
>>>>>
>>>>> 1. Has this issue been reported before, and if so, is there another
>>>>> fix in progress already?
>>>>> 2. Does this approach seem desirable?
>>>>> 3. If so, are there any opinions on what the cache timeout should be
>>>>> besides just double the scrape interval?
>>>>>
>>>>> We'll also open a PR shortly with the changes we're proposing and link
>>>>> here. Please let me know if any of the above is unclear or incorrect.
>>>>>
>>>>> Thanks!
>>>>> Alex Jablonski
>>>>>
>>>>>
>>>
>>> --
>>>
>>> Richard Goodman    |    Data Infrastructure engineer
>>>
>>> richardg@brandwatch.com
>>>
>>>
>>> NEW YORK   | BOSTON   | BRIGHTON   | LONDON   | BERLIN |   STUTTGART |
>>> PARIS   | SINGAPORE | SYDNEY
>>>
>>> <https://www.brandwatch.com/blog/digital-consumer-intelligence/>
>>>
>>
>
> --
>
> Richard Goodman    |    Data Infrastructure engineer
>
> richardg@brandwatch.com
>
>
> NEW YORK   | BOSTON   | BRIGHTON   | LONDON   | BERLIN |   STUTTGART |
> PARIS   | SINGAPORE | SYDNEY
>
> <https://www.brandwatch.com/blog/digital-consumer-intelligence/>
>


-- 

Richard Goodman    |    Data Infrastructure engineer

richardg@brandwatch.com


NEW YORK   | BOSTON   | BRIGHTON   | LONDON   | BERLIN |   STUTTGART |
PARIS   | SINGAPORE | SYDNEY

<https://www.brandwatch.com/blog/digital-consumer-intelligence/>

Re: Solr Prometheus Exporter failing with "Connection pool shut down" on large cluster

Posted by Richard Goodman <ri...@brandwatch.com>.
Hey Serj,

So I've applied your patch, deployed the latest build of the
solr-prometheus exporter, and left it running for a little bit. It's had
some interesting results.

As I mentioned, I split my one monolithic config into separate configs, and
have an instance running for each one *(they are split up by the groups
defined in the /solr/admin/metrics endpoint; see the example after the
list)*. I have split them up as follows:

   - Core
   - Node
   - JVM
   - Jetty
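
For reference, each split-out instance is effectively scraping /admin/metrics
filtered to one group; a quick SolrJ sketch of that request (placeholder URL,
not the exporter's own code) looks like:

import org.apache.solr.client.solrj.SolrRequest;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.GenericSolrRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

public class MetricsGroupProbe {
  public static void main(String[] args) throws Exception {
    ModifiableSolrParams params = new ModifiableSolrParams();
    params.set("group", "jvm"); // or "core", "node", "jetty"

    try (HttpSolrClient client =
             new HttpSolrClient.Builder("http://127.0.0.1:8983/solr").build()) {
      GenericSolrRequest req =
          new GenericSolrRequest(SolrRequest.METHOD.GET, "/admin/metrics", params);
      System.out.println(req.process(client).getResponse());
    }
  }
}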

I was getting the "connection pool shut down" error for every single type.
However, this is not the case any more.

*Core*
Whilst I am no longer getting "Connection pool shut down", I am now getting
errors like:
ERROR - 2019-11-22 16:44:17.522;
org.apache.solr.prometheus.scraper.SolrScraper; failed to request:
/admin/metrics Timeout occured while waiting response from server at:
http://127.0.0.1:8085/solr

Because of this, I am now not collecting any core metrics at all. I've yet to
figure out where in the code this error is reported, and whether anything can
help with this.

*Node*
I did get an NPE that I've never seen before; however, this appeared to be a
one-off, and this instance is now also reporting zero connection pool shut
down issues, which I was getting previously. I'll attach a dump of the node
NPE below as an "appendix".

*JVM*
The patch seems to completely resolve the issue here; I have not been getting it any more.

*Jetty*
Unfortunately, I am still getting connection pool shut down errors here;
again, I will upload a dump of the error in the "appendix".

Whilst this hasn't completely fixed the issue, it's definitely a step in the
right direction. As I said, for Node and JVM I am no longer getting this, and
have seen an improvement in the number of metrics being collected. But it
looks like there is still some work we can look at and do.

--------
*Appendix*

*Node NPE Error*
ERROR - 2019-11-22 16:46:38.594;
org.apache.solr.prometheus.scraper.SolrScraper; failed to request:
/admin/metrics Timeout occured while waiting response from server at:
http://127.0.0.1:8083/solr
WARN  - 2019-11-22 16:46:59.105; org.apache.solr.prometheus.scraper.Async;
Error occurred during metrics collection
java.util.concurrent.ExecutionException: java.lang.NullPointerException
        at
java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
~[?:1.8.0_141]
        at
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
~[?:1.8.0_141]
        at
org.apache.solr.prometheus.scraper.Async.lambda$null$1(Async.java:45)
~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT
e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
        at
java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
~[?:1.8.0_141]
        at
java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
~[?:1.8.0_141]
        at
java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1374)
~[?:1.8.0_141]
        at
java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
~[?:1.8.0_141]
        at
java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
~[?:1.8.0_141]
        at
java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
~[?:1.8.0_141]
        at
java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
~[?:1.8.0_141]
        at
java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
~[?:1.8.0_141]
        at
java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
~[?:1.8.0_141]
        at
org.apache.solr.prometheus.scraper.Async.lambda$waitForAllSuccessfulResponses$3(Async.java:43)
~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT
e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
        at
java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
~[?:1.8.0_141]
        at
java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
~[?:1.8.0_141]
        at
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
~[?:1.8.0_141]
        at
java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1595)
~[?:1.8.0_141]
        at
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
~[solr-solrj-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT
e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:11]
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[?:1.8.0_141]
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[?:1.8.0_141]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_141]
Caused by: java.lang.NullPointerException
        at
org.apache.solr.prometheus.scraper.SolrScraper.request(SolrScraper.java:112)
~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT
e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
        at
org.apache.solr.prometheus.scraper.SolrCloudScraper.lambda$metricsForAllHosts$6(SolrCloudScraper.java:121)
~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT
e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
        at
org.apache.solr.prometheus.scraper.SolrScraper.lambda$null$0(SolrScraper.java:81)
~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT
e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
        at
java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
~[?:1.8.0_141]
        ... 4 more


*Jetty Connection Pool Shut down*
WARN  - 2019-11-22 16:41:57.505; org.apache.solr.prometheus.scraper.Async;
Error occurred during metrics collection
java.util.concurrent.ExecutionException: java.lang.IllegalStateException:
Connection pool shut down
        at
java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
~[?:1.8.0_141]
        at
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
~[?:1.8.0_141]
        at
org.apache.solr.prometheus.scraper.Async.lambda$null$1(Async.java:45)
~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT
e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
        at
java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
~[?:1.8.0_141]
        at
java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
~[?:1.8.0_141]
        at
java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1374)
~[?:1.8.0_141]
        at
java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
~[?:1.8.0_141]
        at
java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
~[?:1.8.0_141]
        at
java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
~[?:1.8.0_141]
        at
java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
~[?:1.8.0_141]
        at
java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
~[?:1.8.0_141]
        at
java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
~[?:1.8.0_141]
        at
org.apache.solr.prometheus.scraper.Async.lambda$waitForAllSuccessfulResponses$3(Async.java:43)
~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT
e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
        at
java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
~[?:1.8.0_141]
        at
java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
~[?:1.8.0_141]
        at
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
~[?:1.8.0_141]
        at
java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1595)
~[?:1.8.0_141]
        at
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
~[solr-solrj-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT
e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:11]
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[?:1.8.0_141]
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[?:1.8.0_141]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_141]
Caused by: java.lang.IllegalStateException: Connection pool shut down
        at org.apache.http.util.Asserts.check(Asserts.java:34)
~[httpcore-4.4.10.jar:4.4.10]
        at
org.apache.http.pool.AbstractConnPool.lease(AbstractConnPool.java:191)
~[httpcore-4.4.10.jar:4.4.10]
        at
org.apache.http.impl.conn.PoolingHttpClientConnectionManager.requestConnection(PoolingHttpClientConnectionManager.java:267)
~[httpclient-4.5.6.jar:4.5.6]
        at
org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:176)
~[httpclient-4.5.6.jar:4.5.6]
        at
org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)
~[httpclient-4.5.6.jar:4.5.6]
        at
org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
~[httpclient-4.5.6.jar:4.5.6]
        at
org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
~[httpclient-4.5.6.jar:4.5.6]
        at
org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
~[httpclient-4.5.6.jar:4.5.6]
        at
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
~[httpclient-4.5.6.jar:4.5.6]
        at
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
~[httpclient-4.5.6.jar:4.5.6]
        at
org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:542)
~[solr-solrj-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT
e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:11]
        at
org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:255)
~[solr-solrj-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT
e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:11]
        at
org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:244)
~[solr-solrj-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT
e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:11]
        at
org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1260)
~[solr-solrj-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT
e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:11]
        at
org.apache.solr.prometheus.scraper.SolrScraper.request(SolrScraper.java:102)
~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT
e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
        at
org.apache.solr.prometheus.scraper.SolrCloudScraper.lambda$metricsForAllHosts$6(SolrCloudScraper.java:121)
~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT
e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
        at
org.apache.solr.prometheus.scraper.SolrScraper.lambda$null$0(SolrScraper.java:81)
~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT
e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
        at
java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
~[?:1.8.0_141]
        ... 4 more
INFO  - 2019-11-22 16:41:57.657;
org.apache.solr.prometheus.collector.SchedulerMetricsCollector; Completed
metrics collection

On Fri, 22 Nov 2019 at 16:42, Alex Jablonski <aj...@thoughtworks.com>
wrote:

> Hey Richard,
>
> I'd definitely love to hear whether this improves things for you.
> According to Guava's documentation, the cache can start evicting items when
> it gets close to the limit (
> https://github.com/google/guava/wiki/CachesExplained#size-based-eviction),
> not just when it reaches it, so if this does end up helping you out, that
> could be the reason. I haven't dug into the implementation of "close to the
> maximum" that Guava's cache uses, or whether that would happen in the
> course of building up the maps to ping all of the nodes (which is where the
> issue for us seemed to be), but it's at least a possible avenue to explore
> further.
>
> In any case, thanks for trying this out!
>
> On Fri, Nov 22, 2019 at 10:16 AM Richard Goodman <ri...@brandwatch.com>
> wrote:
>
>> Hi Alex,
>>
>> This makes me really happy to see an email about this. I've been working
>> on a little while about setting up the prometheus exporter for our
>> clusters. Spent good amount of time setting up config, and started getting
>> some really decent graphs in grafana on metrics we've never been able to
>> collect before.
>>
>> For our stage environment, this worked like a charm, so shortly rolled it
>> out to our live environment. This is when I started to get into trouble.
>>
>> I too was getting the exact problem you was facing, I then decided to
>> split out all of my config so I had one config dedicated to JVM metric
>> collection, one dedicated to Node level metrics etc., etc., I was still
>> getting loads of errors coming through, which confused me.
>>
>> Our clusters are typically 96 nodes, so from your report, not sure how I
>> would be getting this issue. One theory I had was the timeouts happening on
>> the core admin API *(our indexes are range between 5gb-20gb in size
>> each)*, and our clusters will typically be around 10s of TB in size.
>> Because of this, when we have any replica state change, we notice
>> significant delays in /solr/admin/cores , sometimes taking a few minutes to
>> return.
>>
>> Because of this, I think there is a strong connection to the core admin
>> being a problem here, the reason for this is we have 1 unique cluster where
>> its typically storing 30days worth of data within its collections, new day
>> comes along, we create a collection for that day, and any collections older
>> than 30 days get dropped. Documents within this cluster typically don't
>> change either, so there's never really any state change, and causes the
>> cluster to be significantly reliable for us, where our other main group of
>> clusters go through significant amount of change a day.
>>
>> I'm currently applying your patch into our build, and will deploy this
>> and keep you updated to see if this helps. At the moment, I'm looking if
>> there is a way to add a default to indexInfo=false to the core admin api,
>> that could help us here *(because using that makes the response time
>> insanely fast as per usual, however, does remove some statistics)*.
>>
>> With that though, its very experimental, and not sure if it's the best
>> approach, but you have to start somewhere right?
>>
>> I'd be keen to look into this issue with you, as it's been a problem for
>> us also.
>>
>> I'll reply again with any results I find from applying your patch.
>>
>> Cheers,
>>
>> On Wed, 20 Nov 2019 at 20:34, Alex Jablonski <aj...@thoughtworks.com>
>> wrote:
>>
>>> Pull request is here: https://github.com/apache/lucene-solr/pull/1022/
>>>
>>> Thanks!
>>> Alex Jablonski
>>>
>>> On Wed, Nov 20, 2019 at 1:36 PM Alex Jablonski <
>>> ajablonski@thoughtworks.com> wrote:
>>>
>>>> Hi there!
>>>>
>>>> My colleague and I have run into an issue that seems to appear when
>>>> running the Solr Prometheus exporter in SolrCloud mode against a large (>
>>>> 100 node) cluster. The symptoms we're observing are "connection pool shut
>>>> down" exceptions in the logs and the inability to collect metrics from more
>>>> than 100 nodes in the cluster.
>>>>
>>>> We think we've traced down the issue to
>>>> lucene-solr/solr/contrib/prometheus-exporter/src/java/org/apache/solr/prometheus/scraper/SolrCloudScraper.java
>>>> . In that class, hostClientCache exists as a cache of HttpSolrClients
>>>> (currently having fixed size 100) that, on evicting a client from the
>>>> cache, closes the client's connection. The hostClientCache is used in
>>>> createHttpSolrClients to return a map of base URLs to HttpSolrClients.
>>>>
>>>> Given, say, 300 base URLs, createHttpSolrClients will happily add those
>>>> base URLs to the cache, and the "get" method on the cache will happily
>>>> return the new additions to the cache. But on adding the 101st
>>>> HttpSolrClient to the cache, the first HttpSolrClient gets evicted and
>>>> closed. This repeats itself until the only open clients we have are to base
>>>> URLs 201 through 300; clients for the first 200 base URLs will be returned,
>>>> but will already have been closed. When we later use the result of
>>>> createHttpSolrClients to collect metrics, expecting valid and open
>>>> HttpSolrClients, we fail to connect when using any of those clients that
>>>> have already been closed, leading to the "Connection pool shut down"
>>>> exception and not collecting metrics from those nodes.
>>>>
>>>> Our idea for a fix was to change the existing cache to, instead of
>>>> having a fixed maximum size, use `expireAfterAccess` with a timeout that's
>>>> a multiple of the scrape interval (twice the scrape interval?). We wanted
>>>> to confirm a few things:
>>>>
>>>> 1. Has this issue been reported before, and if so, is there another fix
>>>> in progress already?
>>>> 2. Does this approach seem desirable?
>>>> 3. If so, are there any opinions on what the cache timeout should be
>>>> besides just double the scrape interval?
>>>>
>>>> We'll also open a PR shortly with the changes we're proposing and link
>>>> here. Please let me know if any of the above is unclear or incorrect.
>>>>
>>>> Thanks!
>>>> Alex Jablonski
>>>>
>>>>
>>
>> --
>>
>> Richard Goodman    |    Data Infrastructure engineer
>>
>> richardg@brandwatch.com
>>
>>
>> NEW YORK   | BOSTON   | BRIGHTON   | LONDON   | BERLIN |   STUTTGART |
>> PARIS   | SINGAPORE | SYDNEY
>>
>> <https://www.brandwatch.com/blog/digital-consumer-intelligence/>
>>
>

-- 

Richard Goodman    |    Data Infrastructure engineer

richardg@brandwatch.com


NEW YORK   | BOSTON   | BRIGHTON   | LONDON   | BERLIN |   STUTTGART |
PARIS   | SINGAPORE | SYDNEY

<https://www.brandwatch.com/blog/digital-consumer-intelligence/>

Re: Solr Prometheus Exporter failing with "Connection pool shut down" on large cluster

Posted by Alex Jablonski <aj...@thoughtworks.com>.
Hey Richard,

I'd definitely love to hear whether this improves things for you. According
to Guava's documentation, the cache can start evicting items when it gets
close to the limit (
https://github.com/google/guava/wiki/CachesExplained#size-based-eviction),
not just when it reaches it, so if this does end up helping you out, that
could be the reason. I haven't dug into the implementation of "close to the
maximum" that Guava's cache uses, or whether that would happen in the
course of building up the maps to ping all of the nodes (which is where the
issue for us seemed to be), but it's at least a possible avenue to explore
further.
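
If it's useful for checking that, a tiny probe like the one below
(hypothetical, not exporter code) makes the eviction behaviour observable
through Guava's cache statistics:

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;

public class EvictionStatsProbe {
  public static void main(String[] args) throws Exception {
    LoadingCache<String, Object> cache = CacheBuilder.newBuilder()
        .maximumSize(100)
        .recordStats()
        .build(new CacheLoader<String, Object>() {
          @Override
          public Object load(String key) {
            return new Object();
          }
        });

    // Load exactly as many keys as the nominal maximum.
    for (int i = 0; i < 100; i++) {
      cache.get("host-" + i);
    }

    // A non-zero count here would mean entries were evicted before the cache
    // strictly reached its maximum size.
    System.out.println("evictions: " + cache.stats().evictionCount());
  }
}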

In any case, thanks for trying this out!

On Fri, Nov 22, 2019 at 10:16 AM Richard Goodman <ri...@brandwatch.com>
wrote:

> Hi Alex,
>
> This makes me really happy to see an email about this. I've been working
> on a little while about setting up the prometheus exporter for our
> clusters. Spent good amount of time setting up config, and started getting
> some really decent graphs in grafana on metrics we've never been able to
> collect before.
>
> For our stage environment, this worked like a charm, so shortly rolled it
> out to our live environment. This is when I started to get into trouble.
>
> I too was getting the exact problem you was facing, I then decided to
> split out all of my config so I had one config dedicated to JVM metric
> collection, one dedicated to Node level metrics etc., etc., I was still
> getting loads of errors coming through, which confused me.
>
> Our clusters are typically 96 nodes, so from your report, not sure how I
> would be getting this issue. One theory I had was the timeouts happening on
> the core admin API *(our indexes are range between 5gb-20gb in size each)*,
> and our clusters will typically be around 10s of TB in size. Because of
> this, when we have any replica state change, we notice significant delays
> in /solr/admin/cores , sometimes taking a few minutes to return.
>
> Because of this, I think there is a strong connection to the core admin
> being a problem here, the reason for this is we have 1 unique cluster where
> its typically storing 30days worth of data within its collections, new day
> comes along, we create a collection for that day, and any collections older
> than 30 days get dropped. Documents within this cluster typically don't
> change either, so there's never really any state change, and causes the
> cluster to be significantly reliable for us, where our other main group of
> clusters go through significant amount of change a day.
>
> I'm currently applying your patch into our build, and will deploy this and
> keep you updated to see if this helps. At the moment, I'm looking if there
> is a way to add a default to indexInfo=false to the core admin api, that
> could help us here *(because using that makes the response time insanely
> fast as per usual, however, does remove some statistics)*.
>
> With that though, its very experimental, and not sure if it's the best
> approach, but you have to start somewhere right?
>
> I'd be keen to look into this issue with you, as it's been a problem for
> us also.
>
> I'll reply again with any results I find from applying your patch.
>
> Cheers,
>
> On Wed, 20 Nov 2019 at 20:34, Alex Jablonski <aj...@thoughtworks.com>
> wrote:
>
>> Pull request is here: https://github.com/apache/lucene-solr/pull/1022/
>>
>> Thanks!
>> Alex Jablonski
>>
>> On Wed, Nov 20, 2019 at 1:36 PM Alex Jablonski <
>> ajablonski@thoughtworks.com> wrote:
>>
>>> Hi there!
>>>
>>> My colleague and I have run into an issue that seems to appear when
>>> running the Solr Prometheus exporter in SolrCloud mode against a large (>
>>> 100 node) cluster. The symptoms we're observing are "connection pool shut
>>> down" exceptions in the logs and the inability to collect metrics from more
>>> than 100 nodes in the cluster.
>>>
>>> We think we've traced down the issue to
>>> lucene-solr/solr/contrib/prometheus-exporter/src/java/org/apache/solr/prometheus/scraper/SolrCloudScraper.java
>>> . In that class, hostClientCache exists as a cache of HttpSolrClients
>>> (currently having fixed size 100) that, on evicting a client from the
>>> cache, closes the client's connection. The hostClientCache is used in
>>> createHttpSolrClients to return a map of base URLs to HttpSolrClients.
>>>
>>> Given, say, 300 base URLs, createHttpSolrClients will happily add those
>>> base URLs to the cache, and the "get" method on the cache will happily
>>> return the new additions to the cache. But on adding the 101st
>>> HttpSolrClient to the cache, the first HttpSolrClient gets evicted and
>>> closed. This repeats itself until the only open clients we have are to base
>>> URLs 201 through 300; clients for the first 200 base URLs will be returned,
>>> but will already have been closed. When we later use the result of
>>> createHttpSolrClients to collect metrics, expecting valid and open
>>> HttpSolrClients, we fail to connect when using any of those clients that
>>> have already been closed, leading to the "Connection pool shut down"
>>> exception and not collecting metrics from those nodes.
>>>
>>> Our idea for a fix was to change the existing cache to, instead of
>>> having a fixed maximum size, use `expireAfterAccess` with a timeout that's
>>> a multiple of the scrape interval (twice the scrape interval?). We wanted
>>> to confirm a few things:
>>>
>>> 1. Has this issue been reported before, and if so, is there another fix
>>> in progress already?
>>> 2. Does this approach seem desirable?
>>> 3. If so, are there any opinions on what the cache timeout should be
>>> besides just double the scrape interval?
>>>
>>> We'll also open a PR shortly with the changes we're proposing and link
>>> here. Please let me know if any of the above is unclear or incorrect.
>>>
>>> Thanks!
>>> Alex Jablonski
>>>
>>>
>
> --
>
> Richard Goodman    |    Data Infrastructure engineer
>
> richardg@brandwatch.com
>
>
> NEW YORK   | BOSTON   | BRIGHTON   | LONDON   | BERLIN |   STUTTGART |
> PARIS   | SINGAPORE | SYDNEY
>
> <https://www.brandwatch.com/blog/digital-consumer-intelligence/>
>

Re: Solr Prometheus Exporter failing with "Connection pool shut down" on large cluster

Posted by Richard Goodman <ri...@brandwatch.com>.
Hi Alex,

It makes me really happy to see an email about this. I've been working for a
little while on setting up the prometheus exporter for our clusters. I spent a
good amount of time setting up config, and started getting some really decent
graphs in Grafana on metrics we've never been able to collect before.

For our stage environment, this worked like a charm, so I shortly rolled it
out to our live environment. This is when I started to run into trouble.

I too was getting the exact problem you were facing. I then decided to split
out all of my config so I had one config dedicated to JVM metric collection,
one dedicated to Node-level metrics, etc., but I was still getting loads of
errors coming through, which confused me.

Our clusters are typically 96 nodes, so from your report I'm not sure how I
would be getting this issue. One theory I had was timeouts happening on the
core admin API *(our indexes range between 5GB and 20GB in size each)*, and
our clusters will typically be around tens of TB in size. Because of this,
when we have any replica state change, we notice significant delays in
/solr/admin/cores, sometimes taking a few minutes to return.

Because of this, I think there is a strong connection to the core admin being
the problem here. The reason is that we have one unique cluster that typically
stores 30 days' worth of data within its collections: a new day comes along,
we create a collection for that day, and any collections older than 30 days
get dropped. Documents within this cluster typically don't change either, so
there's never really any state change, which makes the cluster significantly
more reliable for us, whereas our other main group of clusters goes through a
significant amount of change each day.

I'm currently applying your patch to our build, and will deploy it and keep
you updated on whether this helps. At the moment, I'm looking at whether there
is a way to default indexInfo=false on the core admin API, which could help us
here *(because using that makes the response time insanely fast as usual;
however, it does remove some statistics)*.
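
As a rough illustration of what I mean (hypothetical probe, not a patch, and
the URL is a placeholder), hitting the CoreAdmin STATUS action with
indexInfo=false via SolrJ would look something like:

import org.apache.solr.client.solrj.SolrRequest;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.GenericSolrRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

public class CoreStatusProbe {
  public static void main(String[] args) throws Exception {
    ModifiableSolrParams params = new ModifiableSolrParams();
    params.set("action", "STATUS");
    params.set("indexInfo", "false"); // skip per-core index stats, much cheaper on big cores

    try (HttpSolrClient client =
             new HttpSolrClient.Builder("http://127.0.0.1:8983/solr").build()) {
      long start = System.nanoTime();
      GenericSolrRequest status =
          new GenericSolrRequest(SolrRequest.METHOD.GET, "/admin/cores", params);
      System.out.println(status.process(client).getResponse());
      System.out.println("took ms: " + (System.nanoTime() - start) / 1_000_000);
    }
  }
}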

With that though, it's very experimental, and I'm not sure if it's the best
approach, but you have to start somewhere, right?

I'd be keen to look into this issue with you, as it's been a problem for us
also.

I'll reply again with any results I find from applying your patch.

Cheers,

On Wed, 20 Nov 2019 at 20:34, Alex Jablonski <aj...@thoughtworks.com>
wrote:

> Pull request is here: https://github.com/apache/lucene-solr/pull/1022/
>
> Thanks!
> Alex Jablonski
>
> On Wed, Nov 20, 2019 at 1:36 PM Alex Jablonski <
> ajablonski@thoughtworks.com> wrote:
>
>> Hi there!
>>
>> My colleague and I have run into an issue that seems to appear when
>> running the Solr Prometheus exporter in SolrCloud mode against a large (>
>> 100 node) cluster. The symptoms we're observing are "connection pool shut
>> down" exceptions in the logs and the inability to collect metrics from more
>> than 100 nodes in the cluster.
>>
>> We think we've traced down the issue to
>> lucene-solr/solr/contrib/prometheus-exporter/src/java/org/apache/solr/prometheus/scraper/SolrCloudScraper.java
>> . In that class, hostClientCache exists as a cache of HttpSolrClients
>> (currently having fixed size 100) that, on evicting a client from the
>> cache, closes the client's connection. The hostClientCache is used in
>> createHttpSolrClients to return a map of base URLs to HttpSolrClients.
>>
>> Given, say, 300 base URLs, createHttpSolrClients will happily add those
>> base URLs to the cache, and the "get" method on the cache will happily
>> return the new additions to the cache. But on adding the 101st
>> HttpSolrClient to the cache, the first HttpSolrClient gets evicted and
>> closed. This repeats itself until the only open clients we have are to base
>> URLs 201 through 300; clients for the first 200 base URLs will be returned,
>> but will already have been closed. When we later use the result of
>> createHttpSolrClients to collect metrics, expecting valid and open
>> HttpSolrClients, we fail to connect when using any of those clients that
>> have already been closed, leading to the "Connection pool shut down"
>> exception and not collecting metrics from those nodes.
>>
>> Our idea for a fix was to change the existing cache to, instead of having
>> a fixed maximum size, use `expireAfterAccess` with a timeout that's a
>> multiple of the scrape interval (twice the scrape interval?). We wanted to
>> confirm a few things:
>>
>> 1. Has this issue been reported before, and if so, is there another fix
>> in progress already?
>> 2. Does this approach seem desirable?
>> 3. If so, are there any opinions on what the cache timeout should be
>> besides just double the scrape interval?
>>
>> We'll also open a PR shortly with the changes we're proposing and link
>> here. Please let me know if any of the above is unclear or incorrect.
>>
>> Thanks!
>> Alex Jablonski
>>
>>

-- 

Richard Goodman    |    Data Infrastructure engineer

richardg@brandwatch.com


NEW YORK   | BOSTON   | BRIGHTON   | LONDON   | BERLIN |   STUTTGART |
PARIS   | SINGAPORE | SYDNEY

<https://www.brandwatch.com/blog/digital-consumer-intelligence/>

Re: Solr Prometheus Exporter failing with "Connection pool shut down" on large cluster

Posted by Alex Jablonski <aj...@thoughtworks.com>.
Pull request is here: https://github.com/apache/lucene-solr/pull/1022/

Thanks!
Alex Jablonski

On Wed, Nov 20, 2019 at 1:36 PM Alex Jablonski <aj...@thoughtworks.com>
wrote:

> Hi there!
>
> My colleague and I have run into an issue that seems to appear when
> running the Solr Prometheus exporter in SolrCloud mode against a large (>
> 100 node) cluster. The symptoms we're observing are "connection pool shut
> down" exceptions in the logs and the inability to collect metrics from more
> than 100 nodes in the cluster.
>
> We think we've traced down the issue to
> lucene-solr/solr/contrib/prometheus-exporter/src/java/org/apache/solr/prometheus/scraper/SolrCloudScraper.java
> . In that class, hostClientCache exists as a cache of HttpSolrClients
> (currently having fixed size 100) that, on evicting a client from the
> cache, closes the client's connection. The hostClientCache is used in
> createHttpSolrClients to return a map of base URLs to HttpSolrClients.
>
> Given, say, 300 base URLs, createHttpSolrClients will happily add those
> base URLs to the cache, and the "get" method on the cache will happily
> return the new additions to the cache. But on adding the 101st
> HttpSolrClient to the cache, the first HttpSolrClient gets evicted and
> closed. This repeats itself until the only open clients we have are to base
> URLs 201 through 300; clients for the first 200 base URLs will be returned,
> but will already have been closed. When we later use the result of
> createHttpSolrClients to collect metrics, expecting valid and open
> HttpSolrClients, we fail to connect when using any of those clients that
> have already been closed, leading to the "Connection pool shut down"
> exception and not collecting metrics from those nodes.
>
> Our idea for a fix was to change the existing cache to, instead of having
> a fixed maximum size, use `expireAfterAccess` with a timeout that's a
> multiple of the scrape interval (twice the scrape interval?). We wanted to
> confirm a few things:
>
> 1. Has this issue been reported before, and if so, is there another fix in
> progress already?
> 2. Does this approach seem desirable?
> 3. If so, are there any opinions on what the cache timeout should be
> besides just double the scrape interval?
>
> We'll also open a PR shortly with the changes we're proposing and link
> here. Please let me know if any of the above is unclear or incorrect.
>
> Thanks!
> Alex Jablonski
>
>