Posted to solr-user@lucene.apache.org by Elizaveta Golova <EG...@uk.ibm.com> on 2020/10/01 13:11:51 UTC

HealthCheck not working when network problems between the client and a Solr node

Hello,
 
We are using Solr 8.5.2.
 
We are having trouble dealing with network errors between a Solr node and a client.
In our situation, our Solr nodes and ZooKeeper hosts are healthy and can communicate with each other, and all our collections are up and healthy.
 
When we simulate a network problem between a client and a Solr node (whilst maintaining the connections and healthy status of everything else), our Admin health check (HealthCheckRequest) fails, and we get a
"org.apache.solr.client.solrj.SolrServerException: IOException occurred when talking to server at: https://solr2:8984/solr "
with the root cause being a 
"java.net.SocketTimeoutException: connect timed out"
(seen in LBSolrClient).
 
For admin commands, it appears that the client's Zombie list is updated, and the operation continues on another node, only when the root cause is a ConnectException.
We can confirm that a ConnectException works as we would like (we substituted it manually in the debugger): the operation succeeds, and subsequent calls to the client treat our blocked node as a Zombie.

A SocketTimeoutException, by contrast, neither updates the client's Zombie list nor lets the operation continue; the request fails with an overall exception instead.
Because the Zombie list is not updated, the next call with the same client hits the same problem: the blocked node is still the first one returned in the live nodes list, so it is the first node the request is sent to.
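
To make the distinction above concrete, here is a minimal sketch using only the JDK. It is not the LBSolrClient code; ConnectFailureClassifier and its methods are hypothetical names. It walks the cause chain to the root cause and treats a connect timeout like a refused connection. Matching on the "connect timed out" message is an assumption taken from the stack trace in this post and is likely to be brittle across JVMs.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.SocketTimeoutException;
import java.util.Locale;

// Hypothetical helper, not part of SolrJ.
final class ConnectFailureClassifier {

    /** Walks the cause chain to the innermost exception, with a depth guard. */
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        int depth = 0;
        while (current.getCause() != null && depth++ < 64) {
            current = current.getCause();
        }
        return current;
    }

    /**
     * Assumed policy: a refused connection and a timeout while connecting
     * both mean the request never reached the node.
     */
    static boolean isConnectFailure(Throwable t) {
        Throwable root = rootCause(t);
        if (root instanceof ConnectException) {
            return true;
        }
        // Assumption: connect timeouts carry this message; read timeouts do not.
        return root instanceof SocketTimeoutException
                && root.getMessage() != null
                && root.getMessage().toLowerCase(Locale.ROOT).contains("connect timed out");
    }

    public static void main(String[] args) {
        // Mirrors the shape of the stack trace: wrapper -> IOException -> SocketTimeoutException.
        Exception wrapped = new RuntimeException(
                "IOException occurred when talking to server",
                new IOException("Connect to solr2:8984 failed",
                        new SocketTimeoutException("connect timed out")));
        System.out.println(isConnectFailure(wrapped));
    }
}
```

A timeout while reading a response is deliberately not classified as a connect failure here, because in that case the request may already have reached the node.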
 
How can we work around this?
 
We have drilled down into the LBSolrClient to have a look.
 
Our main concern is that we believe that this will also be a problem for us with Updates.
 
 
An example scenario:
Solr1 on server Solr1
Solr2 on server Solr2
A collection with replication factor 2, with replicas of each shard hosted on both Solr nodes.
An application server is on ApplicationServer1.
Another application server is on ApplicationServer2.
 
The Solr Nodes are up and the collection is healthy.
 
(Depending on the order of the live nodes)
If access is blocked to Solr2 from ApplicationServer1, update from ApplicationServer1 should succeed and a health check/ping from ApplicationServer1 should return "healthy".
Update from ApplicationServer2 should succeed and health check/ping from ApplicationServer2 should return "healthy".
 
If access is then unblocked to Solr2 from ApplicationServer1 but blocked to Solr1, then update from ApplicationServer1 fails and a health check/ping from ApplicationServer1 throws an exception.
Update from ApplicationServer2 should succeed and health check/ping from ApplicationServer2 should return "healthy".
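
The behaviour we would like in this scenario can be sketched as follows. This is a simplified stand-in, not SolrJ code: FailoverSketch, NodeCall and the solr1 URL are hypothetical, and treating every IOException as "node unreachable" is an assumption made to keep the example short. The blocked node is tried first, marked as a Zombie, and skipped on the next request.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical client-side failover wrapper, for illustration only.
final class FailoverSketch {

    /** Stand-in for "send this request to that node". */
    interface NodeCall<T> {
        T call(String nodeUrl) throws IOException;
    }

    private final List<String> liveNodes;
    private final Set<String> zombies = new LinkedHashSet<>();

    FailoverSketch(List<String> liveNodes) {
        this.liveNodes = new ArrayList<>(liveNodes);
    }

    /** Tries each non-zombie node in order; a failing node becomes a zombie. */
    <T> T request(NodeCall<T> call) throws IOException {
        IOException last = null;
        for (String node : liveNodes) {
            if (zombies.contains(node)) {
                continue; // skip nodes that already failed for this client
            }
            try {
                return call.call(node);
            } catch (IOException e) {
                zombies.add(node); // assumption: any IOException marks the node
                last = e;
            }
        }
        if (last == null) {
            throw new IOException("no usable node: all nodes are marked as zombies");
        }
        throw last;
    }

    Set<String> zombies() {
        return zombies;
    }

    public static void main(String[] args) throws IOException {
        // The blocked node comes first in the live nodes order.
        FailoverSketch client = new FailoverSketch(
                Arrays.asList("https://solr2:8984/solr", "https://solr1:8984/solr"));
        NodeCall<String> healthCheck = node -> {
            if (node.contains("solr2")) {
                throw new SocketTimeoutException("connect timed out");
            }
            return "healthy";
        };
        System.out.println(client.request(healthCheck)); // fails over to solr1
        System.out.println(client.zombies());            // solr2 is now a zombie
    }
}
```

A real client would also need to re-check zombie nodes periodically so that a node is used again once the network recovers; that part is omitted here.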
 
Redacted stacktrace:

[err] org.apache.solr.client.solrj.SolrServerException: IOException occurred when talking to server at: https://solr2:8984/solr
[err]     at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:695)
[err]     at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:266)
[err]     at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:248)
[err]     at org.apache.solr.client.solrj.impl.LBSolrClient.doRequest(LBSolrClient.java:370)
[err]     at org.apache.solr.client.solrj.impl.LBSolrClient.request(LBSolrClient.java:298)
[err]     at org.apache.solr.client.solrj.impl.BaseCloudSolrClient.sendRequest(BaseCloudSolrClient.java:1157)
[err]     at org.apache.solr.client.solrj.impl.BaseCloudSolrClient.requestWithRetryOnStaleState(BaseCloudSolrClient.java:918)
[err]     at org.apache.solr.client.solrj.impl.BaseCloudSolrClient.request(BaseCloudSolrClient.java:850)
[err]     at <Redacted internal package that calls through to the SolrClient> (SolrClientProxy.java:136)
[err]     at <Redacted internal calls>
[err]     at <Redacted internal calls>
[err]     at <Redacted internal calls>
[err]     at <Redacted internal calls>
[err]     at <Redacted internal calls>
[err]     at <Redacted internal calls>
[err]     at <Redacted internal calls>
[err] Caused by: 
[err] org.apache.http.conn.ConnectTimeoutException: Connect to solr2:8984 [solr2/172.18.0.6] failed: connect timed out
[err]     at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:151)
[err]     at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:374)
[err]     at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393)
[err]     at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
[err]     at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
[err]     at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
[err]     at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
[err]     at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
[err]     at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
[err]     at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
[err]     at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:571)
[err]     ... 22 more
[err] Caused by: 
[err] java.net.SocketTimeoutException: connect timed out
[err]     at java.net.PlainSocketImpl.socketConnect(Native Method)
[err]     at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
[err]     at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
[err]     at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
[err]     at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
[err]     at java.net.Socket.connect(Socket.java:607)
[err]     at org.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:368)
[err]     at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
[err]     ... 32 more
 
 
We were also wondering why admin requests that do not modify anything, e.g. a Ping or a HealthCheck, are nonRetryable? They should be idempotent too, shouldn't they?
 
Thanks!
Lisa