Posted to dev@lucene.apache.org by "Hoss Man (JIRA)" <ji...@apache.org> on 2019/07/03 22:42:00 UTC
[jira] [Updated] (SOLR-13532) Unable to start core recovery due to
timeout in ping request
[ https://issues.apache.org/jira/browse/SOLR-13532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hoss Man updated SOLR-13532:
----------------------------
Attachment: SOLR-13532.patch
Status: Open (was: Open)
bq. The other alternative to this would be to update the {{RecoveryStrategy}} code to use something like {{cc.getConfig().getUpdateShardHandlerConfig()}} ...
Here's a variant of Suril's patch along those lines, with some refactoring to put the logic into a helper method.
I don't love it -- but i don't hate it either.
I'm still running tests to make sure i didn't break anything, but in the meantime what do folks think? ... can anyone see any problems with this approach?
([~surilshah]: does this patch -- and the usage of the solr.xml configured values instead of hardcoded magic constants -- solve the problems you're seeing?)
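For anyone skimming without the patch open, here is a minimal sketch of the general idea: the helper takes its socket and connection timeouts from a configuration object rather than from literals. The types {{ShardHandlerTimeouts}} and {{ClientSettings}} and the method {{buildRecoveryClientSettings}} are hypothetical stand-ins so the example compiles on its own; they are not the Solr classes, and this is not the attached patch.

```java
import java.util.Objects;

/** Sketch only: every type here is a hypothetical stand-in, not a Solr class. */
public class RecoveryClientSketch {

    /** Stand-in for the timeouts an operator configures (for example in solr.xml). */
    public static final class ShardHandlerTimeouts {
        public final int socketTimeoutMs;
        public final int connectionTimeoutMs;

        public ShardHandlerTimeouts(int socketTimeoutMs, int connectionTimeoutMs) {
            if (socketTimeoutMs < 0 || connectionTimeoutMs < 0) {
                throw new IllegalArgumentException("timeouts must be non-negative");
            }
            this.socketTimeoutMs = socketTimeoutMs;
            this.connectionTimeoutMs = connectionTimeoutMs;
        }
    }

    /** Stand-in for the values that would be handed to an HTTP client builder. */
    public static final class ClientSettings {
        public final String baseUrl;
        public final int socketTimeoutMs;
        public final int connectionTimeoutMs;

        ClientSettings(String baseUrl, int socketTimeoutMs, int connectionTimeoutMs) {
            this.baseUrl = baseUrl;
            this.socketTimeoutMs = socketTimeoutMs;
            this.connectionTimeoutMs = connectionTimeoutMs;
        }
    }

    /**
     * Single place where recovery clients are configured, so no call site
     * needs its own hard-coded timeout constants.
     */
    public static ClientSettings buildRecoveryClientSettings(String leaderUrl,
                                                             ShardHandlerTimeouts cfg) {
        Objects.requireNonNull(leaderUrl, "leaderUrl");
        Objects.requireNonNull(cfg, "cfg");
        return new ClientSettings(leaderUrl, cfg.socketTimeoutMs, cfg.connectionTimeoutMs);
    }

    public static void main(String[] args) {
        // Example values only; real values would come from the node's configuration.
        ShardHandlerTimeouts cfg = new ShardHandlerTimeouts(60_000, 15_000);
        ClientSettings s = buildRecoveryClientSettings("http://leader:8983/solr/core1", cfg);
        System.out.println(s.baseUrl + " socket=" + s.socketTimeoutMs
                + "ms connect=" + s.connectionTimeoutMs + "ms");
    }
}
```

The point of routing everything through one helper is that the ping and any other recovery request pick up the same operator-configured values.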
> Unable to start core recovery due to timeout in ping request
> ------------------------------------------------------------
>
> Key: SOLR-13532
> URL: https://issues.apache.org/jira/browse/SOLR-13532
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Affects Versions: 7.6
> Reporter: Suril Shah
> Priority: Major
> Attachments: SOLR-13532.patch
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> Discovered the following issue with core recovery:
> * Core recovery is not being initialized, and the following message is logged:
> {code:java}
> 2019-06-07 00:53:12.436 INFO (recoveryExecutor-4-thread-1-processing-n:<solr_ip>:8983_solr x:<collection_name>_shard41_replica_n2777 c:<collection_name> s:shard41 r:core_node2778) x:<collection_name>_shard41_replica_n2777 o.a.s.c.RecoveryStrategy Failed to connect leader http://<solr_ip>:8983/solr on recovery, try again{code}
> * The above error occurs when the ping request takes longer than its timeout, which is hard-coded to one second in the Solr source code. However, in a typical production setting it is common for a ping to take more than one second; hence core recovery never starts and the exception is thrown.
> * The other major concern is that this exception is logged as an info message, so it is very difficult to identify the error if info logging is not enabled.
> * Please refer to the following code snippet from the [source code|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L789-L803] to understand the above issue.
> {code:java}
> try (HttpSolrClient httpSolrClient = new HttpSolrClient.Builder(leaderReplica.getCoreUrl())
>     .withSocketTimeout(1000)
>     .withConnectionTimeout(1000)
>     .withHttpClient(cc.getUpdateShardHandler().getRecoveryOnlyHttpClient())
>     .build()) {
>   SolrPingResponse resp = httpSolrClient.ping();
>   return leaderReplica;
> } catch (IOException e) {
>   log.info("Failed to connect leader {} on recovery, try again", leaderReplica.getBaseUrl());
>   Thread.sleep(500);
> } catch (Exception e) {
>   if (e.getCause() instanceof IOException) {
>     log.info("Failed to connect leader {} on recovery, try again", leaderReplica.getBaseUrl());
>     Thread.sleep(500);
>   } else {
>     return leaderReplica;
>   }
> }
> {code}
> The above issue will have a high impact on production clusters, since cores that are unable to recover may lead to data loss.
> The following improvements would be really helpful:
> 1. The [timeout for ping request|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L790-L791] in *RecoveryStrategy.java* should be configurable, with the defaults set to higher values such as 15 seconds.
> 2. The exception messages at [line 797|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L797] and [line 801|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L801] in *RecoveryStrategy.java* should be logged as *error* messages instead of *info* messages.
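The two requested improvements can be sketched together as a retry loop that takes its timeout as a parameter and logs each failure at error level. This is an illustration under assumptions, not the Solr code: the {{Ping}} interface, the {{pingUntilSuccess}} method and the attempt limit are hypothetical, and {{java.util.logging}} stands in for the logger used in {{RecoveryStrategy}}.

```java
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Sketch only: hypothetical names, no Solr dependencies. */
public class LeaderPingRetrySketch {

    private static final Logger LOG = Logger.getLogger(LeaderPingRetrySketch.class.getName());

    /** Hypothetical abstraction over "ping the leader with this timeout". */
    public interface Ping {
        void ping(int timeoutMs) throws IOException;
    }

    /**
     * Pings until one attempt succeeds or maxAttempts is exhausted.
     *
     * @return the 1-based attempt that succeeded, or -1 if none did
     *         (including when the thread is interrupted while waiting)
     */
    public static int pingUntilSuccess(Ping ping, int timeoutMs, int maxAttempts, long sleepMs) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                ping.ping(timeoutMs); // timeout is supplied by the caller, not hard-coded
                return attempt;
            } catch (IOException e) {
                // Logged at SEVERE so the failure is visible even when info logging is off.
                LOG.log(Level.SEVERE,
                        "Failed to connect leader on recovery (attempt " + attempt + "), try again", e);
                try {
                    Thread.sleep(sleepMs);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt(); // preserve the interrupt for callers
                    return -1;
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // A fake ping that fails twice before succeeding, to exercise the loop.
        int[] calls = {0};
        int attempt = pingUntilSuccess(t -> {
            if (++calls[0] < 3) {
                throw new IOException("simulated ping timeout after " + t + "ms");
            }
        }, 15_000, 5, 10L);
        System.out.println("succeeded on attempt " + attempt);
    }
}
```

Whether error or warn is the right level for a retried failure is a judgment call; the sketch follows the request in the issue description.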