Posted to dev@lucene.apache.org by "Ayon Sinha (JIRA)" <ji...@apache.org> on 2015/12/02 20:38:11 UTC

[jira] [Commented] (SOLR-8225) Leader should send update requests to replicas in recovery asynchronously

    [ https://issues.apache.org/jira/browse/SOLR-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15036443#comment-15036443 ] 

Ayon Sinha commented on SOLR-8225:
----------------------------------

This problem is becoming a deal-breaker for any Solr cluster. The larger the cluster becomes, the higher the likelihood that at least one replica is unhealthy, slow, or recovering. As it stands, indexing comes to a grinding halt when one or more replicas are recovering.
To begin fixing this, we MUST at least add a setting under which the leader does not send updates to a recovering replica at all. The replica should get those updates from wherever it is recovering from.

[~yonik@apache.org] Can you please comment on the best way to handle this, so that we can take it on and submit the patch?
This patch and https://issues.apache.org/jira/browse/SOLR-8227 need to be considered together.

> Leader should send update requests to replicas in recovery asynchronously
> -------------------------------------------------------------------------
>
>                 Key: SOLR-8225
>                 URL: https://issues.apache.org/jira/browse/SOLR-8225
>             Project: Solr
>          Issue Type: Improvement
>          Components: SolrCloud
>            Reporter: Timothy Potter
>
> When a replica goes into recovery, the leader still sends docs to that replica while it is recovering. What I'm seeing is that the recovering node is still slow to respond to the leader (at least slower than the healthy replicas). Thus it would be good if the leader could send the updates to the recovering replica asynchronously, i.e. the leader will block as it does today when forwarding updates to healthy / active replicas, but send updates to recovering replicas async, thus preventing the whole update request from being slowed down by a potentially degraded replica.
> FWIW - I've actually seen this occur in an environment that has more than 3 replicas per shard. One of the replicas went into recovery and then was much slower to handle requests than the healthy replicas, but the leader had to wait for the slowest replica.
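
For discussion, the behaviour described in the issue could be sketched roughly as below. This is only an illustration and not Solr code: the class AsyncRecoveryForwarding, the State enum, the Replica type and the forward() method are all hypothetical names, and the "sender" callback stands in for whatever request the leader really makes to a replica. The leader waits for ACTIVE replicas and hands the update to RECOVERING replicas without waiting on them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;

// Hypothetical sketch; none of these types exist in Solr.
class AsyncRecoveryForwarding {
    enum State { ACTIVE, RECOVERING }

    // 'sender' is a stand-in for the real request sent to the replica.
    static final class Replica {
        final String name;
        final State state;
        final Consumer<String> sender;

        Replica(String name, State state, Consumer<String> sender) {
            this.name = name;
            this.state = state;
            this.sender = sender;
        }
    }

    // Sends 'doc' to every replica on the pool. Blocks until all ACTIVE
    // replicas have finished; sends to RECOVERING replicas are not waited on.
    // Returns the still-pending futures for the recovering replicas.
    static List<CompletableFuture<Void>> forward(
            String doc, List<Replica> replicas, ExecutorService pool) {
        List<CompletableFuture<Void>> mustWait = new ArrayList<>();
        List<CompletableFuture<Void>> background = new ArrayList<>();
        for (Replica r : replicas) {
            CompletableFuture<Void> f =
                CompletableFuture.runAsync(() -> r.sender.accept(doc), pool);
            if (r.state == State.ACTIVE) {
                mustWait.add(f);
            } else {
                background.add(f);
            }
        }
        // Only the healthy replicas hold up the update request.
        CompletableFuture.allOf(mustWait.toArray(new CompletableFuture[0])).join();
        return background;
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newCachedThreadPool();
        // The latch simulates a recovering replica that is slow to respond.
        CountDownLatch gate = new CountDownLatch(1);
        List<Replica> replicas = List.of(
            new Replica("replica1", State.ACTIVE, doc -> { }),
            new Replica("replica2", State.RECOVERING, doc -> {
                try {
                    gate.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }));

        List<CompletableFuture<Void>> pending = forward("doc-1", replicas, pool);
        System.out.println("forward() returned; recovering send finished: "
            + pending.get(0).isDone());

        gate.countDown();       // let the slow replica finish
        pending.get(0).join();
        pool.shutdown();
    }
}
```

A real change would also have to decide what happens when a background send fails, and whether the recovering replica can rely on picking up the missed update during recovery instead; the sketch leaves both open.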



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
