
Posted to dev@lucene.apache.org by "Yonik Seeley (JIRA)" <ji...@apache.org> on 2015/11/11 15:39:11 UTC

[jira] [Commented] (SOLR-8225) Leader should send update requests to replicas in recovery asynchronously

    [ https://issues.apache.org/jira/browse/SOLR-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15000442#comment-15000442 ] 

Yonik Seeley commented on SOLR-8225:
------------------------------------

bq. What I'm seeing is that the recovering node is still slow to respond to the leader (at least slower than the healthy replicas).

Hmmm, that's interesting.  Any pointers as to why?  A recovering replica actually has a lot less work to do per update (we just buffer in the tlog).
Perhaps it's the IO bandwidth being taken up by the index replication?

Sending async will introduce some complexities around the replica becoming active. Right now, the replica itself knows when it can become active: after it has finished replicating the index and replaying all buffered updates.  With an async send, that would no longer be the case.
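
One way to read that concern, as a rough sketch rather than actual Solr code: today the replica can decide locally, but with an async send there may be updates the leader has already sent that the replica has not yet received. Every name below (RecoveryActivationSketch, RecoveringReplica, leaderLastSentVersion) is hypothetical, and the monotonically increasing version numbers are an assumption made only to keep the example small.

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class RecoveryActivationSketch {

    /** Toy model of a replica in recovery; not a Solr class. */
    static final class RecoveringReplica {
        private final Queue<Long> buffered = new ArrayDeque<>(); // stands in for tlog buffering
        private long lastApplied = 0;
        private boolean indexReplicated = false;

        /** An update from the leader arrives and is only buffered. */
        void receive(long version) {
            buffered.add(version);
        }

        /** Index replication finished; the copied index contains updates up to this version. */
        void finishIndexReplication(long versionInIndex) {
            indexReplicated = true;
            lastApplied = Math.max(lastApplied, versionInIndex);
        }

        /** Apply everything that was buffered during recovery. */
        void replayBuffered() {
            while (!buffered.isEmpty()) {
                lastApplied = Math.max(lastApplied, buffered.poll());
            }
        }

        /** Current rule: the replica decides using only what it knows locally. */
        boolean canBecomeActiveLocally() {
            return indexReplicated && buffered.isEmpty();
        }

        /**
         * Hypothetical rule for async sends: the replica also needs to learn
         * the last version the leader has sent, because some updates may
         * still be in flight.
         */
        boolean canBecomeActive(long leaderLastSentVersion) {
            return canBecomeActiveLocally() && lastApplied >= leaderLastSentVersion;
        }
    }

    public static void main(String[] args) {
        RecoveringReplica replica = new RecoveringReplica();
        replica.finishIndexReplication(10);
        replica.receive(11);
        replica.replayBuffered();
        // Suppose the leader has sent version 12 asynchronously and it has not arrived yet.
        System.out.println("local check: " + replica.canBecomeActiveLocally());
        System.out.println("with leader's last sent version: " + replica.canBecomeActive(12));
    }
}
```

In this toy model the local check passes while an update is still in flight, which is the kind of gap a real design would have to close somehow.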

> Leader should send update requests to replicas in recovery asynchronously
> -------------------------------------------------------------------------
>
>                 Key: SOLR-8225
>                 URL: https://issues.apache.org/jira/browse/SOLR-8225
>             Project: Solr
>          Issue Type: Improvement
>          Components: SolrCloud
>            Reporter: Timothy Potter
>
> When a replica goes into recovery, the leader still sends docs to that replica while it is recovering. What I'm seeing is that the recovering node is still slow to respond to the leader (at least slower than the healthy replicas). Thus it would be good if the leader could send the updates to the recovering replica asynchronously. That is, the leader would block as it does today when forwarding updates to healthy / active replicas, but send updates to recovering replicas async, preventing the whole update request from being slowed down by a potentially degraded replica.
> FWIW - I've actually seen this occur in an environment that has more than 3 replicas per shard. One of the replicas went into recovery and then was much slower to handle requests than the healthy replicas, but the leader had to wait for the slowest replica.
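
The forwarding behavior proposed in the issue description could be sketched roughly as follows. This is an illustration under stated assumptions, not Solr's distributed update code: LeaderFanOutSketch, Replica, ReplicaState and forward are hypothetical names, and the transport is modeled as a plain Consumer<String> instead of a real HTTP request.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.function.Consumer;

public class LeaderFanOutSketch {

    enum ReplicaState { ACTIVE, RECOVERING }

    /** Hypothetical replica handle: a state plus a way to deliver an update. */
    static final class Replica {
        final ReplicaState state;
        final Consumer<String> transport;

        Replica(ReplicaState state, Consumer<String> transport) {
            this.state = state;
            this.transport = transport;
        }
    }

    /**
     * Sends the update to every replica, blocks until all ACTIVE replicas
     * have handled it, and returns the still-pending futures for the
     * RECOVERING replicas so the caller can track them.
     */
    static List<CompletableFuture<Void>> forward(
            String update, List<Replica> replicas, Executor pool) {
        List<CompletableFuture<Void>> mustWait = new ArrayList<>();
        List<CompletableFuture<Void>> inFlight = new ArrayList<>();
        for (Replica r : replicas) {
            CompletableFuture<Void> sent =
                    CompletableFuture.runAsync(() -> r.transport.accept(update), pool);
            if (r.state == ReplicaState.ACTIVE) {
                mustWait.add(sent);
            } else {
                inFlight.add(sent);
            }
        }
        // Same blocking behavior as today, but only for healthy replicas.
        CompletableFuture.allOf(mustWait.toArray(new CompletableFuture[0])).join();
        return inFlight;
    }
}
```

Returning the pending futures instead of discarding them is a deliberate choice in this sketch: something still has to notice when an async send to a recovering replica fails or lags behind.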


