Posted to dev@lucene.apache.org by "Shalin Shekhar Mangar (JIRA)" <ji...@apache.org> on 2015/07/27 13:49:04 UTC

[jira] [Updated] (SOLR-7819) ZkController.ensureReplicaInLeaderInitiatedRecovery does not respect retryOnConnLoss

     [ https://issues.apache.org/jira/browse/SOLR-7819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shalin Shekhar Mangar updated SOLR-7819:
----------------------------------------
    Attachment: SOLR-7819.patch

Here's a patch which:
# Adds a retryOnConnLoss parameter to ZkController's ensureReplicaInLeaderInitiatedRecovery, updateLeaderInitiatedRecoveryState and markShardAsDownIfLeader methods.
# Starts a LIR thread if the leader cannot mark the replica as down due to connection loss. Previously, either a session loss or a connection loss would skip starting the LIR thread.
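The retry-on-connection-loss behavior described above can be sketched generically. This is a minimal, self-contained illustration with hypothetical names (not actual Solr code): when retryOnConnLoss is true the ZK operation is retried after a connection loss; when false the exception surfaces immediately so indexing threads do not block.

```java
import java.util.concurrent.Callable;

// Hypothetical sketch of a retryOnConnLoss-aware ZK operation wrapper.
// None of these names come from Solr; they only illustrate the pattern.
public class RetryOnConnLossSketch {

    static class ConnectionLossException extends RuntimeException {}

    static <T> T runZkOp(Callable<T> op, boolean retryOnConnLoss,
                         int maxRetries) throws Exception {
        int attempts = 0;
        while (true) {
            try {
                return op.call();
            } catch (ConnectionLossException e) {
                if (!retryOnConnLoss || ++attempts >= maxRetries) {
                    // Fail fast: the caller decides what to do next
                    // (e.g. start a LIR thread instead of hanging).
                    throw e;
                }
                Thread.sleep(100); // back off before retrying
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulated ZK op: fails twice with connection loss, then succeeds.
        int[] calls = {0};
        Callable<String> flaky = () -> {
            if (calls[0]++ < 2) throw new ConnectionLossException();
            return "ok";
        };
        System.out.println(runZkOp(flaky, true, 5)); // retries, prints "ok"

        calls[0] = 0;
        try {
            runZkOp(flaky, false, 5); // no retry: throws on first failure
        } catch (ConnectionLossException e) {
            System.out.println("connection loss surfaced to caller");
        }
    }
}
```

With retryOnConnLoss=false, the indexing thread gets the exception back in milliseconds instead of hanging for the ZK client timeout, which is the point of the patch.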

I'm still running Solr's integration and jepsen tests.

This causes a subtle change in behavior which is best analyzed with two different scenarios:
# The leader fails to send an update to a replica and also suffers a temporary blip in its ZK connection during DistributedUpdateProcessor's doFinish method:
## Currently, a few indexing threads will hang but eventually succeed in marking the replica as 'down', and the leader will start a new LIR thread to ask the replica to recover.
## With this patch, the indexing threads do not hang; a connection loss exception is thrown instead. At this point, we start a new LIR thread to ask the replica to recover. Although this removes the safety of explicitly marking the replica as 'down', the LIR thread still provides a timeout-based safety net ensuring that the replica recovers from the leader.
# The leader fails to send an update to a replica and also suffers a long network partition between itself and the ZK server during the DUP.doFinish method:
## Currently, a few indexing threads will hang in ZkController.ensureReplicaInLeaderInitiatedRecovery until the ZK operations time out due to connection loss or session loss, and no LIR thread will be created. This seems okay because the connection loss timeout is higher than the ZK session expiration time, and a session loss means that ZK has already determined that our session has expired. In both cases, a new leader election should have happened and there's no need to mark the replica as 'down'.
## With this patch, the difference is that the indexing threads do not hang and ensureReplicaInLeaderInitiatedRecovery returns immediately with a connection loss exception. A new LIR thread *is* started in this scenario. This is also fine because we were not able to mark the replica as 'down' and we aren't sure that the session has expired, so it is important that we start the LIR thread to ask the replica to recover. Even if a new leader has been elected, there's no major harm in asking the replica to recover.
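The decision described in the two scenarios above can be condensed into a small table of outcomes. This is a hedged sketch with hypothetical names, not Solr's actual control flow: connection loss now starts the LIR thread (the session may still be valid), while session loss still skips it (the session has expired, so a new leader election has happened).

```java
// Hypothetical sketch of the post-patch LIR decision. The enum values and
// method name are illustrative only; they do not appear in Solr's code.
public class LirDecisionSketch {

    enum ZkFailure { NONE, CONNECTION_LOSS, SESSION_LOSS }

    static boolean shouldStartLirThread(ZkFailure failure) {
        switch (failure) {
            case NONE:
                // Replica was marked 'down' successfully; ask it to recover.
                return true;
            case CONNECTION_LOSS:
                // Patch behavior: the session may still be valid, so rely on
                // the LIR thread's timeout-based safety to force recovery.
                return true;
            case SESSION_LOSS:
                // Session expired: ZK has decided, a new leader election has
                // happened, so this node should not initiate recovery.
                return false;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(shouldStartLirThread(ZkFailure.CONNECTION_LOSS)); // true
        System.out.println(shouldStartLirThread(ZkFailure.SESSION_LOSS));    // false
    }
}
```

The asymmetry is the crux: on connection loss the leader cannot know whether it is still the leader, so starting LIR is the safe, idempotent choice; on session loss it knows it is not.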

So, net-net this patch doesn't seem to introduce any new problems in the system.

> ZkController.ensureReplicaInLeaderInitiatedRecovery does not respect retryOnConnLoss
> ------------------------------------------------------------------------------------
>
>                 Key: SOLR-7819
>                 URL: https://issues.apache.org/jira/browse/SOLR-7819
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 5.2, 5.2.1
>            Reporter: Shalin Shekhar Mangar
>              Labels: Jepsen
>             Fix For: 5.3, Trunk
>
>         Attachments: SOLR-7819.patch
>
>
> SOLR-7245 added a retryOnConnLoss parameter to ZkController.ensureReplicaInLeaderInitiatedRecovery so that indexing threads do not hang during a partition on ZK operations. However, some of those changes were unintentionally reverted by SOLR-7336 in 5.2.
> I found this while running Jepsen tests on 5.2.1 where a hung update managed to put a leader into a 'down' state (I'm still investigating and will open a separate issue about this problem).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org