Posted to dev@lucene.apache.org by "Timothy Potter (JIRA)" <ji...@apache.org> on 2014/07/24 22:12:39 UTC

[jira] [Updated] (SOLR-6236) Need an optional fallback mechanism for selecting a leader when all replicas are in leader-initiated recovery.

     [ https://issues.apache.org/jira/browse/SOLR-6236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Timothy Potter updated SOLR-6236:
---------------------------------

    Attachment: SOLR-6236.patch

Here's a rough sketch of progress so far on this issue ... a few key points about this problem and the patch:

At a high level, the issue boils down to giving SolrCloud operators a way to either a) manually force a leader to be elected, or b) set an optional configuration property that triggers the force-leader behavior after a configurable number of recovery attempts have failed because there is no leader. So this can be considered an optional availability-over-consistency mode with respect to leader failover.

This patch solves the case where replicas in leader-initiated recovery keep failing to recover because there is no leader. This can occur if a replica is put into leader-initiated recovery and then the leader dies before the replica(s) recover. Currently, the replica will never get out of recovery, i.e. it will end up in a loop of recovery -> recovery_failed. It will never become the leader and the shard will be offline. This is most likely to occur with collections using RF=2.

The patch leverages a special value for the leader-initiated recovery znode, "force_leader", which can be set manually (using ZK directly) or automatically once a configurable threshold of recovery failures, *forceLeaderFailedRecoveryThreshold*, is exceeded.
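
To make the manual route concrete, here's a rough sketch of what setting that value could look like with SolrZkClient. The znode path layout, the plain-string payload, and all of the names in the example are assumptions for illustration, not something this patch dictates:

{code}
// Sketch only: an operator writing the special marker into a replica's
// leader-initiated recovery znode. Path layout and payload are assumptions.
import org.apache.solr.common.cloud.SolrZkClient;

public class ForceLeaderZnodeExample {
  public static void main(String[] args) throws Exception {
    String zkHost = "localhost:2181";      // assumed ZK ensemble address
    String collection = "collection1";     // assumed collection
    String shard = "shard1";
    String coreNodeName = "core_node2";    // the replica to promote

    // Assumed path: /collections/<collection>/leader_initiated_recovery/<shard>/<coreNodeName>
    String lirPath = "/collections/" + collection + "/leader_initiated_recovery/"
        + shard + "/" + coreNodeName;

    SolrZkClient zkClient = new SolrZkClient(zkHost, 30000);
    try {
      // The replica that owns this znode would then try to force itself to become leader.
      zkClient.setData(lirPath, "force_leader".getBytes("UTF-8"), true);
    } finally {
      zkClient.close();
    }
  }
}
{code}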

Once a replica sees that it is in the "force_leader" state, it will try to force itself to become the leader. As Mark noted in the discussion quoted below, this behavior is disabled by default.
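
To illustrate the replica-side check, here's a rough sketch (again, not the patch itself; the helper name, path layout, and the assumption that the znode stores a plain string are mine):

{code}
// Sketch only: how a replica could decide that it has been asked to force leadership.
import org.apache.solr.common.cloud.SolrZkClient;

public class ForceLeaderCheckExample {

  // Returns true if this replica's LIR znode holds the "force_leader" marker,
  // i.e. it should put itself forward as leader even though it is in LIR.
  static boolean shouldForceLeader(SolrZkClient zkClient, String collection,
                                   String shardId, String coreNodeName) throws Exception {
    String lirPath = "/collections/" + collection + "/leader_initiated_recovery/"
        + shardId + "/" + coreNodeName;
    if (zkClient.exists(lirPath, null, true) == null) {
      return false; // not in leader-initiated recovery at all
    }
    byte[] data = zkClient.getData(lirPath, null, null, true);
    return data != null && "force_leader".equals(new String(data, "UTF-8"));
  }
}
{code}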

The HttpPartitionTest has been updated to verify this scenario, see: testRf3WithForcedLeaderElection(). The idea there is to partition off replicas and then kill the leader before the partitioned replicas have recovered. Next, I simulate a sys admin manually setting the leader-initiated recovery znode for one of the replicas to "force_leader", at which point the replica forces itself to become the leader and then the other replica recovers against it. To be clear, you'll see that doc #2 is lost in this test and only docs #1 and #3 exist in the recovered replicas, which is why this behavior must be disabled by default.

When relying on *forceLeaderFailedRecoveryThreshold* > 0 to have the force-leader process kick in automatically, we have to pick one of the replicas in leader-initiated recovery. For this, I chose the replica with the latest ctime on its leader-initiated recovery znode, i.e. the one that was added to recovery last. I also check to make sure the node hosting that replica is "live". See: ElectionContext#checkForceLeader
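
Roughly, the selection looks like the following sketch (not the actual patch code from ElectionContext#checkForceLeader; the path layout and helper usage are assumptions):

{code}
// Sketch only: among the replicas that have a leader-initiated recovery znode,
// pick the one whose znode was created last (latest ctime), but only if the
// node hosting it is still live.
import java.util.List;

import org.apache.solr.common.cloud.ClusterState;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.Slice;
import org.apache.solr.common.cloud.SolrZkClient;
import org.apache.zookeeper.data.Stat;

public class ForceLeaderCandidateExample {

  // Returns the coreNodeName of the live replica added to LIR last, or null if none qualifies.
  static String pickCandidate(SolrZkClient zkClient, ClusterState clusterState,
                              String collection, String shardId) throws Exception {
    String lirBase = "/collections/" + collection + "/leader_initiated_recovery/" + shardId;
    Slice slice = clusterState.getSlice(collection, shardId);
    if (slice == null) {
      return null;
    }

    String candidate = null;
    long latestCtime = Long.MIN_VALUE;
    for (String coreNodeName : zkClient.getChildren(lirBase, null, true)) {
      Stat stat = zkClient.exists(lirBase + "/" + coreNodeName, null, true);
      if (stat == null) {
        continue;
      }
      Replica replica = slice.getReplica(coreNodeName);
      boolean live = replica != null && clusterState.liveNodesContain(replica.getNodeName());
      if (live && stat.getCtime() > latestCtime) {
        latestCtime = stat.getCtime();
        candidate = coreNodeName;
      }
    }
    return candidate;
  }
}
{code}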

That covers replicas in leader-initiated recovery, but you can also end up unable to select a leader when replicas are not in leader-initiated recovery. For instance, imagine a scenario where a replica loses its connection to ZK; once the connection is re-established, that replica puts itself into the down state and then tries to recover. If the leader is lost before the ZK connection problem on the replica is sorted out, then the replica will end up failing to recover. For now, I'm just adding the leader-initiated recovery znode for that replica, which then activates the process described above; a sys admin could also do that manually. That said, I still need to flesh this scenario out in more detail.

So at this point, the patch gives us something to discuss (or at the very least poke fun at) to harden the leader election process when replicas need to recover and there is no leader to recover from.

> Need an optional fallback mechanism for selecting a leader when all replicas are in leader-initiated recovery.
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-6236
>                 URL: https://issues.apache.org/jira/browse/SOLR-6236
>             Project: Solr
>          Issue Type: Improvement
>          Components: SolrCloud
>            Reporter: Timothy Potter
>            Assignee: Timothy Potter
>         Attachments: SOLR-6236.patch
>
>
> Offshoot from discussion in SOLR-6235, key points are:
> Tim: In ElectionContext, when running shouldIBeLeader, the node will choose to not be the leader if it is in LIR. However, this could lead to no leader. My thinking there is the state is bad enough that we would need manual intervention to clear one of the LIR znodes to allow a replica to get past this point. But maybe we can do better here?
> Shalin: Good question. With careful use of minRf, the user can retry operations and maintain consistency even if we arbitrarily elect a leader in this case. But most people won't use minRf and don't care about consistency as much as availability. For them there should be a way to get out of this mess easily. We can have a collection property (boolean + timeout value) to force elect a leader even if all shards were in LIR. What do you think?
> Mark: Indeed, it's a current limitation that you can have all nodes in a shard thinking they cannot be leader, even when all of them are available. This is not required by the distributed model we have at all, it's just a consequence of being over restrictive on the initial implementation - if all known replicas are participating, you should be able to get a leader. So I'm not sure if this case should be optional. But iff not all known replicas are participating and you still want to force a leader, that should be optional - I think it should default to false though. I think the system should default to reasonable data safety in these cases.
> How best to solve this, I'm not quite sure, but happy to look at a patch. How do you plan on monitoring and taking action? Via the Overseer? It seems tricky to do it from the replicas.
> Tim: We have a similar issue where a replica attempting to be the leader needs to wait a while to see other replicas before declaring itself the leader, see ElectionContext around line 200:
> int leaderVoteWait = cc.getZkController().getLeaderVoteWait();
> if (!weAreReplacement) {
>   waitForReplicasToComeUp(weAreReplacement, leaderVoteWait);
> }
> So one quick idea might be to have the code that checks if it's in LIR see if all replicas are in LIR and if so, wait out the leaderVoteWait period and check again. If all are still in LIR, then move on with becoming the leader (in the spirit of availability).
> {quote}
> But iff not all known replicas are participating and you still want to force a leader, that should be optional - I think it should default to false though. I think the system should default to reasonable data safety in these cases.
> {quote}
> Shalin: That's the same case as the leaderVoteWait situation and we do go ahead after that amount of time even if all replicas aren't participating. Therefore, I think that we should handle it the same way. But to help people who care about consistency over availability, there should be a configurable property which bans this auto-promotion completely.
> In any case, we should switch to coreNodeName instead of coreName and open an issue to improve the leader election part.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org