Posted to dev@lucene.apache.org by "Erick Erickson (JIRA)" <ji...@apache.org> on 2014/12/17 04:48:13 UTC
[jira] [Updated] (SOLR-6691) REBALANCELEADERS needs to change the leader election queue.

     [ https://issues.apache.org/jira/browse/SOLR-6691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Erick Erickson updated SOLR-6691:
---------------------------------
    Attachment: BalanceLeaderTester.java
                SOLR-6691.patch

OK, I think this is finally working as I expect. The attached java file is a stand-alone program that stresses the heck out of shard leader election. The idea is that you fire it up against a collection and it
1> takes the initial state
2> tries to issue the preferred leader command to a random replica on each shard.
3> issues the rebalanceleaders command
4> verifies that all the shard leader election queues have exactly one entry for each of the nodes that were there originally.
5> verifies that the actual leader is the preferred leader
6> goes to <2>.
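
Steps 2 through 5 above can be sketched roughly as follows. This is not the attached BalanceLeaderTester.java; it is a minimal illustration with hypothetical class and method names. It assumes the Collections API actions ADDREPLICAPROP (with property=preferredLeader) and REBALANCELEADERS, and it assumes collection, shard and replica names need no URL encoding. The two checks operate on plain collections that a real tester would have to fill in from ZooKeeper and the cluster state.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Solr or of the attached tester.
class RebalanceCheckSketch {

    /** Step 2: Collections API URL that marks one replica as preferredLeader. */
    static String preferredLeaderUrl(String baseUrl, String collection,
                                     String shard, String replica) {
        return baseUrl + "/admin/collections?action=ADDREPLICAPROP"
                + "&collection=" + collection
                + "&shard=" + shard
                + "&replica=" + replica
                + "&property=preferredLeader&property.value=true";
    }

    /** Step 3: Collections API URL that asks Solr to rebalance the leaders. */
    static String rebalanceUrl(String baseUrl, String collection) {
        return baseUrl + "/admin/collections?action=REBALANCELEADERS"
                + "&collection=" + collection;
    }

    /** Step 4: the queue holds every original node exactly once and nothing else. */
    static boolean queueIntact(Set<String> originalNodes, List<String> queue) {
        // Same size plus same distinct members rules out duplicates and strays.
        return queue.size() == originalNodes.size()
                && new HashSet<>(queue).equals(originalNodes);
    }

    /** Step 5: on every shard the actual leader is the preferred leader. */
    static boolean leadersMatch(Map<String, String> preferredByShard,
                                Map<String, String> actualByShard) {
        return preferredByShard.equals(actualByShard);
    }

    public static void main(String[] args) {
        // Base URL and names below are placeholders for illustration only.
        String base = "http://localhost:8983/solr";
        System.out.println(preferredLeaderUrl(base, "collection1", "shard1", "core_node2"));
        System.out.println(rebalanceUrl(base, "collection1"));
        Set<String> original = new HashSet<>(Arrays.asList("n1", "n2", "n3"));
        System.out.println(queueIntact(original, Arrays.asList("n3", "n1", "n2")));
    }
}
```

A real loop would issue the two URLs over HTTP, wait for the election to settle, run both checks, and then start again from step 2.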

Note that the guts of this test are in the new unit test.

I had to change the leader election code to make all this predictable, and that makes me a little nervous given how difficult it was to get working in the first place. But the external test code beats _all_ the leader election code up pretty fiercely, which gives me hope.

So I have a couple of options here:
1> go ahead and check it in. 5.0 appears to be receding here so it has some time to bake before release
2> check it in to trunk and let it bake there for a while, perhaps until after 5.0 is cut, then merge and bake.

Opinions?

> REBALANCELEADERS needs to change the leader election queue.
> -----------------------------------------------------------
>
>                 Key: SOLR-6691
>                 URL: https://issues.apache.org/jira/browse/SOLR-6691
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>         Attachments: BalanceLeaderTester.java, SOLR-6691.patch
>
>
> The original code (SOLR-6517) assumed that changes in the clusterstate after issuing a command to the overseer to change the leader indicated that the leader was successfully changed. Fortunately, Noble clued me in that this isn't the case and that the potential leader needs to insert itself in the leader election queue before triggering the change leader command.
> Inserting themselves in the front of the queue should probably happen in BALANCESHARDUNIQUE when the preferredLeader property is assigned as well.
> [~noble.paul] Do evil things happen if a node joins at the head but it's _already_ in the queue? These ephemeral nodes in the queue are watching each other. So if node1 is the leader you have
> node1 <- node2 <- node3 <- node4
> where <- means "watches".
> Now, if node3 puts itself at the head of the list, you have
> {code}
> node1 <- node2
>       <- node3 <- node4
> {code}
> I _think_ when I was looking at this it all "just worked". 
> 1> node 1 goes down. Nodes 2 and 3 duke it out but there's code to ensure that node3 becomes the leader and node2 inserts itself at the end so it's watching node 4.
> 2> node 2 goes down, nobody gets notified and it doesn't matter.
> 3> node 3 goes down, node 4 gets notified and starts watching node 2 by inserting itself at the end of the list.
> 4> node 4 goes down, nobody gets notified and it doesn't matter.
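
The four cases above can be checked against a toy model of the watch graph. The sketch below is not Solr's leader election code; the class and method names are hypothetical. It only records who watches whom after node3 has put itself at the head, and assumes each case starts from that same state. It reports which nodes would be notified when a given node goes away, and it does not model the re-queueing that follows (node2 or node4 moving to the end).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model, not Solr's LeaderElector.
class WatchChainSketch {

    /** Watch graph after node3 re-joins at the head: each key watches its value. */
    static Map<String, String> afterNode3JoinsAtHead() {
        Map<String, String> watches = new LinkedHashMap<>();
        watches.put("node2", "node1"); // unchanged from the original chain
        watches.put("node3", "node1"); // node3 now watches the leader directly
        watches.put("node4", "node3"); // node4 still watches node3
        return watches;
    }

    /** Nodes whose watch fires when the given node goes down, sorted by name. */
    static List<String> notifiedWhenDown(Map<String, String> watches, String down) {
        List<String> notified = new ArrayList<>();
        for (Map.Entry<String, String> e : watches.entrySet()) {
            if (e.getValue().equals(down)) {
                notified.add(e.getKey());
            }
        }
        Collections.sort(notified);
        return notified;
    }

    public static void main(String[] args) {
        Map<String, String> watches = afterNode3JoinsAtHead();
        for (String node : new String[] {"node1", "node2", "node3", "node4"}) {
            System.out.println(node + " down, notified: " + notifiedWhenDown(watches, node));
        }
    }
}
```

In this model nothing watches node2 or node4, which matches cases 2 and 4, while node1 going down wakes both node2 and node3, which is the contention described in case 1.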



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)