Posted to issues@solr.apache.org by "Mark Robert Miller (Jira)" <ji...@apache.org> on 2021/10/03 23:24:00 UTC
[jira] [Comment Edited] (SOLR-15672) Leader Election is flawed.

    [ https://issues.apache.org/jira/browse/SOLR-15672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17423739#comment-17423739 ] 

Mark Robert Miller edited comment on SOLR-15672 at 10/3/21, 11:23 PM:
----------------------------------------------------------------------

h2. Suggestions
 * The LeaderElector class could be modified to contain all leader election code. This would mostly mean moving the ElectionContext into LeaderElector. It would have start or join to enter an election and close to stop or cancel it; calling start or join again would go back into the election. You would no longer create more than one LeaderElector for a given shard election: once you had one, you would never create a new instance for that shard.
 * You could move the actual LeaderElector and runLeaderProcess code out of the Watcher event threads and simply have those event threads trigger the correct methods on the LeaderElector. The LeaderElector could have something like a single-thread Executor that actually executes any logic that needs to occur in a different thread. If something was already in progress, you could request it to stop and, once it had stopped, run the requested logic on that thread.
 * You could remove Watchers when appropriate. This is useful in many other instances beyond leader election. Often, when an object is closed (or in this case cancelled), the Watcher involved with that object may still be active. close/cancel could call removeWatcher.
 * You could simplify the various methods involved so they are easier to understand. Most of these methods score very high on common metrics for evaluating understandability and testability.
 * You could add better targeted testing using code coverage tools, Mockito, and Awaitility. Failure and connection-loss/expiration behavior is not tested well by our suite of tests, and even with effort on that, it's difficult to verify that the election code is correct in all cases.
 * You could pull in Curator for the raw ZK election itself. It can be tricky to use Curator in just a single location, as it adds another ZK client, which has some ramifications and, to a less important degree, costs. If it's for an isolated enough situation, though, that may be just fine. In a larger move, Curator could be brought in incrementally by keeping our SolrZkClient and backing it with a Curator instance that can also be obtained from that client for direct Curator usage and recipes.
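
To make the first two suggestions a bit more concrete, here is a minimal sketch of the shape they describe, using only the JDK. It is not Solr's LeaderElector: the class name ElectorSketch, the methods onPredecessorGone, state, history and await, and the epoch counter are all hypothetical names invented for illustration, and the ZooKeeper interaction is left out entirely. It assumes that "request it to stop" can be modelled as a counter that close bumps and that the leader process checks.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch, NOT Solr's LeaderElector: one long-lived elector per shard
// whose election logic runs on a single worker thread instead of on watcher threads.
class ElectorSketch {
    enum State { IDLE, JOINED, LEADER }

    // All state changes happen on this one thread, so they never interleave.
    private final ExecutorService worker = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "elector-worker");
        t.setDaemon(true);
        return t;
    });
    private final AtomicInteger epoch = new AtomicInteger(); // bumped by close()
    private final List<String> history = new CopyOnWriteArrayList<>();
    private volatile State state = State.IDLE; // written only on the worker thread

    /** Enter (or re-enter) the election; the same instance is reused. */
    Future<?> join() {
        return worker.submit(() -> {
            state = State.JOINED;
            history.add("joined");
        });
    }

    /** Called from a watcher event thread: only schedules work, never runs it. */
    Future<?> onPredecessorGone() {
        final int seenEpoch = epoch.get();
        return worker.submit(() -> runLeaderProcess(seenEpoch));
    }

    /** Leave the election; asks any leader process in flight to stop. */
    Future<?> close() {
        epoch.incrementAndGet();
        return worker.submit(() -> {
            state = State.IDLE;
            history.add("cancelled");
        });
    }

    private void runLeaderProcess(int seenEpoch) {
        // Ignore events from a cancelled election or a watcher that outlived close().
        if (state != State.JOINED || epoch.get() != seenEpoch) {
            history.add("ignored");
            return;
        }
        // A real leader process would re-check the epoch between long-running steps.
        state = State.LEADER;
        history.add("leader");
    }

    State state() { return state; }
    List<String> history() { return history; }

    /** Block until a scheduled step has run, without checked exceptions. */
    static void await(Future<?> f) {
        try {
            f.get();
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        ElectorSketch elector = new ElectorSketch();
        await(elector.join());
        await(elector.onPredecessorGone()); // becomes leader
        await(elector.close());             // cancel
        await(elector.onPredecessorGone()); // stale watcher event, ignored
        await(elector.join());              // same instance re-enters
        System.out.println(elector.state() + " " + elector.history());
    }
}
```

The point of the single worker thread is that join, close and the leader process can never run concurrently with each other, so a watcher event that arrives after close is simply ignored rather than racing with the cancel. In a real implementation close would presumably also remove the Watcher, as suggested above, so that such stale events become rare in the first place.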


> Leader Election is flawed. 
> ---------------------------
>
>                 Key: SOLR-15672
>                 URL: https://issues.apache.org/jira/browse/SOLR-15672
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Mark Robert Miller
>            Priority: Major
>
> Filing this not as a work item I’m assigning to myself, but to note an open issue where some notes can accumulate.


