Posted to dev@lucene.apache.org by "Scott Blum (JIRA)" <ji...@apache.org> on 2016/02/19 21:32:18 UTC

[jira] [Comment Edited] (SOLR-8697) Fix LeaderElector issues

    [ https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15154817#comment-15154817 ] 

Scott Blum edited comment on SOLR-8697 at 2/19/16 8:31 PM:
-----------------------------------------------------------

I think part of the general problem with a lot of the ZK-interacting code is a lack of clean separation of concerns.  The relationships between LeaderElector and the various ElectionContext subclasses are pretty gnarly and incestuous.  DistributedQueue had a similar kind of design problem before I extracted the app-specific gnarly parts into OverseerTaskQueue.

Have we considered trying to migrate to, say, Apache Curator (full disclosure: I'm a committer)?  There are a lot of advantages to using third-party libs for some of these common patterns like distributed queue, leader election, or even observing changes in a tree.  Those components tend to be reusable and better documented, have cleaner APIs, and naturally resist spaghetti invasion.  (Examples of the opposite: OverseerNodePrioritizer and RebalanceLeaders are intricately tied to implementation details of LeaderElector.)

A clean, reusable leader election component (with its own tests) that could simply be used in a few different contexts seems like a good place to be longer term.

That said, I hope this patch can simply clean up some of the existing bugs without being too disruptive.




> Fix LeaderElector issues
> ------------------------
>
>                 Key: SOLR-8697
>                 URL: https://issues.apache.org/jira/browse/SOLR-8697
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 5.4.1
>            Reporter: Scott Blum
>              Labels: patch, reliability, solrcloud
>         Attachments: SOLR-8697.patch
>
>
> This patch is still somewhat WIP for a couple of reasons:
> 1) Still debugging test failures.
> 2) This will need more scrutiny from knowledgeable folks!
> There are some subtle bugs with the current implementation of LeaderElector, best demonstrated by the following test:
> 1) Start up a small single-node SolrCloud.  It should become Overseer.
> 2) kill -9 the SolrCloud process and immediately start a new one.
> 3) The new process won't become Overseer.  The old process's ZK leader election node has not yet disappeared, and the new process fails to set appropriate watches.
> NOTE: this is only reproducible if the new node is able to start up and join the election quickly.
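
To make the scenario above concrete, here is a minimal sketch of the predecessor-watch step from the standard ZooKeeper leader election recipe.  It is not Solr's LeaderElector code: the class name, method names, and node names are hypothetical, and it assumes election nodes are ephemeral sequential znodes whose names end in ZooKeeper's zero-padded 10-digit sequence number.  The point is only that a freshly started process whose node sorts after a stale node must watch that stale node, so that it takes over once the old session expires.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/** Hypothetical helper for illustration only; not part of Solr or ZooKeeper. */
final class ElectionSketch {

    /** Parses the trailing sequence number, e.g. "n_0000000007" -> 7. */
    static int sequenceOf(String nodeName) {
        // Assumes the name ends in ZooKeeper's zero-padded 10-digit counter.
        return Integer.parseInt(nodeName.substring(nodeName.length() - 10));
    }

    /**
     * Returns the election node that myNode should watch: the one with the
     * next-lower sequence number.  An empty result means myNode has the
     * lowest sequence number and is therefore the leader.
     */
    static Optional<String> nodeToWatch(List<String> children, String myNode) {
        List<String> sorted = new ArrayList<>(children);
        sorted.sort(Comparator.comparingInt(ElectionSketch::sequenceOf));
        int myIndex = sorted.indexOf(myNode);
        if (myIndex < 0) {
            // Our own node is gone (e.g. session expired); caller must rejoin.
            throw new IllegalStateException("own election node is missing: " + myNode);
        }
        return myIndex == 0 ? Optional.empty() : Optional.of(sorted.get(myIndex - 1));
    }
}
```

With children "stale_n_0000000003" (left behind by the killed process) and "new_n_0000000004" (the restarted process), nodeToWatch for the new node returns the stale node.  If a watch is set on that node, the new process can re-run the check and claim leadership when the stale node disappears.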



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
