Posted to dev@sling.apache.org by "Stefan Egli (JIRA)" <ji...@apache.org> on 2016/01/25 14:53:39 UTC

[jira] [Commented] (SLING-5435) Decouple processes that depend on cluster leader elections from the cluster leader elections.

    [ https://issues.apache.org/jira/browse/SLING-5435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15115218#comment-15115218 ] 

Stefan Egli commented on SLING-5435:
------------------------------------

[~ianeboston], parts of this concern have already been worked on. During discussions around discovery.oak and discovery.etcd it became clear that both of these had the problem of propagating the 'leader change' information faster than changes propagate in the repository. This opened up the problem of being notified about a leader change before the last changes of a perhaps crashed or shut-down instance have been seen by all remaining instances (this is just one example; another is threading with the {{TopologyEvent}} itself).

This led to the conclusion that such 'fast leader detection mechanisms' require additional synchronization with the repository.

For the new discovery.oak this has been implemented as a separate (SPI) interface called {{ClusterSyncService}}, which can be enabled/disabled via configuration. So you can already run discovery.oak with a fast leader detector without synchronization, except that the application then has to deal with the missing synchronization one way or another.

Sounds like what might be missing is some kind of generic support for the case where this synchronization is disabled in the discovery mechanism. Perhaps it would be useful to group the {{TopologyEventListener}}s into those that want synchronization and those that explicitly don't want it?
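
To make the grouping idea a bit more concrete, here is a rough sketch of how such a dispatcher might look. All names in it ({{TopologyDispatcher}}, {{Listener}}, {{repositoryCaughtUp}}, {{flushIfSynced}}) are hypothetical and not part of the existing discovery API, and the repository check is reduced to a simple boolean supplier. Listeners that opt out of synchronization hear about the new leader immediately; the others are held back until the repository is reported as caught up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

// Hypothetical sketch only: these types are NOT the Sling discovery API.
class TopologyDispatcher {

    /** Hypothetical stand-in for a topology event listener. */
    interface Listener {
        void onLeaderChange(String newLeaderId);
    }

    private final List<Listener> syncedListeners = new ArrayList<>();
    private final List<Listener> unsyncedListeners = new ArrayList<>();

    // Assumption: answers "have all instances seen the old leader's last changes?"
    private final BooleanSupplier repositoryCaughtUp;

    // Leader change that synced listeners have not been told about yet.
    private String pendingLeaderId;

    TopologyDispatcher(BooleanSupplier repositoryCaughtUp) {
        this.repositoryCaughtUp = repositoryCaughtUp;
    }

    /** Registers a listener in one of the two groups. */
    synchronized void register(Listener listener, boolean wantsSync) {
        (wantsSync ? syncedListeners : unsyncedListeners).add(listener);
    }

    /** Called by the fast leader detector as soon as a new leader is known. */
    synchronized void leaderChanged(String newLeaderId) {
        // Listeners that opted out of synchronization are notified right away.
        for (Listener l : unsyncedListeners) {
            l.onLeaderChange(newLeaderId);
        }
        // The others wait until the repository has caught up.
        pendingLeaderId = newLeaderId;
        flushIfSynced();
    }

    /** Called whenever the repository state may have advanced. */
    synchronized void flushIfSynced() {
        if (pendingLeaderId != null && repositoryCaughtUp.getAsBoolean()) {
            String leaderId = pendingLeaderId;
            pendingLeaderId = null;
            for (Listener l : syncedListeners) {
                l.onLeaderChange(leaderId);
            }
        }
    }
}
```

With something like this, a job-handling component could register with {{wantsSync = true}} and keep today's guarantees, while a component that only needs to know who the leader is could register with {{wantsSync = false}} and benefit from the faster election.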

> Decouple processes that depend on cluster leader elections from the cluster leader elections.
> ---------------------------------------------------------------------------------------------
>
>                 Key: SLING-5435
>                 URL: https://issues.apache.org/jira/browse/SLING-5435
>             Project: Sling
>          Issue Type: Improvement
>          Components: General
>            Reporter: Ian Boston
>
> Currently there are many processes in Sling that must complete before a Sling Discovery cluster leader election is declared complete. These processes include things like transferring all Jobs from the old leader to the new leader and waiting for the data to appear visible on the new leader. This introduces an additional overhead to the leader election process which introduces a higher than desirable timeout for elections and heartbeat. This higher than desirable timeout precludes the use of more efficient election and distributed consensus algorithms as implemented in Etcd, Zookeeper or implementations of RAFT.
> If the election could be declared complete leaving individual components to manage their own post election operations (ie decoupling those processes from the election), then faster election or alternative Discovery implementations such as the one implemented on etcd could be used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)