Posted to dev@sling.apache.org by "Stefan Egli (JIRA)" <ji...@apache.org> on 2015/05/05 11:08:08 UTC

[jira] [Comment Edited] (SLING-4627) TOPOLOGY_CHANGED in an eventually consistent repository

    [ https://issues.apache.org/jira/browse/SLING-4627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14528138#comment-14528138 ] 

Stefan Egli edited comment on SLING-4627 at 5/5/15 9:08 AM:
------------------------------------------------------------

I've added [^SLING-4627.patch] with a suggestion for a {{o.a.s.discovery.commons.providers.spi.ConsistencyService}} and its integration into the {{ViewStateManager}}, and would appreciate reviews/comments. Note that {{ConsistencyService}} is a plain interface at this stage and is passed to the {{ViewStateManager}} in the constructor – it can however still be implemented as a service and typically be injected by the {{DiscoveryService}} (which would typically instantiate the {{ViewStateManager}}).
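
For illustration, here is a minimal sketch of what such a plain interface could look like – the method name, the callback style and the {{BaseTopologyView}} import are assumptions made for this comment, the actual shape is the one in the patch:

{code:java}
// Sketch only – the authoritative interface is the one attached in SLING-4627.patch.
package org.apache.sling.discovery.commons.providers.spi;

import org.apache.sling.discovery.commons.providers.BaseTopologyView;

public interface ConsistencyService {

    /**
     * Waits until the underlying (eventually consistent) repository has
     * caught up with the given new view of the local cluster and then
     * invokes the callback – which would typically let the ViewStateManager
     * go ahead and send TOPOLOGY_CHANGED.
     */
    void sync(BaseTopologyView newView, Runnable callback);
}
{code}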

Implementation of {{ConsistencyService}} is yet to follow – the intended algorithm:

 * rule #1: the algorithm addresses the effects of eventual consistency (ie its delays) of the underlying, local repository versus a potentially quicker discovery detection – thus it applies only to the local cluster and does not worry about changes outside of it
 * rule #2: [every TopologyEventListener must stop doing leader/topology-dependent actions, esp towards the repository, as soon as it receives TOPOLOGY_CHANGING|https://issues.apache.org/jira/browse/SLING-3432?focusedCommentId=14492494&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14492494]
 * rule #3: the {{ViewStateManager}} makes use of rule #2: on any TOPOLOGY_CHANGING that affects the local cluster it writes a 'sync token' to a well-known location in a conflict-free way (see the write sketch below this list): the sync token signals to anyone interested that the corresponding instance *has sent* the corresponding topology event. When peers are able to 'read' this 'sync token' they can safely assume they have also read all of that instance's earlier changes (since the repository has to guarantee ordering of a single instance's updates). The sync token has the following structure:
 ** path: {{/var/discovery/commons/syncTokens/<slingId>}}
 ** property: {{status}}
 ** value: {{changing=<localClusterSyncId>}} or {{changed=<localClusterSyncId>}}
 * rule #4: the {{localClusterSyncId}} is defined in {{BaseTopologyView.getLocalClusterSyncId()}} and is thus implementation specific – but it has to uniquely identify a particular incarnation of a local clusterview change (see the javadoc for more details; in discovery.impl this could have been the voting id)
 * rule #5: whenever an instance leaves the local cluster (detected using SLING-4665), the {{ViewStateManager}} invokes this {{ConsistencyService}} to wait for all sync tokens to become visible (either {{changing}} or {{changed}} with the current, expected {{localClusterSyncId}}) – see the waiting sketch below this list
 * rule #6: there are three cases which require special handling – all can be dealt with by waiting a minimal amount of time – the suggestion is to use a {{minEventDelay}}. The actual handling of this, however, is in the {{ConsistencyService}}, so it is implementation specific. Here are the three cases:
 ** the crashed instance cannot have any write-backlog – it just crashed. Any remaining half-finished work will be rolled back (in the Oak case). Yet, the last writes that it did towards the repository might not yet have been read by the local instance – to account for this we need a {{minEventDelay}}
 ** if the instance is the only one left in the cluster, there is no sync token to wait for – hence it would immediately declare itself synced. But in this case too it needs to make sure the potential read-backlog is processed. For this, no surprise, we also use the {{minEventDelay}}
 ** if the instance that left is still alive (eg in a SLING-3432-like pseudo-partitioning case that can happen with any discovery implementation), the instance's repository writes would not be synced in any way with the remaining instances (as it left the cluster view). Even though it would eventually receive a TOPOLOGY_CHANGING and, according to rule #2, stop writing topology-dependent data, in the meantime it could still be doing so. Thus here too a minimal wait time for the partitioned-away instance to take note of the change is required – thus yet again {{minEventDelay}}
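
To illustrate rule #3: a rough sketch – not taken from the patch, with resource types and error handling simplified – of how each instance could write its own sync token via the Sling resource API. Conflict-freeness comes from each instance only ever writing below its own {{slingId}}:

{code:java}
// Sketch only: writes this instance's sync token as per rule #3.
// Assumes a ResourceResolver with write access below /var/discovery/commons.
import org.apache.sling.api.resource.ModifiableValueMap;
import org.apache.sling.api.resource.PersistenceException;
import org.apache.sling.api.resource.Resource;
import org.apache.sling.api.resource.ResourceResolver;
import org.apache.sling.api.resource.ResourceUtil;

public class SyncTokenWriter {

    private static final String SYNCTOKEN_ROOT = "/var/discovery/commons/syncTokens/";

    /**
     * Writes 'changing=<localClusterSyncId>' resp. 'changed=<localClusterSyncId>'
     * to the 'status' property of this instance's own token node. Since every
     * instance only writes below its own slingId, no write conflicts can occur.
     */
    public void writeSyncToken(ResourceResolver resolver, String slingId,
            String phase, String localClusterSyncId) throws PersistenceException {
        Resource token = ResourceUtil.getOrCreateResource(resolver,
                SYNCTOKEN_ROOT + slingId, "nt:unstructured", "nt:unstructured", false);
        ModifiableValueMap props = token.adaptTo(ModifiableValueMap.class);
        props.put("status", phase + "=" + localClusterSyncId);
        resolver.commit();
    }
}
{code}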

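Equally a sketch only (class and method names made up, polling interval arbitrary): the waiting side of rules #5 and #6 in a possible {{ConsistencyService}} implementation – poll until every surviving instance's sync token carries the expected {{localClusterSyncId}}, and in any case wait at least {{minEventDelay}}:

{code:java}
// Sketch only: waits until all surviving instances' sync tokens are visible
// (rule #5) and additionally waits a minimal amount of time (rule #6).
import java.util.Collection;

import org.apache.sling.api.resource.Resource;
import org.apache.sling.api.resource.ResourceResolver;

public class SyncTokenWaiter {

    private static final String SYNCTOKEN_ROOT = "/var/discovery/commons/syncTokens/";

    private final long minEventDelayMillis;
    private final long pollIntervalMillis = 250;

    public SyncTokenWaiter(long minEventDelayMillis) {
        this.minEventDelayMillis = minEventDelayMillis;
    }

    public void waitForSyncTokens(ResourceResolver resolver,
            Collection<String> survivingSlingIds, String localClusterSyncId)
            throws InterruptedException {
        final long start = System.currentTimeMillis();
        while (!allTokensVisible(resolver, survivingSlingIds, localClusterSyncId)) {
            Thread.sleep(pollIntervalMillis);
            resolver.refresh(); // pick up changes arriving via background reads
        }
        // rule #6: even if all tokens are already visible (or there was nothing
        // to wait for), wait at least minEventDelay to cover the read-backlog
        // and pseudo-partitioning cases.
        long elapsed = System.currentTimeMillis() - start;
        if (elapsed < minEventDelayMillis) {
            Thread.sleep(minEventDelayMillis - elapsed);
        }
    }

    private boolean allTokensVisible(ResourceResolver resolver,
            Collection<String> survivingSlingIds, String localClusterSyncId) {
        for (String slingId : survivingSlingIds) {
            Resource token = resolver.getResource(SYNCTOKEN_ROOT + slingId);
            if (token == null) {
                return false;
            }
            String status = token.getValueMap().get("status", String.class);
            // accept both 'changing=<id>' and 'changed=<id>'
            if (status == null || !status.endsWith("=" + localClusterSyncId)) {
                return false;
            }
        }
        return true;
    }
}
{code}
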
[~cziegeler], [~marett], wdyt?


> TOPOLOGY_CHANGED in an eventually consistent repository
> -------------------------------------------------------
>
>                 Key: SLING-4627
>                 URL: https://issues.apache.org/jira/browse/SLING-4627
>             Project: Sling
>          Issue Type: Improvement
>          Components: Extensions
>            Reporter: Stefan Egli
>            Assignee: Stefan Egli
>            Priority: Critical
>         Attachments: SLING-4627.patch, SLING-4627.patch
>
>
> This is a parent ticket describing the +coordination effort needed for properly sending TOPOLOGY_CHANGED when running on top of an eventually consistent repository+. These findings are independent of the implementation details used inside the discovery implementation, so they apply to discovery.impl, discovery.etcd/.zookeeper/.oak etc. Tickets to implement this for specific implementations are best created separately (eg as sub-tasks or related issues). Also note that this assumes immediately sending TOPOLOGY_CHANGING as described [in SLING-3432|https://issues.apache.org/jira/browse/SLING-3432?focusedCommentId=14492494&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14492494]
> h5. The spectrum of possible TOPOLOGY_CHANGED events includes the following scenarios:
> || scenario || classification || action ||
> | A. change is completely outside of local cluster | (/) uncritical | changes outside the cluster are considered uncritical for this exercise. |
> | B. a new instance joins the local cluster; this new instance is by contract not the leader (leader must be stable \[0\]) | (/) uncritical | a join of an instance is uncritical due to the fact that it merely joins the cluster and thus has no 'backlog' of changes that might be propagating through the (eventually consistent) repository. |
> | C. a non-leader *leaves* the local cluster | (x) *critical* | changes that were written by the leaving instance might not yet be *seen* by all surviving instances (ie discovery can be faster than the repository) and this must be ensured before sending out TOPOLOGY_CHANGED. This is because the leaving instance could have written changes that are *topology dependent* and thus those changes must first be settled in the repository before continuing with a *new topology*. |
> | D. the leader *leaves* the local cluster (and thus a new leader is elected) | (x)(x) *very critical* | same as C except that this is more critical due to the fact that the leader left |
> | E. -the leader of the local cluster changes (without leaving)- this is not supported by contract (leader must be stable \[0\]) | (/) -irrelevant- | |
> So both C and D are about an instance leaving. And as mentioned above, the survivors must ensure they have read all changes of the leavers. There are two parts to this:
> * the leaver could have pending writes that are not yet in MongoDB: I don't think this is the case. The only thing that could remain is an uncommitted branch, and that would be rolled back afaik.
> ** An exception to this is a partition, where the leaver didn't actually crash but is still hooked to the repository. *For this I'm not sure how it can be solved* yet.
> * the survivors could, however, not yet have read all changes (pending in the background read), and one way to make sure they did is to have each surviving instance write a (pseudo-)sync token to the repository. Once all survivors have seen the sync tokens of all other survivors, the assumption is that all pending changes have been "flushed" through the eventually consistent repository and that it is safe to send out a TOPOLOGY_CHANGED event.
> * this sync token must be *conflict free* and could be eg: {{/var/discovery/oak/clusterInstances/<slingId>/syncTokens/<newViewId>}} - where {{newViewId}} is defined by whatever discovery mechanism is used
> * a special case is when only one instance is remaining. It can then not wait for any other survivor to send a sync token. In that case sync tokens would not work. All it could then possibly do is to wait for a certain time (which should be larger than any expected background-read duration)
> [~mreutegg], [~chetanm] can you pls confirm/comment on the above "flush/sync token" approach? Thx!
> /cc [~marett]
> \[0\] - see [getLeader() in ClusterView|https://github.com/apache/sling/blob/trunk/bundles/extensions/discovery/api/src/main/java/org/apache/sling/discovery/ClusterView.java]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)