Posted to dev@cassandra.apache.org by Sam Tunnicliffe <sa...@beobal.com> on 2022/08/22 12:45:06 UTC

[DISCUSS] CEP-21: Transactional Cluster Metadata

Hi,

I'd like to open discussion about this CEP: https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-21%3A+Transactional+Cluster+Metadata
Cluster metadata in Cassandra comprises a number of disparate elements including, but not limited to, distributed schema, topology and token ownership. Following the general design principles of Cassandra, the mechanisms for coordinating updates to cluster state have favoured eventual consistency, with probabilistic delivery via gossip being a prime example. Undoubtedly, this approach has benefits, not least in terms of resilience, particularly in highly fluid distributed environments. However, this is not the reality of most Cassandra deployments, where the total number of nodes is relatively small (i.e. in the low thousands) and the rate of change tends to be low.

Historically, a significant proportion of issues affecting operators and users of Cassandra have been due, at least in part, to a lack of strongly consistent cluster metadata. In response to this, we propose a design which aims to provide linearizability of metadata changes whilst ensuring that the effects of those changes are made visible to all nodes in a strongly consistent manner. At its core, it is also pluggable, enabling Cassandra-derived projects to supply their own implementations if desired.

In addition to the CEP document itself, we aim to publish a working prototype of the proposed design. Obviously, this does not implement the entire proposal and there are several parts which remain only partially complete. It does include the core of the system, including a good deal of test infrastructure, so may serve as both illustration of the design and a starting point for real implementation. 


Re: [DISCUSS] CEP-21: Transactional Cluster Metadata

Posted by Derek Chen-Becker <de...@chen-becker.org>.
On Tue, Aug 23, 2022 at 2:51 AM Sam Tunnicliffe <sa...@beobal.com> wrote:

>
> Regular read/write operations should not be halted, even by a total
> failure of the metadata service. There should be no situations where a
> previously stable database becomes entirely unavailable due to a CMS
> failure. The worst case is where there is some unavailability due to
> permanent failure of multiple nodes where those nodes happen to represent a
> majority of the CMS. In this scenario, the CMS would need to be recovered
> before the down nodes could be replaced, so it's possible it would extend
> the period of unavailability, though not necessarily by much.
>

This seems like a reasonable tradeoff. The current approach tries to
achieve better availability at the risk of loss of consistency, but even
then has failure modes that require manual intervention. What I really like
about the proposal is that it's a path toward separating responsibilities
of the cluster into well-defined sub-services.

Cheers,

Derek




>
>
> On 23 Aug 2022, at 05:42, Jeff Jirsa <jj...@gmail.com> wrote:
>
> “ The proposed mechanism for dealing with both of these failure types is
> to enable a manual operator override mode. This would allow operators to
> inject metadata changes (potentially overriding the complete metadata
> state) directly on any and all nodes in a cluster. At the most extreme end
> of the spectrum, this could allow an unrecoverably corrupt state to be
> rectified by composing a custom snapshot of cluster metadata and uploading
> it to all nodes in the cluster”
>
> What do you expect this to look like in practice? JSON representation of
> the ring? Would reads and writes have halted? In what situations would the
> database be entirely unavailable?
>
>
>
> On Aug 22, 2022, at 11:15 AM, Derek Chen-Becker <de...@chen-becker.org>
> wrote:
>
> 
> This looks really interesting; thanks for putting this together! Just so
> I'm clear on CEP nomenclature, having external management of metadata as a
> non-goal doesn't preclude some future use, correct? Coincidentally, I'm
> working on my ApacheCon talk on improving modularity in Cassandra and one
> of the ideas I'm discussing is pluggably (?) replacing gossip with
> something(s) that allow us to externalize some of the complexity of
> maintaining consistency. I need to digest the proposal you've made, but I
> don't see the two ideas being at odds on my first read.
>
> Cheers,
>
> Derek
>
> On Mon, Aug 22, 2022 at 6:45 AM Sam Tunnicliffe <sa...@beobal.com> wrote:
>
>> Hi,
>>
>> I'd like to open discussion about this CEP:
>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-21%3A+Transactional+Cluster+Metadata
>>
>> Cluster metadata in Cassandra comprises a number of disparate elements
>> including, but not limited to, distributed schema, topology and token
>> ownership. Following the general design principles of Cassandra, the
>> mechanisms for coordinating updates to cluster state have favoured eventual
>> consistency, with probabilistic delivery via gossip being a prime example.
>> Undoubtedly, this approach has benefits, not least in terms of resilience,
>> particularly in highly fluid distributed environments. However, this is not
>> the reality of most Cassandra deployments, where the total number of nodes
>> is relatively small (i.e. in the low thousands) and the rate of change
>> tends to be low.
>>
>> Historically, a significant proportion of issues affecting operators and
>> users of Cassandra have been due, at least in part, to a lack of strongly
>> consistent cluster metadata. In response to this, we propose a design which
>> aims to provide linearizability of metadata changes whilst ensuring that
>> the effects of those changes are made visible to all nodes in a strongly
>> consistent manner. At its core, it is also pluggable, enabling
>> Cassandra-derived projects to supply their own implementations if desired.
>> In addition to the CEP document itself, we aim to publish a working
>> prototype of the proposed design. Obviously, this does not implement the
>> entire proposal and there are several parts which remain only partially
>> complete. It does include the core of the system, including a good deal of
>> test infrastructure, so may serve as both illustration of the design and a
>> starting point for real implementation.
>>
>>
>
> --
> +---------------------------------------------------------------+
> | Derek Chen-Becker                                             |
> | GPG Key available at https://keybase.io/dchenbecker and       |
> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
> +---------------------------------------------------------------+
>
>
>

-- 
+---------------------------------------------------------------+
| Derek Chen-Becker                                             |
| GPG Key available at https://keybase.io/dchenbecker and       |
| https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
| Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
+---------------------------------------------------------------+

Re: [DISCUSS] CEP-21: Transactional Cluster Metadata

Posted by Sam Tunnicliffe <sa...@beobal.com>.
> What do you expect this to look like in practice? JSON representation of the ring? Would reads and writes have halted? In what situations would the database be entirely unavailable? 

The format is pretty much TBD, I'm afraid. A JSON representation, or one in an equivalent textual format, is certainly feasible though. The metadata state will include more than just the ring: schema, contact points/membership of the CMS itself, and the state of pending operations are all part of it. Tools analogous to the old sstable2json/json2sstable could be a minimal solution, but I'm sure we can do better.
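
Purely for the sake of illustration, a dump from such a tool might look something like the sketch below. All field names are invented here rather than taken from the CEP, and Map/List toString() output is only JSON-like; a real tool would use a proper JSON library.

    // Illustrative sketch only: what a snapshot-dumping tool in the spirit
    // of sstable2json might emit. Field names are hypothetical.
    import java.util.List;
    import java.util.Map;

    public class MetadataSnapshotDump {
        record Snapshot(long epoch,
                        Map<String, String> schemaVersions,       // keyspace -> schema id
                        Map<String, List<String>> tokenOwnership, // range -> replicas
                        List<String> cmsMembers,                  // CMS contact points
                        List<String> pendingOperations) {}        // in-flight multistep ops

        static String dump(Snapshot s) {
            return """
                   {
                     "epoch": %d,
                     "schema": %s,
                     "tokens": %s,
                     "cms_members": %s,
                     "in_flight_operations": %s
                   }""".formatted(s.epoch(), s.schemaVersions(), s.tokenOwnership(),
                                  s.cmsMembers(), s.pendingOperations());
        }

        public static void main(String[] args) {
            System.out.println(dump(new Snapshot(
                    42,
                    Map.of("ks1", "2207c2a9"),
                    Map.of("(0,100]", List.of("10.0.0.1", "10.0.0.2", "10.0.0.3")),
                    List.of("10.0.0.1", "10.0.0.2", "10.0.0.3"),
                    List.of())));
        }
    }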

Regular read/write operations should not be halted, even by a total failure of the metadata service. There should be no situations where a previously stable database becomes entirely unavailable due to a CMS failure. The worst case is where there is some unavailability due to permanent failure of multiple nodes where those nodes happen to represent a majority of the CMS. In this scenario, the CMS would need to be recovered before the down nodes could be replaced, so it's possible it would extend the period of unavailability, though not necessarily by much.


> On 23 Aug 2022, at 05:42, Jeff Jirsa <jj...@gmail.com> wrote:
> 
> “ The proposed mechanism for dealing with both of these failure types is to enable a manual operator override mode. This would allow operators to inject metadata changes (potentially overriding the complete metadata state) directly on any and all nodes in a cluster. At the most extreme end of the spectrum, this could allow an unrecoverably corrupt state to be rectified by composing a custom snapshot of cluster metadata and uploading it to all nodes in the cluster”
> 
> What do you expect this to look like in practice? JSON representation of the ring? Would reads and writes have halted? In what situations would the database be entirely unavailable? 
> 
> 
> 
>> On Aug 22, 2022, at 11:15 AM, Derek Chen-Becker <de...@chen-becker.org> wrote:
>> 
>> 
>> This looks really interesting; thanks for putting this together! Just so I'm clear on CEP nomenclature, having external management of metadata as a non-goal doesn't preclude some future use, correct? Coincidentally, I'm working on my ApacheCon talk on improving modularity in Cassandra and one of the ideas I'm discussing is pluggably (?) replacing gossip with something(s) that allow us to externalize some of the complexity of maintaining consistency. I need to digest the proposal you've made, but I don't see the two ideas being at odds on my first read. 
>> 
>> Cheers,
>> 
>> Derek
>> 
>> On Mon, Aug 22, 2022 at 6:45 AM Sam Tunnicliffe <sam@beobal.com> wrote:
>> Hi,
>> 
>> I'd like to open discussion about this CEP: https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-21%3A+Transactional+Cluster+Metadata
>> Cluster metadata in Cassandra comprises a number of disparate elements including, but not limited to, distributed schema, topology and token ownership. Following the general design principles of Cassandra, the mechanisms for coordinating updates to cluster state have favoured eventual consistency, with probabilistic delivery via gossip being a prime example. Undoubtedly, this approach has benefits, not least in terms of resilience, particularly in highly fluid distributed environments. However, this is not the reality of most Cassandra deployments, where the total number of nodes is relatively small (i.e. in the low thousands) and the rate of change tends to be low.
>> 
>> Historically, a significant proportion of issues affecting operators and users of Cassandra have been due, at least in part, to a lack of strongly consistent cluster metadata. In response to this, we propose a design which aims to provide linearizability of metadata changes whilst ensuring that the effects of those changes are made visible to all nodes in a strongly consistent manner. At its core, it is also pluggable, enabling Cassandra-derived projects to supply their own implementations if desired.
>> 
>> In addition to the CEP document itself, we aim to publish a working prototype of the proposed design. Obviously, this does not implement the entire proposal and there are several parts which remain only partially complete. It does include the core of the system, including a good deal of test infrastructure, so may serve as both illustration of the design and a starting point for real implementation. 
>> 
>> 
>> 
>> -- 
>> +---------------------------------------------------------------+
>> | Derek Chen-Becker                                             |
>> | GPG Key available at https://keybase.io/dchenbecker and       |
>> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
>> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
>> +---------------------------------------------------------------+
>> 


Re: [DISCUSS] CEP-21: Transactional Cluster Metadata

Posted by Jeff Jirsa <jj...@gmail.com>.
“ The proposed mechanism for dealing with both of these failure types is to enable a manual operator override mode. This would allow operators to inject metadata changes (potentially overriding the complete metadata state) directly on any and all nodes in a cluster. At the most extreme end of the spectrum, this could allow an unrecoverably corrupt state to be rectified by composing a custom snapshot of cluster metadata and uploading it to all nodes in the cluster”

What do you expect this to look like in practice? JSON representation of the ring? Would reads and writes have halted? In what situations would the database be entirely unavailable? 



> On Aug 22, 2022, at 11:15 AM, Derek Chen-Becker <de...@chen-becker.org> wrote:
> 
> 
> This looks really interesting; thanks for putting this together! Just so I'm clear on CEP nomenclature, having external management of metadata as a non-goal doesn't preclude some future use, correct? Coincidentally, I'm working on my ApacheCon talk on improving modularity in Cassandra and one of the ideas I'm discussing is pluggably (?) replacing gossip with something(s) that allow us to externalize some of the complexity of maintaining consistency. I need to digest the proposal you've made, but I don't see the two ideas being at odds on my first read. 
> 
> Cheers,
> 
> Derek
> 
>> On Mon, Aug 22, 2022 at 6:45 AM Sam Tunnicliffe <sa...@beobal.com> wrote:
>> Hi,
>> 
>> I'd like to open discussion about this CEP: https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-21%3A+Transactional+Cluster+Metadata	
>> Cluster metadata in Cassandra comprises a number of disparate elements including, but not limited to, distributed schema, topology and token ownership. Following the general design principles of Cassandra, the mechanisms for coordinating updates to cluster state have favoured eventual consistency, with probabilistic delivery via gossip being a prime example. Undoubtedly, this approach has benefits, not least in terms of resilience, particularly in highly fluid distributed environments. However, this is not the reality of most Cassandra deployments, where the total number of nodes is relatively small (i.e. in the low thousands) and the rate of change tends to be low.
>> 
>> Historically, a significant proportion of issues affecting operators and users of Cassandra have been due, at least in part, to a lack of strongly consistent cluster metadata. In response to this, we propose a design which aims to provide linearizability of metadata changes whilst ensuring that the effects of those changes are made visible to all nodes in a strongly consistent manner. At its core, it is also pluggable, enabling Cassandra-derived projects to supply their own implementations if desired.
>> 
>> In addition to the CEP document itself, we aim to publish a working prototype of the proposed design. Obviously, this does not implement the entire proposal and there are several parts which remain only partially complete. It does include the core of the system, including a good deal of test infrastructure, so may serve as both illustration of the design and a starting point for real implementation. 
>> 
> 
> 
> -- 
> +---------------------------------------------------------------+
> | Derek Chen-Becker                                             |
> | GPG Key available at https://keybase.io/dchenbecker and       |
> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
> +---------------------------------------------------------------+
> 

Re: [DISCUSS] CEP-21: Transactional Cluster Metadata

Posted by Sam Tunnicliffe <sa...@beobal.com>.
Thanks! 
The core of the proposal is around the sequencing of metadata changes and ensuring that they're delivered to/processed by nodes in the right order and at the right time. The actual mechanisms for imposing that order and for maintaining the log are pretty simple to implement. We envision using the existing Paxos machinery by default, but swapping that for an alternative implementation would not be difficult.
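
To sketch what that pluggable seam might look like (interface and names purely illustrative, not taken from the prototype):

    import java.util.concurrent.CompletableFuture;

    // The ordering/log machinery sits behind a small interface; the default
    // implementation would reuse the existing Paxos machinery, but a derived
    // project could substitute, say, an external coordination service.
    interface MetadataLog {
        // Propose an event; the future completes with the epoch assigned to
        // the event once it has been durably committed in order.
        CompletableFuture<Long> append(byte[] event);

        // Read all committed events with epoch > afterEpoch, in commit order.
        Iterable<byte[]> readCommitted(long afterEpoch);
    }

    class PaxosBackedLog implements MetadataLog {
        private long nextEpoch = 1;

        public CompletableFuture<Long> append(byte[] event) {
            // Placeholder: a real implementation would run a Paxos round here
            // to agree the event's position before completing the future.
            return CompletableFuture.completedFuture(nextEpoch++);
        }

        public Iterable<byte[]> readCommitted(long afterEpoch) {
            return java.util.List.of(); // placeholder
        }
    }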


> On 22 Aug 2022, at 19:14, Derek Chen-Becker <de...@chen-becker.org> wrote:
> 
> This looks really interesting; thanks for putting this together! Just so I'm clear on CEP nomenclature, having external management of metadata as a non-goal doesn't preclude some future use, correct? Coincidentally, I'm working on my ApacheCon talk on improving modularity in Cassandra and one of the ideas I'm discussing is pluggably (?) replacing gossip with something(s) that allow us to externalize some of the complexity of maintaining consistency. I need to digest the proposal you've made, but I don't see the two ideas being at odds on my first read. 
> 
> Cheers,
> 
> Derek
> 
> On Mon, Aug 22, 2022 at 6:45 AM Sam Tunnicliffe <sam@beobal.com> wrote:
> Hi,
> 
> I'd like to open discussion about this CEP: https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-21%3A+Transactional+Cluster+Metadata
> Cluster metadata in Cassandra comprises a number of disparate elements including, but not limited to, distributed schema, topology and token ownership. Following the general design principles of Cassandra, the mechanisms for coordinating updates to cluster state have favoured eventual consistency, with probabilistic delivery via gossip being a prime example. Undoubtedly, this approach has benefits, not least in terms of resilience, particularly in highly fluid distributed environments. However, this is not the reality of most Cassandra deployments, where the total number of nodes is relatively small (i.e. in the low thousands) and the rate of change tends to be low.
> 
> Historically, a significant proportion of issues affecting operators and users of Cassandra have been due, at least in part, to a lack of strongly consistent cluster metadata. In response to this, we propose a design which aims to provide linearizability of metadata changes whilst ensuring that the effects of those changes are made visible to all nodes in a strongly consistent manner. At its core, it is also pluggable, enabling Cassandra-derived projects to supply their own implementations if desired.
> 
> In addition to the CEP document itself, we aim to publish a working prototype of the proposed design. Obviously, this does not implement the entire proposal and there are several parts which remain only partially complete. It does include the core of the system, including a good deal of test infrastructure, so may serve as both illustration of the design and a starting point for real implementation. 
> 
> 
> 
> -- 
> +---------------------------------------------------------------+
> | Derek Chen-Becker                                             |
> | GPG Key available at https://keybase.io/dchenbecker and       |
> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
> +---------------------------------------------------------------+
> 


Re: [DISCUSS] CEP-21: Transactional Cluster Metadata

Posted by Derek Chen-Becker <de...@chen-becker.org>.
This looks really interesting; thanks for putting this together! Just so
I'm clear on CEP nomenclature, having external management of metadata as a
non-goal doesn't preclude some future use, correct? Coincidentally, I'm
working on my ApacheCon talk on improving modularity in Cassandra and one
of the ideas I'm discussing is pluggably (?) replacing gossip with
something(s) that allow us to externalize some of the complexity of
maintaining consistency. I need to digest the proposal you've made, but I
don't see the two ideas being at odds on my first read.

Cheers,

Derek

On Mon, Aug 22, 2022 at 6:45 AM Sam Tunnicliffe <sa...@beobal.com> wrote:

> Hi,
>
> I'd like to open discussion about this CEP:
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-21%3A+Transactional+Cluster+Metadata
>
> Cluster metadata in Cassandra comprises a number of disparate elements
> including, but not limited to, distributed schema, topology and token
> ownership. Following the general design principles of Cassandra, the
> mechanisms for coordinating updates to cluster state have favoured eventual
> consistency, with probabilistic delivery via gossip being a prime example.
> Undoubtedly, this approach has benefits, not least in terms of resilience,
> particularly in highly fluid distributed environments. However, this is not
> the reality of most Cassandra deployments, where the total number of nodes
> is relatively small (i.e. in the low thousands) and the rate of change
> tends to be low.
>
> Historically, a significant proportion of issues affecting operators and
> users of Cassandra have been due, at least in part, to a lack of strongly
> consistent cluster metadata. In response to this, we propose a design which
> aims to provide linearizability of metadata changes whilst ensuring that
> the effects of those changes are made visible to all nodes in a strongly
> consistent manner. At its core, it is also pluggable, enabling
> Cassandra-derived projects to supply their own implementations if desired.
> In addition to the CEP document itself, we aim to publish a working
> prototype of the proposed design. Obviously, this does not implement the
> entire proposal and there are several parts which remain only partially
> complete. It does include the core of the system, including a good deal of
> test infrastructure, so may serve as both illustration of the design and a
> starting point for real implementation.
>
>

-- 
+---------------------------------------------------------------+
| Derek Chen-Becker                                             |
| GPG Key available at https://keybase.io/dchenbecker and       |
| https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
| Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
+---------------------------------------------------------------+

Re: [DISCUSS] CEP-21: Transactional Cluster Metadata

Posted by Sam Tunnicliffe <sa...@beobal.com>.
Hi Jacek, 

Thanks for the great questions; they certainly all relate to things we considered, so I hope I can answer them in a coherent way!

>  will it be a replica group explicitly associated with each event

Some (but not all) events are explicitly associated with a replica group (or more accurately with the superset of present and future replica groups). Where this is the case, the acknowledgment of these events by a majority of the involved replicas is required for the "larger" process of which the event is part to make progress. This is not a global watermark though, which would halt progress across the cluster; it only affects the specific multistep operation it is part of.

In such an operation (say a bootstrap or decommission), only one actor is actually in control of moving forward through the process; all the other nodes in the cluster simply apply the relevant metadata updates locally. It is the progress of this primary actor which is gated on the acknowledgments. In the bootstrap case, the joining node itself drives the process.

The joining node will submit the first event to the CMS, which hopefully is accepted (because it would violate no invariants on the cluster metadata) and becomes committed. That joining node will then await notification from the CMS that a majority of the relevant peers have acked the event. Until it receives that, it will not submit the event representing the next step in the operation. By the same mechanism, it will not perform other aspects of its bootstrap until the preceding metadata change is acked (i.e. it won't initiate streaming until the step which adds it to the write groups - making it a pending node - is acked).
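
In rough Java (step and method names invented here for illustration, not lifted from the prototype), the joining node's side of that looks something like:

    import java.util.List;

    class BootstrapDriver {
        interface Cms {
            // Throws if the event would violate a cluster metadata invariant.
            long submit(String event);
            // Blocks until a majority of affected replicas have acked the epoch.
            void awaitMajorityAck(long epoch, List<String> affectedReplicas)
                    throws InterruptedException;
        }

        void join(Cms cms, List<String> affectedReplicas) throws InterruptedException {
            // Step 1: become a pending node, i.e. get added to the write groups.
            long e1 = cms.submit("start-join");
            cms.awaitMajorityAck(e1, affectedReplicas);

            // Only now is it safe to stream data for the acquired ranges.
            // streamRanges();

            // Step 2: start serving reads once streaming has completed.
            long e2 = cms.submit("mid-join");
            cms.awaitMajorityAck(e2, affectedReplicas);

            // Final step completes the join.
            long e3 = cms.submit("finish-join");
            cms.awaitMajorityAck(e3, affectedReplicas);
        }
    }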

Other metadata changes can certainly be occurring while this is going on. Another joining node may start submitting similar events, and as long as the operation is permitted, that process will progress concurrently. In order to ensure that these multistep operations are safe to execute concurrently, we reject submissions which affect ranges already being affected by an in-flight operation. Essentially, you can safely run concurrent bootstraps provided the nodes involved do not share replicated ranges.
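
The concurrency guard can be as simple as an intersection test over the affected ranges of in-flight operations, along these lines (simplified, non-wrapping ranges; real token ranges wrap the ring):

    import java.util.ArrayList;
    import java.util.List;

    class InFlightOperations {
        record Range(long left, long right) {
            boolean intersects(Range o) {
                return left < o.right() && o.left() < right;
            }
        }

        private final List<List<Range>> inFlight = new ArrayList<>();

        // Accept (and register) a new multistep operation only if none of its
        // affected ranges overlaps an operation already in progress.
        synchronized boolean tryStart(List<Range> affected) {
            for (List<Range> op : inFlight)
                for (Range r : op)
                    for (Range a : affected)
                        if (r.intersects(a))
                            return false;
            inFlight.add(affected);
            return true;
        }
    }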


> For multistep actions - are they going to be added all or none? If they are added one by one, can they be interleaved with other multistep actions?

As you can see from the above, the steps are committed to the log one at a time and multistep operations can be interleaved. 

However, at the point of executing the first step, the plan for executing the rest of the steps is known. We persist this in-progress operation plan in the cluster metadata as an effect of executing the first step - note that this is different from actually committing the steps to the log itself; the pending steps do not yet have an order assigned (and may never get one). This persistence of in-progress operations is to enable an operation to be resumed if the node driving it were to restart part way through.
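
Conceptually the persisted plan is just the ordered list of not-yet-committed steps, re-persisted as each one lands. A compact sketch (shape assumed for illustration, not from the CEP):

    import java.util.List;

    // Stored in cluster metadata as an effect of executing the first step,
    // so a restarted driver node can resume from the first unexecuted step.
    // The remaining steps have no log order until they are actually committed.
    record PersistedPlan(String operationId, List<String> remainingSteps) {
        // Called after the head step has been committed and acked.
        PersistedPlan afterHeadCommitted() {
            return new PersistedPlan(operationId,
                    remainingSteps.subList(1, remainingSteps.size()));
        }

        boolean complete() {
            return remainingSteps.isEmpty();
        }
    }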

> What if a node(s) failure prevents progress over the log? For example, we are unable to get a majority of nodes which process an event so we cannot move forward. We cannot remove those nodes though, because the removal will be later in the log and we cannot make progress.


It's the specific operation that's unable to make progress, but other metadata updates can proceed. To make this concrete: you're trying to join a new node to the cluster, but are unable to because some affected replicas are down and so cannot acknowledge one of the steps. If the replicas are temporarily down, bringing them back up would be sufficient to resume the join. If they are permanently unavailable, in order to preserve consistency, you need to cancel the ongoing join, replace them and restart the join from scratch.    

Cancelling an in-progress operation like a join is a matter of reverting the metadata changes made by any of the steps which have already been committed, including the persistence of the aforementioned pending steps. In the proposal, we've suggested an operator should be involved in this, but that would be something trivial like running a nodetool command to submit the necessary event to the log. It may be possible to automate that, but I would prefer to omit it initially, not least to keep the scope manageable. Of course, nothing would preclude an external monitoring system from running the nodetool command if it's trusted to accurately detect such failures.

> This sounds like an implementation of everywhere replication strategy, doesn't it?


It does sound similar, but fear not, it isn't quite the same. The "everywhere" here is limited to the CMS nodes, which are only a small subset of the cluster. Essentially, it just means that all the (current) CMS members own and replicate the entire event log and so when a node joins the CMS it bootstraps the entirety of the current log state (note: this needn't be the full log history, just a snapshot plus subsequent entries). 
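
So joining the CMS amounts to something like the following (shape hypothetical):

    import java.util.List;

    class CmsCatchUp {
        // What a new CMS member fetches: a snapshot of the metadata at some
        // epoch, plus the log entries committed after it - not the full
        // log history.
        record LogState(long snapshotEpoch, byte[] snapshot, List<byte[]> entriesAfter) {}

        void join(LogState state) {
            applySnapshot(state.snapshot());          // full metadata at snapshotEpoch
            for (byte[] entry : state.entriesAfter()) // then replay the tail in order
                applyEntry(entry);
        }

        private void applySnapshot(byte[] snapshot) { /* install state */ }
        private void applyEntry(byte[] entry)       { /* apply one committed event */ }
    }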


> On 6 Sep 2022, at 15:24, Jacek Lewandowski <le...@gmail.com> wrote:
> 
> Hi Sam, this is a great idea and a really well described CEP!
> 
> I have some questions, perhaps they reflect my weak understanding, but maybe you can answer:
> Is it going to work so that each node reads the log individually and tries to catch up in a way that it applies a transition locally once the previous change is confirmed on the majority of the affected nodes, right? If so, will it be a replica group explicitly associated with each event (explicitly mentioned nodes which are affected by the change and a list of those which already applied the change, so that each node individually can make a decision whether to move forward?). If so, can the node skip a transformation which does not affect it and move forward thus making another change concurrently?
> 
> 
> What if a node(s) failure prevents progress over the log? For example, we are unable to get a majority of nodes which process an event so we cannot move forward. We cannot remove those nodes though, because the removal will be later in the log and we cannot make progress. I've read about manual intervention but maybe it can be avoided in some cases for example by adding no more than one pending event to the log?
> 
> For multistep actions - are they going to be added all or none? If they are added one by one, can they be interleaved with other multistep actions?
> 
> Reconfiguration itself occurs using the process that is analogous to "regular" bootstrap and also uses Paxos as a linearizability mechanism, except for there is no concept of "token" ownership in CMS; all CMS nodes own an entire range from MIN to MAX token. This means that during bootstrap, we do not have to split ranges, or have some nodes "lose" a part of the ring...
> 
> This sounds like an implementation of everywhere replication strategy, doesn't it?
> 
> 
> - - -- --- ----- -------- -------------
> Jacek Lewandowski
> 
> 
> On Tue, Sep 6, 2022 at 9:19 AM Sam Tunnicliffe <sam@beobal.com> wrote:
> >
> >
> >
> > On 5 Sep 2022, at 22:02, Henrik Ingo <henrik.ingo@datastax.com> wrote:
> >
> > Mostly I just wanted to ack that at least someone read the doc (somewhat superficially sure, but some parts with thought...)
> >
> >
> > Thanks, it's a lot to digest, so we appreciate that people are working through it. 
> >>
> >> One pre-feature that we would include in the preceding minor release is a node level switch to disable all operations that modify cluster metadata state. This would include schema changes as well as topology-altering events like move, decommission or (gossip-based) bootstrap and would be activated on all nodes for the duration of the major upgrade. If this switch were accessible via internode messaging, activating it for an upgrade could be automated. When an upgraded node starts up, it could send a request to disable metadata changes to any peer still running the old version. This would cost a few redundant messages, but simplify things operationally.
> >>
> >> Although this approach would necessitate an additional minor version upgrade, this is not without precedent and we believe that the benefits outweigh the costs of additional operational overhead.
> >
> >
> > Sounds like a great idea, and probably necessary in practice?
> >  
> >
> >
> > Although I think we _could_ manage without this, it would certainly simplify this and future upgrades.
> >>
> >> If this part of the proposal is accepted, we could also include further messaging protocol changes in the minor release, as these would largely constitute additional verbs which would be implemented with no-op verb handlers initially. This would simplify the major version code, as it would not need to gate the sending of asynchronous replication messages on the receiver's release version. During the migration, it may be useful to have a way to directly inject gossip messages into the cluster, in case the states of the yet-to-be upgraded nodes become inconsistent. This isn't intended, so such a tool may never be required, but we have seen that gossip propagation can be difficult to reason about at times.
> >
> >
> > Others will know the code better and I understand that adding new no-op verbs can be considered safe... But instinctively a bit hesitant on this one. Surely adding a few if statements to the upgraded version isn't that big of a deal?
> >
> > Also, it should make sense to minimize the dependencies from the previous major version (without CEP-21) to the new major version (with CEP-21). If a bug is found, it's much easier to fix code in the new major version than the old and supposedly stable one.
> >
> >
> > Yep, agreed. Adding verb handlers in advance may not buy us very much, so may not be worth the risk of additionally perturbing the stable system. I would say that having a means to directly manipulate gossip state during the upgrade would be a useful safety net in case something unforeseen occurs and we need to dig ourselves out of a hole. The precise scope of the feature & required changes are not something we've given extensive thought to yet, so we'd want to assess that carefully before proceeding.
> >
> > henrik
> >
> > --
> > Henrik Ingo
> > +358 40 569 7354


Re: [DISCUSS] CEP-21: Transactional Cluster Metadata

Posted by Jacek Lewandowski <le...@gmail.com>.
Hi Sam, this is a great idea and a really well described CEP!

I have some questions, perhaps they reflect my weak understanding, but
maybe you can answer:
Is it going to work so that each node reads the log individually and tries to
catch up in a way that it applies a transition locally once the previous
change is confirmed on the majority of the affected nodes, right? If so,
will it be a replica group explicitly associated with each event
(explicitly mentioned nodes which are affected by the change and a list of
those which already applied the change, so that each node individually can
make a decision whether to move forward?). If so, can the node skip a
transformation which does not affect it and move forward thus making
another change concurrently?


What if a node(s) failure prevents progress over the log? For example, we
are unable to get a majority of nodes which process an event so we cannot
move forward. We cannot remove those nodes though, because the removal will
be later in the log and we cannot make progress. I've read about manual
intervention but maybe it can be avoided in some cases for example by
adding no more than one pending event to the log?

For multistep actions - are they going to be added all or none? If they are
added one by one, can they be interleaved with other multistep actions?

Reconfiguration itself occurs using the process that is analogous to
> "regular" bootstrap and also uses Paxos as a linearizability mechanism,
> except for there is no concept of "token" ownership in CMS; all CMS nodes
> own an entire range from MIN to MAX token. This means that during
> bootstrap, we do not have to split ranges, or have some nodes "lose" a part
> of the ring...


This sounds like an implementation of everywhere replication strategy,
doesn't it?


- - -- --- ----- -------- -------------
Jacek Lewandowski


On Tue, Sep 6, 2022 at 9:19 AM Sam Tunnicliffe <sa...@beobal.com> wrote:
>
>
>
> On 5 Sep 2022, at 22:02, Henrik Ingo <he...@datastax.com> wrote:
>
> Mostly I just wanted to ack that at least someone read the doc (somewhat
superficially sure, but some parts with thought...)
>
>
> Thanks, it's a lot to digest, so we appreciate that people are working
through it.
>>
>> One pre-feature that we would include in the preceding minor release is
a node level switch to disable all operations that modify cluster metadata
state. This would include schema changes as well as topology-altering
events like move, decommission or (gossip-based) bootstrap and would be
activated on all nodes for the duration of the major upgrade. If this
switch were accessible via internode messaging, activating it for an
upgrade could be automated. When an upgraded node starts up, it could send
a request to disable metadata changes to any peer still running the old
version. This would cost a few redundant messages, but simplify things
operationally.
>>
>> Although this approach would necessitate an additional minor version
upgrade, this is not without precedent and we believe that the benefits
outweigh the costs of additional operational overhead.
>
>
> Sounds like a great idea, and probably necessary in practice?
>
>
>
> Although I think we _could_ manage without this, it would certainly
simplify this and future upgrades.
>>
>> If this part of the proposal is accepted, we could also include further
messaging protocol changes in the minor release, as these would largely
constitute additional verbs which would be implemented with no-op verb
handlers initially. This would simplify the major version code, as it would
not need to gate the sending of asynchronous replication messages on the
receiver's release version. During the migration, it may be useful to have
a way to directly inject gossip messages into the cluster, in case the
states of the yet-to-be upgraded nodes become inconsistent. This isn't
intended, so such a tool may never be required, but we have seen that
gossip propagation can be difficult to reason about at times.
>
>
> Others will know the code better and I understand that adding new no-op
verbs can be considered safe... But instinctively a bit hesitant on this
one. Surely adding a few if statements to the upgraded version isn't that
big of a deal?
>
> Also, it should make sense to minimize the dependencies from the previous
major version (without CEP-21) to the new major version (with CEP-21). If a
bug is found, it's much easier to fix code in the new major version than
the old and supposedly stable one.
>
>
> Yep, agreed. Adding verb handlers in advance may not buy us very much, so
may not be worth the risk of additionally perturbing the stable system. I
would say that having a means to directly manipulate gossip state during
the upgrade would be a useful safety net in case something unforeseen
occurs and we need to dig ourselves out of a hole. The precise scope of the
feature & required changes are not something we've given extensive thought
to yet, so we'd want to assess that carefully before proceeding.
>
> henrik
>
> --
> Henrik Ingo
> +358 40 569 7354
>

Re: [DISCUSS] CEP-21: Transactional Cluster Metadata

Posted by Sam Tunnicliffe <sa...@beobal.com>.

> On 5 Sep 2022, at 22:02, Henrik Ingo <he...@datastax.com> wrote:
> 
> Mostly I just wanted to ack that at least someone read the doc (somewhat superficially sure, but some parts with thought...)
> 

Thanks, it's a lot to digest, so we appreciate that people are working through it. 
> One pre-feature that we would include in the preceding minor release is a node level switch to disable all operations that modify cluster metadata state. This would include schema changes as well as topology-altering events like move, decommission or (gossip-based) bootstrap and would be activated on all nodes for the duration of the major upgrade. If this switch were accessible via internode messaging, activating it for an upgrade could be automated. When an upgraded node starts up, it could send a request to disable metadata changes to any peer still running the old version. This would cost a few redundant messages, but simplify things operationally.
> Although this approach would necessitate an additional minor version upgrade, this is not without precedent and we believe that the benefits outweigh the costs of additional operational overhead.
> 
> Sounds like a great idea, and probably necessary in practice?
>  

Although I think we _could_ manage without this, it would certainly simplify this and future upgrades.
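
For illustration, the switch itself needn't be more than a node-local flag consulted before any metadata-mutating operation, flippable locally or via an internode message; a minimal sketch, with all names invented:

    import java.util.concurrent.atomic.AtomicBoolean;

    class MetadataChangeGuard {
        private static final AtomicBoolean frozen = new AtomicBoolean(false);

        // Flipped locally by the operator, or by a verb handler when an
        // upgraded node asks an old-version peer to freeze metadata changes.
        static void freeze()   { frozen.set(true); }
        static void unfreeze() { frozen.set(false); }

        // Checked before schema changes, move, decommission, bootstrap, etc.
        static void checkMutationsAllowed() {
            if (frozen.get())
                throw new IllegalStateException(
                        "cluster metadata changes are disabled for upgrade");
        }
    }
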
> If this part of the proposal is accepted, we could also include further messaging protocol changes in the minor release, as these would largely constitute additional verbs which would be implemented with no-op verb handlers initially. This would simplify the major version code, as it would not need to gate the sending of asynchronous replication messages on the receiver's release version. During the migration, it may be useful to have a way to directly inject gossip messages into the cluster, in case the states of the yet-to-be upgraded nodes become inconsistent. This isn't intended, so such a tool may never be required, but we have seen that gossip propagation can be difficult to reason about at times.
> 
> Others will know the code better and I understand that adding new no-op verbs can be considered safe... But instinctively a bit hesitant on this one. Surely adding a few if statements to the upgraded version isn't that big of a deal?
> 
> Also, it should make sense to minimize the dependencies from the previous major version (without CEP-21) to the new major version (with CEP-21). If a bug is found, it's much easier to fix code in the new major version than the old and supposedly stable one.
> 

Yep, agreed. Adding verb handlers in advance may not buy us very much, so may not be worth the risk of additionally perturbing the stable system. I would say that having a means to directly manipulate gossip state during the upgrade would be a useful safety net in case something unforeseen occurs and we need to dig ourselves out of a hole. The precise scope of the feature & required changes are not something we've given extensive thought to yet, so we'd want to assess that carefully before proceeding.

> henrik
> 
> -- 
> Henrik Ingo
> +358 40 569 7354


Re: [DISCUSS] CEP-21: Transactional Cluster Metadata

Posted by Henrik Ingo <he...@datastax.com>.
Mostly I just wanted to ack that at least someone read the doc (somewhat
superficially sure, but some parts with thought...)

One pre-feature that we would include in the preceding minor release is a
> node level switch to disable all operations that modify cluster metadata
> state. This would include schema changes as well as topology-altering
> events like move, decommission or (gossip-based) bootstrap and would be
> activated on all nodes for the duration of the *major* upgrade. If this
> switch were accessible via internode messaging, activating it for an
> upgrade could be automated. When an upgraded node starts up, it could send
> a request to disable metadata changes to any peer still running the old
> version. This would cost a few redundant messages, but simplify things
> operationally.
>
> Although this approach would necessitate an additional minor version
> upgrade, this is not without precedent and we believe that the benefits
> outweigh the costs of additional operational overhead.
>

Sounds like a great idea, and probably necessary in practice?


> If this part of the proposal is accepted, we could also include further
> messaging protocol changes in the minor release, as these would largely
> constitute additional verbs which would be implemented with no-op verb
> handlers initially. This would simplify the major version code, as it would
> not need to gate the sending of asynchronous replication messages on the
> receiver's release version. During the migration, it may be useful to have
> a way to directly inject gossip messages into the cluster, in case the
> states of the yet-to-be upgraded nodes become inconsistent. This isn't
> intended, so such a tool may never be required, but we have seen that
> gossip propagation can be difficult to reason about at times.
>

Others will know the code better and I understand that adding new no-op
verbs can be considered safe... But instinctively a bit hesitant on this
one. Surely adding a few if statements to the upgraded version isn't that
big of a deal?

Also, it should make sense to minimize the dependencies from the previous
major version (without CEP-21) to the new major version (with CEP-21). If a
bug is found, it's much easier to fix code in the new major version than
the old and supposedly stable one.

henrik

-- 

Henrik Ingo

+358 40 569 7354

Re: [DISCUSS] CEP-21: Transactional Cluster Metadata

Posted by Sam Tunnicliffe <sa...@beobal.com>.
Good catch, I'll update the doc.

Thanks, 
Sam

> On 24 Aug 2022, at 10:24, Claude Warren, Jr via dev <de...@cassandra.apache.org> wrote:
> 
> Should 
> 
> (**) It may seem counterintuitive, that A is being written to even after we've stopped reading from it. This is done in order to guarantee that by the time we stop writing to the node giving up the range, there is no coordinator that may attempt reading from it without learning about at least the epoch where it is not a part of a read set. In other words, we have to keep writing until there's any chance there might be a reader.
> 
> instead read:
> 
> (**) It may seem counterintuitive, that A is being written to even after we've stopped reading from it. This is done in order to guarantee that by the time we stop writing to the node giving up the range, there is no coordinator that may attempt reading from it without learning about at least the epoch where it is not a part of a read set. In other words, we have to keep writing while there's any chance there might be a reader.
> 
> On Tue, Aug 23, 2022 at 7:13 PM Mick Semb Wever <mck@apache.org> wrote:
> 
> 
> I just want to say I’m really excited about this work. It’s one of the last remaining major inadequacies of the project that makes it hard for people to deploy, and hard for us to develop.
> 
> 
> 
> Second this. And what a solid write up Sam - it's a real joy reading this CEP.


Re: [DISCUSS] CEP-21: Transactional Cluster Metadata

Posted by "Claude Warren, Jr via dev" <de...@cassandra.apache.org>.
Should

(**) It may seem counterintuitive, that A is being written to even after
> we've stopped reading from it. This is done in order to guarantee that by
> the time we stop writing to the node giving up the range, there is no
> coordinator that may attempt reading from it without learning about *at
> least* the epoch where it is not a part of a read set. In other words, we
> have to keep writing until there's any chance there might be a reader.


instead read:

(**) It may seem counterintuitive, that A is being written to even after
we've stopped reading from it. This is done in order to guarantee that by
the time we stop writing to the node giving up the range, there is no
coordinator that may attempt reading from it without learning about *at
least* the epoch where it is not a part of a read set. In other words, we
have to keep writing *while* there's any chance there might be a reader.

On Tue, Aug 23, 2022 at 7:13 PM Mick Semb Wever <mc...@apache.org> wrote:

>
>
> I just want to say I’m really excited about this work. It’s one of the
>> last remaining major inadequacies of the project that makes it hard for
>> people to deploy, and hard for us to develop.
>>
>>
>
> Second this. And what a solid write up Sam - it's a real joy reading this
> CEP.
>

Re: [DISCUSS] CEP-21: Transactional Cluster Metadata

Posted by Mick Semb Wever <mc...@apache.org>.
I just want to say I’m really excited about this work. It’s one of the last
> remaining major inadequacies of the project that makes it hard for people
> to deploy, and hard for us to develop.
>
>

Second this. And what a solid write up Sam - it's a real joy reading this
CEP.

Re: [DISCUSS] CEP-21: Transactional Cluster Metadata

Posted by Benedict <be...@apache.org>.
I just want to say I’m really excited about this work. It’s one of the last remaining major inadequacies of the project that makes it hard for people to deploy, and hard for us to develop.

Can’t wait for it to be fixed.

> On 22 Aug 2022, at 13:45, Sam Tunnicliffe <sa...@beobal.com> wrote:
> Hi,
> 
> I'd like to open discussion about this CEP: https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-21%3A+Transactional+Cluster+Metadata	
> Cluster metadata in Cassandra comprises a number of disparate elements including, but not limited to, distributed schema, topology and token ownership. Following the general design principles of Cassandra, the mechanisms for coordinating updates to cluster state have favoured eventual consistency, with probabilistic delivery via gossip being a prime example. Undoubtedly, this approach has benefits, not least in terms of resilience, particularly in highly fluid distributed environments. However, this is not the reality of most Cassandra deployments, where the total number of nodes is relatively small (i.e. in the low thousands) and the rate of change tends to be low.
> 
> Historically, a significant proportion of issues affecting operators and users of Cassandra have been due, at least in part, to a lack of strongly consistent cluster metadata. In response to this, we propose a design which aims to provide linearizability of metadata changes whilst ensuring that the effects of those changes are made visible to all nodes in a strongly consistent manner. At its core, it is also pluggable, enabling Cassandra-derived projects to supply their own implementations if desired.
> 
> In addition to the CEP document itself, we aim to publish a working prototype of the proposed design. Obviously, this does not implement the entire proposal and there are several parts which remain only partially complete. It does include the core of the system, including a good deal of test infrastructure, so may serve as both illustration of the design and a starting point for real implementation.