Posted to dev@cassandra.apache.org by "Claude Warren, Jr via dev" <de...@cassandra.apache.org> on 2022/10/20 11:24:56 UTC

CEP-21 and complete cluster replacement.

After CEP-21, would it be possible to take a cluster of 6 nodes, spin up 6
new nodes to duplicate the 6 existing nodes, and then spin down the original
6 nodes?  Basically, I am thinking of the case where a cluster is running
version x.y.z and the operator wants to run x.y.z+1: can they spin up an
equal number of x.y.z+1 systems and replace the old ones without shutting
down the cluster?

We currently try something like this where we spin up 1 system and then
drop 1 system until all the old nodes are replaced.  This process
frequently runs into streaming failures while bootstrapping.

Any insights would be appreciated.

Claude

Re: CEP-21 and complete cluster replacement.

Posted by Alex Petrov <al...@coffeenco.de>.
In case my earlier message read differently: most of the things I described are things that CEP-21 _enables_ us to do.

CEP-21 itself will just (more or less) make cluster operations consistent. The rest are features that we will implement on top of it, and people will need to adopt new tooling before most of the operations I describe become available.

On Thu, Oct 20, 2022, at 4:42 PM, Alex Petrov wrote:
> > by default C* does prohibit concurrent bootstraps (behaviour which can be overridden with the cassandra.consistent.rangemovement system property). But there's nothing to stop you fully bootstrapping additional nodes in series, then removing them in the same way.
> 
> I think there are multiple important things in which CEP-21 actually might be helpful. Right now, in a 5-node cluster with RF=3, each node holds the range between its predecessor's token and its own token in the ring, along with RF-1 (here, two) ranges replicated from its neighbours.
> 
> What CEP-21 will allow us to do is make _some_ RF-sized subset of the 5 nodes in the cluster the owners of an arbitrary range. That will _also_ mean that you can add a 6th node that owns nothing at first, bootstrap it as a participant in the read/write quorums for the same ranges node A is a read/write replica of, and, in the next step, remove A as a read/write replica.
> 
> I believe such an approach would still be incredibly costly (i.e. you would have to re-stream the entire data set), but if other means are available for sharing disks or sstables that would lower the cost for you, this might even work as a lower-risk upgrade option, even though I think most operators won't be using it. What could be widely beneficial is the ability to test a new version as a canary in write-survey mode, and then add it as a read replica, but only for a small subset of data (effectively decreasing availability of this particular range by extending its RF).
> 
> > What you will be able to do post CEP-21, is to run concurrent bootstraps of nodes which don't share ranges
> 
> I think we can do even better: we can take an arbitrary range and split it into N parts, effectively making all N parts bootstrappable in parallel. I also think (though I haven't checked whether that's truly the case) that we can prepare a plan in which we allow executing StartJoin for all nodes while the range is locked, but block execution of `MidJoin` for any of the nodes until StartJoin has been executed for all of them and, similarly, hold back FinishJoin until MidJoin has been executed for all the nodes. In other words, I think there might be a bit of room for flexibility; the question is which way will be the most beneficial.
> 
> On Thu, Oct 20, 2022, at 3:33 PM, Sam Tunnicliffe wrote:
>> > Add A' to the cluster with the same keyspace as A.
>> 
>> Can you clarify what you mean here?
>> 
>> > Currently these operations have to be performed in sequence.  My understanding is that you can't add more than one node at a time.  
>> 
>> To ensure consistency guarantees are honoured, by default C* does prohibit concurrent bootstraps (behaviour which can be overridden with the cassandra.consistent.rangemovement system property). But there's nothing to stop you fully bootstrapping additional nodes in series, then removing them in the same way.
>> 
>> Why you would want to do this, or to use bootstrap and remove for this at all rather than upgrading in place, isn't clear to me though; doing it this way just adds a streaming overhead that doesn't otherwise exist.
>> 
>> What you will be able to do post CEP-21 is run concurrent bootstraps of nodes which don't share ranges. This is a definite improvement on the status quo, but it's only an initial step. CEP-21 is intended to lay the foundations for further improvements down the line.
>> 
>> 
>>> On 20 Oct 2022, at 14:04, Claude Warren, Jr via dev <de...@cassandra.apache.org> wrote:
>>> 
>>> My understanding of our process is (assuming we have 3 nodes A,B,C):
>>>  * Add A' to the cluster with the same keyspace as A.
>>>  * Remove A from the cluster.
>>>  * Add B' to the cluster
>>>  * Remove B from the cluster
>>>  * Add C' to the cluster
>>>  * Remove C from the cluster.
>>> Currently these operations have to be performed in sequence.  My understanding is that you can't add more than one node at a time.  What we would like to do is do this in 3 steps:
>>>  * Add A', B', C' to the cluster.
>>>  * Wait for all 3 to be accepted and functioning.
>>>  * Remove A, B, C from the cluster.
>>> Does CEP-21 make this possible?
>>> 
>>> On Thu, Oct 20, 2022 at 1:43 PM Sam Tunnicliffe <sa...@beobal.com> wrote:
>>>> I'm not sure I 100% understand the question, but the things covered in CEP-21 won't enable you, as an operator, to bootstrap all your new nodes without fully joining, then perform an atomic CAS to replace the existing members. CEP-21 alone also won't solve all cross-version streaming issues, which is one reason performing topology-modifying operations like bootstrap & decommission during an upgrade is not generally considered a good idea.
>>>> 
>>>> Transactional metadata will make the bootstrapping (and decommissioning) experience a whole lot more stable and predictable, so in the short term I would expect the recommended rolling approach to upgrades to improve significantly.
>>>> 
>>>> 
>>>> > On 20 Oct 2022, at 12:24, Claude Warren, Jr via dev <de...@cassandra.apache.org> wrote:
>>>> > 
>>>> > After CEP-21, would it be possible to take a cluster of 6 nodes, spin up 6 new nodes to duplicate the 6 existing nodes, and then spin down the original 6 nodes?  Basically, I am thinking of the case where a cluster is running version x.y.z and the operator wants to run x.y.z+1: can they spin up an equal number of x.y.z+1 systems and replace the old ones without shutting down the cluster?
>>>> > 
>>>> > We currently try something like this where we spin up 1 system and then drop 1 system until all the old nodes are replaced.  This process frequently runs into streaming failures while bootstrapping.
>>>> > 
>>>> > Any insights would be appreciated.
>>>> > 
>>>> > Claude
>>>> 
> 

Re: CEP-21 and complete cluster replacement.

Posted by Sam Tunnicliffe <sa...@beobal.com>.
Right, all of the things you describe will be possible post CEP-21, just not immediately. My point is that CEP-21 has a specific scope and a lot of the great planned improvements necessarily fall outside of that.

Re: CEP-21 and complete cluster replacement.

Posted by Alex Petrov <al...@coffeenco.de>.
> by default C* does prohibit concurrent bootstraps (behaviour which can be overridden with the cassandra.consistent.rangemovement system property). But there's nothing to stop you fully bootstrapping additional nodes in series, then removing them in the same way.

I think there are multiple important things in which CEP-21 actually might be helpful. Right now, in a 5-node cluster with RF=3, each node holds the range between its predecessor's token and its own token in the ring, along with RF-1 (here, two) ranges replicated from its neighbours.
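
The ownership pattern described above can be sketched in a few lines. This is only an illustration under simplifying assumptions: one token per node, simple clockwise replica placement (no vnodes, no rack or datacenter awareness), and made-up token values and node names. The helper names `replicas_for` and `ranges_held` are hypothetical and are not Cassandra APIs.

```python
from bisect import bisect_left


def replicas_for(token, ring, rf):
    """Return the rf nodes that replicate `token` under simple ring placement.

    ring is a list of (token, node) pairs sorted by token. The primary
    replica is the first node whose token is >= the key's token (wrapping
    around the ring); the other rf-1 replicas are the next nodes clockwise.
    """
    tokens = [t for t, _ in ring]
    start = bisect_left(tokens, token) % len(ring)
    return [ring[(start + i) % len(ring)][1] for i in range(rf)]


def ranges_held(ring, rf):
    """Map each node to the (exclusive_start, inclusive_end] ranges it stores."""
    held = {node: [] for _, node in ring}
    for i, (end, _) in enumerate(ring):
        start = ring[i - 1][0]  # predecessor's token; index -1 wraps for i == 0
        for node in replicas_for(end, ring, rf):
            held[node].append((start, end))
    return held


# Hypothetical 5-node ring with RF=3.
ring = [(0, "A"), (20, "B"), (40, "C"), (60, "D"), (80, "E")]
held = ranges_held(ring, rf=3)
```

In this toy ring, node A ends up holding its own range (80, 0] plus the two ranges whose primaries are D and E, i.e. RF-1 extra ranges, which matches the description above.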

What CEP-21 will allow us to do is make _some_ RF-sized subset of the 5 nodes in the cluster the owners of an arbitrary range. That will _also_ mean that you can add a 6th node that owns nothing at first, bootstrap it as a participant in the read/write quorums for the same ranges node A is a read/write replica of, and, in the next step, remove A as a read/write replica.

I believe such an approach would still be incredibly costly (i.e. you would have to re-stream the entire data set), but if other means are available for sharing disks or sstables that would lower the cost for you, this might even work as a lower-risk upgrade option, even though I think most operators won't be using it. What could be widely beneficial is the ability to test a new version as a canary in write-survey mode, and then add it as a read replica, but only for a small subset of data (effectively decreasing availability of this particular range by extending its RF).
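
One way to read the remark about availability is through quorum arithmetic. The sketch below assumes QUORUM-style reads and writes, where a quorum is floor(RF / 2) + 1 replicas; the function names are made up for illustration.

```python
def quorum(rf):
    """Number of replicas that must respond for a quorum operation."""
    return rf // 2 + 1


def tolerated_failures(rf):
    """How many replicas of a range can be down while a quorum is still reachable."""
    return rf - quorum(rf)


# Compare the normal RF with a temporarily extended one.
summary = {rf: (quorum(rf), tolerated_failures(rf)) for rf in (3, 4)}
```

Going from RF=3 to RF=4 raises the quorum from 2 to 3 while the number of tolerated failures stays at 1, so the range now depends on four nodes but can still lose only one of them.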

> What you will be able to do post CEP-21, is to run concurrent bootstraps of nodes which don't share ranges

I think we can do even better: we can take an arbitrary range and split it into N parts, effectively making all N parts bootstrappable in parallel. I also think (though I haven't checked whether that's truly the case) that we can prepare a plan in which we allow executing StartJoin for all nodes while the range is locked, but block execution of `MidJoin` for any of the nodes until StartJoin has been executed for all of them and, similarly, hold back FinishJoin until MidJoin has been executed for all the nodes. In other words, I think there might be a bit of room for flexibility; the question is which way will be the most beneficial.
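
The phase ordering floated above (untested, as noted) can be written down as a small sketch. Only the step names StartJoin, MidJoin and FinishJoin come from this thread; what each step does is not modelled, and `phased_plan` and `respects_barriers` are hypothetical helpers, not part of any CEP-21 implementation.

```python
PHASES = ("StartJoin", "MidJoin", "FinishJoin")


def phased_plan(nodes):
    """Order join steps so each phase runs on every node before the next phase starts."""
    return [(phase, node) for phase in PHASES for node in nodes]


def respects_barriers(plan, nodes):
    """Check that no node enters a phase before all nodes completed the previous one."""
    done = {phase: set() for phase in PHASES}
    for phase, node in plan:
        idx = PHASES.index(phase)
        if idx > 0 and done[PHASES[idx - 1]] != set(nodes):
            return False
        done[phase].add(node)
    return True


nodes = ["A'", "B'", "C'"]
plan = phased_plan(nodes)

# For contrast: each node runs all three steps before the next node starts.
per_node = [(phase, node) for node in nodes for phase in PHASES]
```

Here `respects_barriers(plan, nodes)` holds, while the per-node ordering violates the barrier as soon as the first node reaches MidJoin before the others have executed StartJoin.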

Re: CEP-21 and complete cluster replacement.

Posted by Sam Tunnicliffe <sa...@beobal.com>.
> Add A' to the cluster with the same keyspace as A.

Can you clarify what you mean here?

> Currently these operations have to be performed in sequence.  My understanding is that you can't add more than one node at a time.  

To ensure consistency guarantees are honoured, by default C* does prohibit concurrent bootstraps (behaviour which can be overridden with the cassandra.consistent.rangemovement system property). But there's nothing to stop you fully bootstrapping additional nodes in series, then removing them in the same way.

Why you would want to do this, or to use bootstrap and remove for this at all rather than upgrading in place, isn't clear to me though; doing it this way just adds a streaming overhead that doesn't otherwise exist.

What you will be able to do post CEP-21 is run concurrent bootstraps of nodes which don't share ranges. This is a definite improvement on the status quo, but it's only an initial step. CEP-21 is intended to lay the foundations for further improvements down the line.
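
As a rough illustration of what "don't share ranges" could mean, the sketch below treats two joins as independent when the sets of existing ranges the new nodes would replicate are disjoint. This is an approximation for discussion only, assuming one token per node and simple clockwise placement; it is not the check CEP-21 performs, and `touched_ranges` and `can_bootstrap_concurrently` are hypothetical names.

```python
from bisect import bisect_left


def touched_ranges(new_token, tokens, rf):
    """End tokens of the existing ranges a node joining at new_token would replicate.

    tokens is the sorted list of tokens in the current ring, one per node.
    The new node splits the range containing new_token and also becomes a
    replica for the rf-1 ranges that precede it on the ring.
    """
    idx = bisect_left(tokens, new_token) % len(tokens)
    return {tokens[(idx - i) % len(tokens)] for i in range(rf)}


def can_bootstrap_concurrently(token_a, token_b, tokens, rf):
    """True if the two joins touch disjoint sets of existing ranges."""
    return touched_ranges(token_a, tokens, rf).isdisjoint(
        touched_ranges(token_b, tokens, rf)
    )


# Hypothetical 12-node ring with tokens 0, 10, ..., 110 and RF=3.
tokens = list(range(0, 120, 10))
far_apart = can_bootstrap_concurrently(5, 65, tokens, rf=3)
neighbours = can_bootstrap_concurrently(5, 15, tokens, rf=3)
```

In this toy ring, joins at tokens 5 and 65 touch different ranges, whereas joins at 5 and 15 both touch the ranges ending at tokens 0 and 10 and so would have to be serialised under this approximation.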


Re: CEP-21 and complete cluster replacement.

Posted by "Claude Warren, Jr via dev" <de...@cassandra.apache.org>.
My understanding of our process is (assuming we have 3 nodes A,B,C):

   - Add A' to the cluster with the same keyspace as A.
   - Remove A from the cluster.
   - Add B' to the cluster
   - Remove B from the cluster
   - Add C' to the cluster
   - Remove C from the cluster.

Currently these operations have to be performed in sequence.  My
understanding is that you can't add more than one node at a time.  What we
would like to do is do this in 3 steps:

   - Add A', B', C' to the cluster.
   - Wait for all 3 to be accepted and functioning.
   - Remove A, B, C from the cluster.

Does CEP-21 make this possible?
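
To make the difference between the two procedures concrete, here is a small sketch that writes both out as step lists and compares the peak cluster size. The step names ("add", "wait", "remove") and the helper functions are made up for illustration; they do not correspond to nodetool commands or any Cassandra API.

```python
def one_at_a_time(old_nodes):
    """Current process: add one replacement, remove one old node, repeat."""
    steps = []
    for node in old_nodes:
        steps.append(("add", [node + "'"]))
        steps.append(("remove", [node]))
    return steps


def all_at_once(old_nodes):
    """Desired process: add every replacement, wait, then remove every old node."""
    new_nodes = [node + "'" for node in old_nodes]
    return [("add", new_nodes), ("wait", new_nodes), ("remove", list(old_nodes))]


def peak_cluster_size(old_nodes, steps):
    """Largest number of nodes in the cluster at any point while running the steps."""
    size = peak = len(old_nodes)
    for op, nodes in steps:
        if op == "add":
            size += len(nodes)
        elif op == "remove":
            size -= len(nodes)
        peak = max(peak, size)
    return peak


old = ["A", "B", "C"]
sequential = one_at_a_time(old)
batched = all_at_once(old)
```

For three nodes, the sequential plan has six steps and never exceeds four nodes, while the batched plan has three steps but briefly runs six nodes.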

Re: CEP-21 and complete cluster replacement.

Posted by Sam Tunnicliffe <sa...@beobal.com>.
I'm not sure I 100% understand the question, but the things covered in CEP-21 won't enable you, as an operator, to bootstrap all your new nodes without fully joining, then perform an atomic CAS to replace the existing members. CEP-21 alone also won't solve all cross-version streaming issues, which is one reason performing topology-modifying operations like bootstrap & decommission during an upgrade is not generally considered a good idea.

Transactional metadata will make the bootstrapping (and decommissioning) experience a whole lot more stable and predictable, so in the short term I would expect the recommended rolling approach to upgrades to improve significantly.

