Posted to user@cassandra.apache.org by Richard Dawe <ri...@messagesystems.com> on 2015/09/09 16:52:40 UTC

Should replica placement change after a topology change?

Good afternoon,

I am investigating various topology changes, and their effect on replica placement. As far as I can tell, replica placement is not changing after I’ve changed the topology and run nodetool repair + cleanup. I followed the procedure described at http://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_switch_snitch.html

Here is my test scenario:

  1.  Cassandra 2.0.15
  2.  6 nodes, initially set up with SimpleSnitch, vnodes enabled, all in one data centre.
  3.  Keyspace set up with SimpleStrategy, replication factor 3.
  4.  Four rows inserted into table in keyspace, integer primary key, text value.
  5.  I shut down the cluster and switch to GossipingPropertyFileSnitch, assigning nodes 1+2 to RAC1, 3+4 to RAC2, and 5+6 to RAC3, all in data centre DC1 (configuration sketched after this list).
  6.  Restart C* on all nodes.
  7.  Run a nodetool repair plus cleanup.
  8.  Change the keyspace to use replication strategy NetworkTopologyStrategy, RF 3 in DC1.
  9.  Run a nodetool repair plus cleanup.

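A minimal sketch of the configuration implied by steps 5 and 8, assuming the keyspace is named my_keyspace (the real keyspace and table names are not given above). For step 5, on nodes 1 and 2 (nodes 3+4 would use RAC2, nodes 5+6 RAC3):

    # cassandra.yaml
    endpoint_snitch: GossipingPropertyFileSnitch

    # cassandra-rackdc.properties
    dc=DC1
    rack=RAC1

And for step 8, after the restart and first repair/cleanup:

    ALTER KEYSPACE my_keyspace
      WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3};
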
To determine the token range ownership, I used “nodetool ring <keyspace>” and “nodetool info -T <keyspace>”. I saved the output of those commands with the original topology, after changing the topology, after repairing, after changing the replication strategy, and then again after repairing. In no cases did the tokens change. It looks like nodetool ring and nodetool info -T show the owner but not the replicas for a particular range.

I was expecting the replica placement to change. Because the racks were assigned in groups (rather than alternating), I was expecting the original replica placement with SimpleStrategy to be non-optimal after switching to NetworkTopologyStrategy. E.g.: if some data was replicated to nodes 1, 2 and 3, then after the topology change there would be 2 replicas in RAC1, 1 in RAC2 and none in RAC3. And hence when the repair ran, it would remove one replica from RAC1 and make sure that there was a replica in RAC3.

However, when I did a query using cqlsh at consistency QUORUM, I saw that it was hitting two replicas in the same rack, and a replica in a different rack. This suggests that the replica placement did not change after the topology change.
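
For reference, this kind of check can be reproduced in cqlsh; the trace printed after the result shows which replica nodes were contacted (keyspace, table, and key names below are placeholders, not taken from the thread):

    cqlsh> CONSISTENCY QUORUM;
    cqlsh> TRACING ON;
    cqlsh> SELECT * FROM my_keyspace.my_table WHERE id = 1;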

Am I missing something?

Is there some way I can see which nodes have a replica for a given token range?

Any help/insight appreciated.

Thanks, best regards, Rich


Re: Should replica placement change after a topology change?

Posted by Robert Coli <rc...@eventbrite.com>.
On Wed, Sep 16, 2015 at 3:39 AM, Richard Dawe <ri...@messagesystems.com>
wrote:

> In that mixed non-EC2/EC2 environment, with GossipingPropertyFileSnitch,
> it seems like you would need to simulate what Ec2Snitch does, and manually
> configure GPFS to treat each Availability Zone as a rack.
>

Yes, you configure GPFS with the same identifiers EC2Snitch would use, and
then rack awareness takes care of the rest.
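
For example, Ec2Snitch treats the EC2 region as the data centre and the availability zone as the rack (a node in us-east-1a reports data centre "us-east" and rack "1a"), so a non-EC2 node meant to sit alongside those nodes could carry something like the following in cassandra-rackdc.properties (the region here is purely illustrative):

    dc=us-east
    rack=1a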

=Rob

Re: Should replica placement change after a topology change?

Posted by Richard Dawe <ri...@messagesystems.com>.
Hi Rob,

On 11/09/2015 18:27, "Robert Coli" <rc...@eventbrite.com> wrote:
On Fri, Sep 11, 2015 at 7:24 AM, Richard Dawe <ri...@messagesystems.com> wrote:
Thanks, Nate and Rob. We are going to have to migrate some installations from SimpleSnitch to Ec2Snitch, others to GossipingPropertyFileSnitch. Your help is much appreciated!

If I were operating in a hybrid ec2/non-ec2 environment, I'd use GPFS everywhere, FWIW.

Right now we don’t have this mix — it’s either EC2 or non-EC2 — but who knows what the future holds?

In that mixed non-EC2/EC2 environment, with GossipingPropertyFileSnitch, it seems like you would need to simulate what Ec2Snitch does, and manually configure GPFS to treat each Availability Zone as a rack.

Thanks, best regards, Rich


Re: Should replica placement change after a topology change?

Posted by Robert Coli <rc...@eventbrite.com>.
On Fri, Sep 11, 2015 at 7:24 AM, Richard Dawe <ri...@messagesystems.com>
wrote:

> Thanks, Nate and Rob. We are going to have to migrate some installations
> from SimpleSnitch to Ec2Snitch, others to GossipingPropertyFileSnitch. Your
> help is much appreciated!
>

If I were operating in a hybrid ec2/non-ec2 environment, I'd use GPFS
everywhere, FWIW.

=Rob

Re: Should replica placement change after a topology change?

Posted by Richard Dawe <ri...@messagesystems.com>.
Thanks, Nate and Rob. We are going to have to migrate some installations from SimpleSnitch to Ec2Snitch, others to GossipingPropertyFileSnitch. Your help is much appreciated!

Best regards, Rich

On 10/09/2015 20:33, "Nate McCall" <na...@thelastpickle.com> wrote:


So if you have a topology that would change if you switched from SimpleStrategy to NetworkTopologyStrategy plus multiple racks, it sounds like a different migration strategy would be needed?

I am imagining:

  1.  Switch to a different snitch, and the keyspace from SimpleStrategy to NTS but keep it all in one rack. So effectively the same topology, but with a different snitch.
  2.  Set up a new data centre with the desired topology.
  3.  Change the keyspace to have replicas in the new DC.
  4.  Rebuild all the nodes in the new DC.
  5.  Flip all your clients over to the new DC.
  6.  Decommission your original DC.

That would work, yes. I would add:

- 4.5. Repair all nodes.

I can confirm that the above process works (definitely include Rob's repair suggestion, though). It is really the only way we've found to safely go from SimpleSnitch to rack-aware NTS.

The same process works/is required for SimpleSnitch to Ec2Snitch fwiw.
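
Pulling the steps and Rob's 4.5 together, a rough sketch (keyspace and data centre names here are placeholders; the new data centre is assumed to be called DC2):

    -- Step 3: add replicas in the new data centre
    ALTER KEYSPACE my_keyspace
      WITH replication = {'class': 'NetworkTopologyStrategy',
                          'DC1': 3, 'DC2': 3};

    # Step 4: on each node in the new DC, stream existing data from the old DC
    nodetool rebuild -- DC1

    # Step 4.5: repair all nodes
    nodetool repair

    -- Step 6: once clients have moved to DC2, drop DC1 from the keyspace,
    -- then run "nodetool decommission" on each DC1 node
    ALTER KEYSPACE my_keyspace
      WITH replication = {'class': 'NetworkTopologyStrategy', 'DC2': 3};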


Re: Should replica placement change after a topology change?

Posted by Robert Coli <rc...@eventbrite.com>.
On Thu, Sep 10, 2015 at 12:33 PM, Nate McCall <na...@thelastpickle.com>
wrote:

> I can confirm that the above process works (definitely include Rob's
> repair suggestion, though). It is really the only way we've found to safely
> go from SimpleSnitch to rack-aware NTS.
>
> The same process works/is required for SimpleSnitch to Ec2Snitch fwiw.
>

I have safely gone from SimpleSnitch/Strategy to NTS/Ec2Snitch by doing a
NOOP in terms of replica placement, a few times.

This was before vnodes... I feel like vnodes may be a meaningful
impediment; they certainly make checking all ranges before and after much
more involved...

=Rob

Re: Should replica placement change after a topology change?

Posted by Nate McCall <na...@thelastpickle.com>.
>
>
> So if you have a topology that would change if you switched from
>> SimpleStrategy to NetworkTopologyStrategy plus multiple racks, it sounds
>> like a different migration strategy would be needed?
>>
>> I am imagining:
>>
>>    1. Switch to a different snitch, and the keyspace from SimpleStrategy
>>    to NTS but keep it all in one rack. So effectively the same topology, but
>>    with a different snitch.
>>    2. Set up a new data centre with the desired topology.
>>    3. Change the keyspace to have replicas in the new DC.
>>    4. Rebuild all the nodes in the new DC.
>>    5. Flip all your clients over to the new DC.
>>    6. Decommission your original DC.
>>
> That would work, yes. I would add:
>
> - 4.5. Repair all nodes.
>

I can confirm that the above process works (definitely include Rob's repair
suggestion, though). It is really the only way we've found to safely go
from SimpleSnitch to rack-aware NTS.

The same process works/is required for SimpleSnitch to Ec2Snitch fwiw.




-- 
-----------------
Nate McCall
Austin, TX
@zznate

Co-Founder & Sr. Technical Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

Re: Should replica placement change after a topology change?

Posted by Robert Coli <rc...@eventbrite.com>.
On Thu, Sep 10, 2015 at 8:55 AM, Richard Dawe <ri...@messagesystems.com>
wrote:

> So if you have a topology that would change if you switched from
> SimpleStrategy to NetworkTopologyStrategy plus multiple racks, it sounds
> like a different migration strategy would be needed?
>
> I am imagining:
>
>    1. Switch to a different snitch, and the keyspace from SimpleStrategy
>    to NTS but keep it all in one rack. So effectively the same topology, but
>    with a different snitch.
>    2. Set up a new data centre with the desired topology.
>    3. Change the keyspace to have replicas in the new DC.
>    4. Rebuild all the nodes in the new DC.
>    5. Flip all your clients over to the new DC.
>    6. Decommission your original DC.
>
That would work, yes. I would add:

- 4.5. Repair all nodes.

But really, avoid getting in this situation in the first place... :D

=Rob

Re: Should replica placement change after a topology change?

Posted by Richard Dawe <ri...@messagesystems.com>.
Hi Robert,

Firstly, thank you very much for your help. I have some comments inline below.

On 10/09/2015 01:26, "Robert Coli" <rc...@eventbrite.com> wrote:

On Wed, Sep 9, 2015 at 7:52 AM, Richard Dawe <ri...@messagesystems.com> wrote:
I am investigating various topology changes, and their effect on replica placement. As far as I can tell, replica placement is not changing after I’ve changed the topology and run nodetool repair + cleanup. I followed the procedure described at http://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_switch_snitch.html

That's probably a good thing. I'm going to be modifying the warning in the cassandra.yaml to advise users that in practice the only change of snitch or replication strategy one can safely do is one in which replica placement does not change. It currently says that you need to repair, but there are plenty of scenarios where you lose all existing replicas for a given datum, and are therefore unable to repair. The key is that you need at least one replica to stay the same or repair is worthless. And if you only have one replica staying the same, you lose any consistency contract you might have been operating under. One ALMOST NEVER ACTUALLY WANTS TO DO ANYTHING BUT A NO-OP HERE.

So if you have a topology that would change if you switched from SimpleStrategy to NetworkTopologyStrategy plus multiple racks, it sounds like a different migration strategy would be needed?

I am imagining:

  1.  Switch to a different snitch, and the keyspace from SimpleStrategy to NTS but keep it all in one rack. So effectively the same topology, but with a different snitch.
  2.  Set up a new data centre with the desired topology.
  3.  Change the keyspace to have replicas in the new DC.
  4.  Rebuild all the nodes in the new DC.
  5.  Flip all your clients over to the new DC.
  6.  Decommission your original DC.

Or something like that.


Here is my test scenario: <snip>


  1.  To determine the token range ownership, I used “nodetool ring <keyspace>” and “nodetool info -T <keyspace>”. I saved the output of those commands with the original topology, after changing the topology, after repairing, after changing the replication strategy, and then again after repairing. In no cases did the tokens change. It looks like nodetool ring and nodetool info -T show the owner but not the replicas for a particular range.

The tokens and ranges shouldn't be changing, the replica placement should be. AFAIK neither of those commands show you replica placement, they show you primary range ownership.

Use getendpoints to determine replica placement before and after.
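
For example, with the four-row table from the original test (keyspace and table names here are placeholders), something like:

    nodetool getendpoints my_keyspace my_table 1

prints one IP address per replica of the partition owning key 1; capturing that before and after the change shows whether placement actually moved.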


Thanks, I will play with that when I have a chance next week.


I was expecting the replica placement to change. Because the racks were assigned in groups (rather than alternating), I was expecting the original replica placement with SimpleStrategy to be non-optimal after switching to NetworkTopologyStrategy. E.g.: if some data was replicated to nodes 1, 2 and 3, then after the topology change there would be 2 replicas in RAC1, 1 in RAC2 and none in RAC3. And hence when the repair ran, it would remove one replica from RAC1 and make sure that there was a replica in RAC3.

I would expect this to be the case.

However, when I did a query using cqlsh at consistency QUORUM, I saw that it was hitting two replicas in the same rack, and a replica in a different rack. This suggests that the replica placement did not change after the topology change.

Perhaps you are seeing the quirks of the current rack-aware implementation, explicated here?

https://issues.apache.org/jira/browse/CASSANDRA-3810


Thanks. I need to re-read that a few times to understand it.

Is there some way I can see which nodes have a replica for a given token range?

Not for a range, but for a given key with nodetool getendpoints.

I wonder if there would be value to the range... in the pre-vnode past I have merely generated a key for each range. With the number of ranges increased so dramatically by vnodes, it might be easier to have an endpoint that works on ranges...
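
A rough sketch of that idea for the four test keys only (keyspace and table names are placeholders; with vnodes this does not cover every range, just the rows that exist):

    for key in 1 2 3 4; do
      echo "key $key:"
      nodetool getendpoints my_keyspace my_table "$key"
    done > placement-before.txt
    # repeat into placement-after.txt once the topology/strategy change and
    # repair are done, then diff the two files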

Thank you again. Best regards, Rich


=Rob


Re: Should replica placement change after a topology change?

Posted by Robert Coli <rc...@eventbrite.com>.
On Wed, Sep 9, 2015 at 7:52 AM, Richard Dawe <ri...@messagesystems.com>
wrote:

> I am investigating various topology changes, and their effect on replica
> placement. As far as I can tell, replica placement is not changing after
> I’ve changed the topology and run nodetool repair + cleanup. I followed the
> procedure described at
> http://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_switch_snitch.html
>

That's probably a good thing. I'm going to be modifying the warning in the
cassandra.yaml to advise users that in practice the only change of snitch
or replication strategy one can safely do is one in which replica placement
does not change. It currently says that you need to repair, but there are
plenty of scenarios where you lose all existing replicas for a given datum,
and are therefore unable to repair. The key is that you need at least one
replica to stay the same or repair is worthless. And if you only have one
replica staying the same, you lose any consistency contract you
might have been operating under. One ALMOST NEVER ACTUALLY WANTS TO DO
ANYTHING BUT A NO-OP HERE.

Here is my test scenario: <snip>
>


>
>    1. To determine the token range ownership, I used “nodetool ring
>    <keyspace>” and “nodetool info -T <keyspace>”. I saved the output of those
>    commands with the original topology, after changing the topology, after
>    repairing, after changing the replication strategy, and then again after
>    repairing. In no cases did the tokens change. It looks like nodetool ring
>    and nodetool info -T show the owner but not the replicas for a particular
>    range.
>
> The tokens and ranges shouldn't be changing, the replica placement should
be. AFAIK neither of those commands show you replica placement, they show
you primary range ownership.

Use getendpoints to determine replica placement before and after.


> I was expecting the replica placement to change. Because the racks were
> assigned in groups (rather than alternating), I was expecting the original
> replica placement with SimpleStrategy to be non-optimal after switching to
> NetworkTopologyStrategy. E.g.: if some data was replicated to nodes 1, 2
> and 3, then after the topology change there would be 2 replicas in RAC1, 1
> in RAC2 and none in RAC3. And hence when the repair ran, it would remove
> one replica from RAC1 and make sure that there was a replica in RAC3.
>

I would expect this to be the case.


> However, when I did a query using cqlsh at consistency QUORUM, I saw that
> it was hitting two replicas in the same rack, and a replica in a different
> rack. This suggests that the replica placement did not change after the
> topology change.
>

Perhaps you are seeing the quirks of the current rack-aware implementation,
explicated here?

https://issues.apache.org/jira/browse/CASSANDRA-3810


> Is there some way I can see which nodes have a replica for a given token
> range?
>

Not for a range, but for a given key with nodetool getendpoints.

I wonder if there would be value to the range... in the pre-vnode past I
have merely generated a key for each range. With the number of ranges
increased so dramatically by vnodes, it might be easier to have an endpoint
that works on ranges...

=Rob