Posted to user@cassandra.apache.org by Anubhav Kale <An...@microsoft.com> on 2016/03/23 16:07:22 UTC

Rack aware question.

Hello,

Suppose we change the racks on VMs on a running cluster. (We need to do this while running on Azure, because sometimes when the VM gets moved its rack changes).

In this situation, new writes will be laid out based on new rack info on appropriate replicas. What happens for existing data ? Is that data moved around as well and does it happen if we run repair or on its own ?


Thanks !

RE: Rack aware question.

Posted by Anubhav Kale <An...@microsoft.com>.
The consistency level ALL was only for my testing, so there could be a logical explanation for this. We use LOCAL_QUORUM in prod.

-------- Original message --------
From: Jack Krupansky <ja...@gmail.com>
Date: 3/23/2016 4:56 PM (GMT-08:00)
To: user@cassandra.apache.org
Subject: Re: Rack aware question.

CL=ALL also means that you won't have HA (High Availability) - if even a single node goes down, you're out of business. I mean, HA is the fundamental reason for using the rack-aware policy - to assure that each replica is on a separate power supply and network connection so that data can be retrieved even when a rack-level failure occurs.

In short, if CL=ALL is acceptable, then you might as well dump the rack-aware approach, which was how you got into this situation in the first place.

-- Jack Krupansky

On Wed, Mar 23, 2016 at 7:31 PM, Anubhav Kale <An...@microsoft.com> wrote:
I ran into the following detail from: https://wiki.apache.org/cassandra/ReadRepair

“If a lower ConsistencyLevel than ALL was specified, this is done in the background after returning the data from the closest replica to the client; otherwise, it is done before returning the data.”

I set consistency to ALL, and now I can get data all the time.
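
For reference, a minimal cqlsh sketch of that check (using the racktest.racktable table from the test further down in the thread):

cqlsh 127.0.0.1
>CONSISTENCY ALL;
>select * from racktest.racktable where id=1;

With a lower consistency level, the same select sometimes returned no rows, as described below.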

From: Anubhav Kale [mailto:Anubhav.Kale@microsoft.com]
Sent: Wednesday, March 23, 2016 4:14 PM

To: user@cassandra.apache.org
Subject: RE: Rack aware question.

Thanks. Read repair is what I thought must be causing this, so I experimented some more with setting read_repair_chance and dclocal_read_repair_chance on the table to 0, and then to 1.
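
A sketch of flipping those options per table (assuming the pre-4.0 CQL table options):

>ALTER TABLE racktest.racktable WITH read_repair_chance = 0.0 AND dclocal_read_repair_chance = 0.0;
>ALTER TABLE racktest.racktable WITH read_repair_chance = 1.0 AND dclocal_read_repair_chance = 1.0;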

Unfortunately, the results were somewhat random depending on which node I ran the queries from. For example, when chance = 1, running query from 127.0.0.3 would sometimes return 0 results and sometimes 1. I do see digest-mismatch-kicking-off-read-repair in traces in both cases, so running out of ideas here.  If you / someone can shed light on why this could be happening, that would be great !

That said, is it expected that “read repair” or a regular “nodetool repair” will shift the data around based on the new replica placement ? And if so, is the recommendation to “rebootstrap” mainly to avoid this humongous data movement ?

The rationale behind the ignore_rack flag makes sense, thanks. Maybe we should document it better ?

Thanks !

From: Paulo Motta [mailto:pauloricardomg@gmail.com]
Sent: Wednesday, March 23, 2016 3:40 PM
To: user@cassandra.apache.org
Subject: Re: Rack aware question.

> How come 127.0.0.1 is shown as an endpoint holding the ID when its token range doesn’t contain it ? Does “nodetool ring” shows all token-ranges for a node or just the primary range ? I am thinking its only primary. Can someone confirm ?

The primary replica of id=1 is always 127.0.0.3. What changes when you change racks is that the secondary replica moves to the next node on the ring that is in a different rack, either 127.0.0.1 or 127.0.0.2.

> How come queries contact 127.0.0.1 ?

in the last case, 127.0.0.1 is the next node after the primary replica from a different rack (R2), so it should be contacted

> Is “getendpoints” acting odd here and the data really is on 127.0.0.2 ? To prove / disprove that, I stopped 127.0.0.2 and ran a query with CONSISTENCY ALL, and it came back just fine meaning 127.0.0.1 indeed hold the data (SS Tables also show it). So, does this mean that the data actually gets moved around when racks change ?

probably during some of your queries 127.0.0.3 (the primary replica) replicated data to 127.0.0.1 with read repair. There is no automatic data move when rack is changed (at least in OSS C*, not sure if DSE has this ability)
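
One way to confirm where the row physically lives is to flush and inspect the table's data directory on each node, roughly (a sketch; the path assumes the default data_file_directories location):

nodetool -h 127.0.0.1 flush racktest racktable
ls /var/lib/cassandra/data/racktest/racktable-*/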

> If we don’t want to support this ever, I’d think the ignore_rack flag should just be deprecated.

ignore_rack flag can be useful if you move your data manually, with rsync or sstableloader.
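
For example, a manual load could look roughly like this (a sketch; the host is a placeholder, and the sstables must sit under a keyspace/table directory structure):

sstableloader -d 127.0.0.1 /path/to/backup/racktest/racktable/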

2016-03-23 19:09 GMT-03:00 Anubhav Kale <An...@microsoft.com>>:
Thanks for the pointer – appreciate it.

My test is on the latest trunk and slightly different.

I am not exactly sure if the behavior I see is expected (in which case, is the recommendation to re-bootstrap just to avoid data movement?) or is the behavior not expected and is a bug.

If we don’t want to support this ever, I’d think the ignore_rack flag should just be deprecated.

From: Robert Coli [mailto:rcoli@eventbrite.com]
Sent: Wednesday, March 23, 2016 2:54 PM

To: user@cassandra.apache.org
Subject: Re: Rack aware question.

Actually, I believe you are seeing the behavior described in the ticket I meant to link to, with the detailed exploration :

https://issues.apache.org/jira/browse/CASSANDRA-10238

=Rob


On Wed, Mar 23, 2016 at 2:06 PM, Anubhav Kale <An...@microsoft.com>> wrote:
Oh, and the query I ran was “select * from racktest.racktable where id=1”

From: Anubhav Kale [mailto:Anubhav.Kale@microsoft.com]
Sent: Wednesday, March 23, 2016 2:04 PM
To: user@cassandra.apache.org
Subject: RE: Rack aware question.

Thanks.

To test what happens when rack of a node changes in a running cluster without doing a decommission, I did the following.

The cluster looks like below (this was run through Eclipse, therefore the IP address hack)

IP          Rack
127.0.0.1   R1
127.0.0.2   R1
127.0.0.3   R2


A table was created and a row inserted as follows:

Cqlsh 127.0.0.1
>create keyspace racktest with replication = { 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 2 };
>create table racktest.racktable(id int, PRIMARY KEY(id));
>insert into racktest.racktable(id) values(1);

nodetool getendpoints racktest racktable 1

127.0.0.2
127.0.0.3

Nodetool ring > ring_1.txt (attached)

So far so good.

Then I changed the racks to below and restarted DSE with -Dcassandra.ignore_rack=true.
From what I found, this option simply skips the startup check that compares the rack in system.local with the one in cassandra-rackdc.properties.
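
For reference, the rack assignment and the override look roughly like this (a sketch assuming GossipingPropertyFileSnitch; file locations vary by install):

# conf/cassandra-rackdc.properties
dc=datacenter1
rack=R2

# cassandra-env.sh (or the DSE equivalent)
JVM_OPTS="$JVM_OPTS -Dcassandra.ignore_rack=true"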

IP          Rack
127.0.0.1   R1
127.0.0.2   R2
127.0.0.3   R1


nodetool getendpoints racktest racktable 1

127.0.0.2
127.0.0.3

So far so good, cqlsh returns the queries fine.

Nodetool ring > ring_2.txt (attached)

Now comes the interesting part.

I changed the racks to below and restarted DSE.

IP          Rack
127.0.0.1   R2
127.0.0.2   R1
127.0.0.3   R1


nodetool getendpoints racktest racktable 1

127.0.0.1
127.0.0.3

This is very interesting: cqlsh returns the queries fine, and with tracing on, it’s clear that 127.0.0.1 is being asked for data as well.
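
A trace like that can be captured with the cqlsh built-in, roughly:

cqlsh 127.0.0.1
>TRACING ON;
>select * from racktest.racktable where id=1;

The trace output then shows which replicas the coordinator contacted for the read.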

Nodetool ring > ring_3.txt (attached)

There is no change in token information in the ring_* files. The token in question for id=1 (from select token(id) from racktest.racktable) is -4069959284402364209.

So, a few questions because things don’t add up:


  1.  How come 127.0.0.1 is shown as an endpoint holding the ID when its token range doesn’t contain it ? Does “nodetool ring” show all token-ranges for a node or just the primary range ? I am thinking it's only the primary. Can someone confirm ?
  2.  How come queries contact 127.0.0.1 ?
  3.  Is “getendpoints” acting odd here and the data really is on 127.0.0.2 ? To prove / disprove that, I stopped 127.0.0.2 and ran a query with CONSISTENCY ALL, and it came back just fine, meaning 127.0.0.1 indeed holds the data (SSTables also show it).
  4.  So, does this mean that the data actually gets moved around when racks change ?

Thanks !


From: Robert Coli [mailto:rcoli@eventbrite.com]
Sent: Wednesday, March 23, 2016 11:59 AM
To: user@cassandra.apache.org
Subject: Re: Rack aware question.

On Wed, Mar 23, 2016 at 8:07 AM, Anubhav Kale <An...@microsoft.com>> wrote:
Suppose we change the racks on VMs on a running cluster. (We need to do this while running on Azure, because sometimes when the VM gets moved its rack changes).

In this situation, new writes will be laid out based on new rack info on appropriate replicas. What happens for existing data ? Is that data moved around as well and does it happen if we run repair or on its own ?

First, you should understand this ticket if relying on rack awareness :

https://issues.apache.org/jira/browse/CASSANDRA-3810

Second, in general nodes cannot move between racks.

https://issues.apache.org/jira/browse/CASSANDRA-10242

Has some detailed explanations of what blows up if they do.

Note that if you want to preserve any of the data on the node, you need to :

1) bring it up and have it join the ring in its new rack (during which time it will serve incorrect reads due to missing data)
2) stop it
3) run cleanup
4) run repair
5) start it again
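
Steps 3 and 4 are plain nodetool invocations on the node, roughly (a sketch; the keyspace argument is optional):

nodetool cleanup racktest
nodetool repair racktest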

Can't really say that I recommend this practice, but it's better than "rebootstrap it" which is the official advice. If you "rebootstrap it" you decrease unique replica count by 1, which has a nonzero chance of data-loss. The Coli Conjecture says that in practice you probably don't care about this nonzero chance of data loss if you are running your application in CL.ONE, which should be all cases where it matters.

=Rob





Re: Rack aware question.

Posted by Jonathan Haddad <jo...@jonhaddad.com>.
Agreed with Jack.

I don't think there's ever a reason to use CL=ALL in an application in
production.  I would only use it if I was debugging certain types of
consistency problems.

Re: Rack aware question.

Posted by Jack Krupansky <ja...@gmail.com>.
CL=ALL also means that you won't have HA (High Availability) - if even a
single node goes down, you're out of business. I mean, HA is the
fundamental reason for using the rack-aware policy - to assure that each
replica is on a separate power supply and network connection so that data
can be retrieved even when a rack-level failure occurs.

In short, if CL=ALL is acceptable, then you might as well dump the
rack-aware approach, which was how you got into this situation in the first
place.

-- Jack Krupansky

RE: Rack aware question.

Posted by Anubhav Kale <An...@microsoft.com>.
I ran into the following detail from : https://wiki.apache.org/cassandra/ReadRepair

“If a lower ConsistencyLevel than ALL was specified, this is done in the background after returning the data from the closest replica to the client; otherwise, it is done before returning the data.”

I set consistency to ALL, and now I can get data all the time.

RE: Rack aware question.

Posted by Anubhav Kale <An...@microsoft.com>.
Thanks. Read repair is what I thought must be causing this, so I experimented some more with setting read_repair_chance and dclocal_read_repair_chance on the table to 0, and then to 1.

Unfortunately, the results were somewhat random depending on which node I ran the queries from. For example, when chance = 1, running query from 127.0.0.3 would sometimes return 0 results and sometimes 1. I do see digest-mismatch-kicking-off-read-repair in traces in both cases, so running out of ideas here.  If you / someone can shed light on why this could be happening, that would be great !

That said, is it expected that “read repair” or a regular “nodetool repair” will shift the data around based on new replica placement ? And, if so is the recommendation to “rebootstrap” to mainly avoid this humongous data movement ?

The rationale behind ignore_rack flag makes sense, thanks. Maybe, we should document it better ?

Thanks !

Re: Rack aware question.

Posted by Paulo Motta <pa...@gmail.com>.
> How come 127.0.0.1 is shown as an endpoint holding the ID when its token
range doesn’t contain it ? Does “nodetool ring” shows all token-ranges for
a node or just the primary range ? I am thinking its only primary. Can
someone confirm ?

The primary replica of id=1 is always 127.0.0.3. What changes when you
change racks is that the secondary replica will move to the next replica
from a different rack, either 127.0.0.1 or 127.0.0.2.

> How come queries contact 127.0.0.1 ?

in the last case, 127.0.0.1 is the next node after the primary replica from
a different rack (R2), so it should be contacted

> Is “getendpoints” acting odd here and the data really is on 127.0.0.2 ?
To prove / disprove that, I stopped 127.0.0.2 and ran a query with
CONSISTENCY ALL, and it came back just fine meaning 127.0.0.1 indeed hold
the data (SS Tables also show it). So, does this mean that the data
actually gets moved around when racks change ?

probably during some of your queries 127.0.0.3 (the primary replica)
replicated data to 127.0.0.1 with read repair. There is no automatic data
move when rack is changed (at least in OSS C*, not sure if DSE has this
ability)

> If we don’t want to support this ever, I’d think the ignore_rack flag
should just be deprecated.

ignore_rack flag can be useful if you move your data manually, with rsync
or sstableloader.

RE: Rack aware question.

Posted by Anubhav Kale <An...@microsoft.com>.
Thanks for the pointer – appreciate it.

My test is on the latest trunk and is slightly different.

I am not exactly sure whether the behavior I see is expected (in which case, is the recommendation to re-bootstrap simply to avoid this data movement?) or whether it is unexpected and is a bug.

If we never intend to support rack changes, I’d think the ignore_rack flag should just be deprecated.

From: Robert Coli [mailto:rcoli@eventbrite.com]
Sent: Wednesday, March 23, 2016 2:54 PM
To: user@cassandra.apache.org
Subject: Re: Rack aware question.

Actually, I believe you are seeing the behavior described in the ticket I meant to link to, with the detailed exploration :

https://issues.apache.org/jira/browse/CASSANDRA-10238

=Rob


On Wed, Mar 23, 2016 at 2:06 PM, Anubhav Kale <An...@microsoft.com>> wrote:
Oh, and the query I ran was “select * from racktest.racktable where id=1”

From: Anubhav Kale [mailto:Anubhav.Kale@microsoft.com<ma...@microsoft.com>]
Sent: Wednesday, March 23, 2016 2:04 PM
To: user@cassandra.apache.org<ma...@cassandra.apache.org>
Subject: RE: Rack aware question.

Thanks.

To test what happens when rack of a node changes in a running cluster without doing a decommission, I did the following.

The cluster looks like below (this was run through Eclipse, therefore the IP address hack)

IP

127.0.0.1

127.0.0.2

127.0.0.3

Rack

R1

R1

R2


A table was created and a row inserted as follows:

Cqlsh 127.0.0.1
>create keyspace racktest with replication = { 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 2 };
>create table racktest.racktable(id int, PRIMARY KEY(id));
>insert into racktest.racktable(id) values(1);

nodetool getendpoints racktest racktable 1

127.0.0.2
127.0.0.3

Nodetool ring > ring_1.txt (attached)

So far so good.

Then I changed the racks to below and restarted DSE with –Dcassandra.ignore_rack=true.
This option from my finding simply avoids the check on startup that compares the rack in system.local with the one in rack-dc.properties.

IP

127.0.0.1

127.0.0.2

127.0.0.3

Rack

R1

R2

R1


nodetool getendpoints racktest racktable 1

127.0.0.2
127.0.0.3

So far so good, cqlsh returns the queries fine.

Nodetool ring > ring_2.txt (attached)

Now comes the interesting part.

I changed the racks to below and restarted DSE.

IP

127.0.0.1

127.0.0.2

127.0.0.3

Rack

R2

R1

R1


nodetool getendpoints racktest racktable 1

127.0.0.1
127.0.0.3

This is very interesting, cqlsh returns the queries fine. With tracing on, it’s clear that the 127.0.0.1 is being asked for data as well.

Nodetool ring > ring_3.txt (attached)

There is no change in token information in ring_* files. The token under question for id=1 (from select token(id) from racktest.racktable) is -4069959284402364209.

So, few questions because things don’t add up:


  1.  How come 127.0.0.1 is shown as an endpoint holding the ID when its token range doesn’t contain it ? Does “nodetool ring” shows all token-ranges for a node or just the primary range ? I am thinking its only primary. Can someone confirm ?
  2.  How come queries contact 127.0.0.1 ?
  3.  Is “getendpoints” acting odd here and the data really is on 127.0.0.2 ? To prove / disprove that, I stopped 127.0.0.2 and ran a query with CONSISTENCY ALL, and it came back just fine meaning 127.0.0.1 indeed hold the data (SS Tables also show it).
  4.  So, does this mean that the data actually gets moved around when racks change ?

Thanks !


From: Robert Coli [mailto:rcoli@eventbrite.com]
Sent: Wednesday, March 23, 2016 11:59 AM
To: user@cassandra.apache.org<ma...@cassandra.apache.org>
Subject: Re: Rack aware question.

On Wed, Mar 23, 2016 at 8:07 AM, Anubhav Kale <An...@microsoft.com>> wrote:
Suppose we change the racks on VMs on a running cluster. (We need to do this while running on Azure, because sometimes when the VM gets moved its rack changes).

In this situation, new writes will be laid out based on new rack info on appropriate replicas. What happens for existing data ? Is that data moved around as well and does it happen if we run repair or on its own ?

First, you should understand this ticket if relying on rack awareness :

https://issues.apache.org/jira/browse/CASSANDRA-3810

Second, in general nodes cannot move between racks.

https://issues.apache.org/jira/browse/CASSANDRA-10242

Has some detailed explanations of what blows up if they do.

Note that if you want to preserve any of the data on the node, you need to :

1) bring it and have it join the ring in its new rack (during which time it will serve incorrect reads due to missing data)
2) stop it
3) run cleanup
4) run repair
5) start it again

Can't really say that I recommend this practice, but it's better than "rebootstrap it" which is the official advice. If you "rebootstrap it" you decrease unique replica count by 1, which has a nonzero chance of data-loss. The Coli Conjecture says that in practice you probably don't care about this nonzero chance of data loss if you are running your application in CL.ONE, which should be all cases where it matters.

=Rob



Re: Rack aware question.

Posted by Robert Coli <rc...@eventbrite.com>.
Actually, I believe you are seeing the behavior described in the ticket I
meant to link to, with the detailed exploration :

https://issues.apache.org/jira/browse/CASSANDRA-10238

=Rob


On Wed, Mar 23, 2016 at 2:06 PM, Anubhav Kale <An...@microsoft.com>
wrote:

> Oh, and the query I ran was “select * from racktest.racktable where id=1”
>
>
>
> *From:* Anubhav Kale [mailto:Anubhav.Kale@microsoft.com]
> *Sent:* Wednesday, March 23, 2016 2:04 PM
> *To:* user@cassandra.apache.org
> *Subject:* RE: Rack aware question.
>
>
>
> Thanks.
>
>
>
> To test what happens when rack of a node changes in a running cluster
> without doing a decommission, I did the following.
>
>
>
> The cluster looks like below (this was run through Eclipse, therefore the
> IP address hack)
>
>
>
> *IP*
>
> 127.0.0.1
>
> 127.0.0.2
>
> 127.0.0.3
>
> *Rack*
>
> R1
>
> R1
>
> R2
>
>
>
> A table was created and a row inserted as follows:
>
>
>
> Cqlsh 127.0.0.1
>
> >create keyspace racktest with replication = { 'class' :
> 'NetworkTopologyStrategy', 'datacenter1' : 2 };
>
> >create table racktest.racktable(id int, PRIMARY KEY(id));
>
> >insert into racktest.racktable(id) values(1);
>
>
>
> nodetool getendpoints racktest racktable 1
>
>
>
> 127.0.0.2
>
> 127.0.0.3
>
>
>
> Nodetool ring > ring_1.txt (attached)
>
>
>
> So far so good.
>
>
>
> Then I changed the racks to below and restarted DSE with
> –Dcassandra.ignore_rack=true.
>
> This option from my finding simply avoids the check on startup that
> compares the rack in system.local with the one in rack-dc.properties.
>
>
>
> *IP*
>
> 127.0.0.1
>
> 127.0.0.2
>
> 127.0.0.3
>
> *Rack*
>
> R1
>
> R2
>
> R1
>
>
>
> nodetool getendpoints racktest racktable 1
>
>
>
> 127.0.0.2
>
> 127.0.0.3
>
>
>
> So far so good, cqlsh returns the queries fine.
>
>
>
> Nodetool ring > ring_2.txt (attached)
>
>
>
> Now comes the interesting part.
>
>
>
> I changed the racks to below and restarted DSE.
>
>
>
> *IP*
>
> 127.0.0.1
>
> 127.0.0.2
>
> 127.0.0.3
>
> *Rack*
>
> R2
>
> R1
>
> R1
>
>
>
> nodetool getendpoints racktest racktable 1
>
>
>
> 127.0.0.*1*
>
> 127.0.0.3
>
>
>
> This is *very* interesting, cqlsh returns the queries fine. With tracing
> on, it’s clear that the 127.0.0.1 is being asked for data as well.
>
>
>
> Nodetool ring > ring_3.txt (attached)
>
>
>
> There is no change in token information in ring_* files. The token under
> question for id=1 (from select token(id) from racktest.racktable) is
> -4069959284402364209.
>
>
>
> So, few questions because things don’t add up:
>
>
>
>    1. How come 127.0.0.1 is shown as an endpoint holding the ID when its
>    token range doesn’t contain it ? Does “nodetool ring” shows all
>    token-ranges for a node or just the primary range ? I am thinking its only
>    primary. Can someone confirm ?
>    2. How come queries contact 127.0.0.1 ?
>    3. Is “getendpoints” acting odd here and the data really is on
>    127.0.0.2 ? To prove / disprove that, I stopped 127.0.0.2 and ran a query
>    with CONSISTENCY ALL, and it came back just fine meaning 127.0.0.1 indeed
>    hold the data (SS Tables also show it).
>    4. So, does this mean that the data actually gets moved around when
>    racks change ?
>
>
>
> Thanks !
>
>
>
>
>
> *From:* Robert Coli [mailto:rcoli@eventbrite.com <rc...@eventbrite.com>]
> *Sent:* Wednesday, March 23, 2016 11:59 AM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Rack aware question.
>
>
>
> On Wed, Mar 23, 2016 at 8:07 AM, Anubhav Kale <An...@microsoft.com>
> wrote:
>
> Suppose we change the racks on VMs on a running cluster. (We need to do
> this while running on Azure, because sometimes when the VM gets moved its
> rack changes).
>
>
>
> In this situation, new writes will be laid out based on new rack info on
> appropriate replicas. What happens for existing data ? Is that data moved
> around as well and does it happen if we run repair or on its own ?
>
>
>
> First, you should understand this ticket if relying on rack awareness :
>
>
>
> https://issues.apache.org/jira/browse/CASSANDRA-3810
>
>
>
> Second, in general nodes cannot move between racks.
>
>
>
> https://issues.apache.org/jira/browse/CASSANDRA-10242
>
>
>
> Has some detailed explanations of what blows up if they do.
>
>
>
> Note that if you want to preserve any of the data on the node, you need to
> :
>
>
>
> 1) bring it and have it join the ring in its new rack (during which time
> it will serve incorrect reads due to missing data)
>
> 2) stop it
>
> 3) run cleanup
>
> 4) run repair
>
> 5) start it again
>
>
>
> Can't really say that I recommend this practice, but it's better than
> "rebootstrap it" which is the official advice. If you "rebootstrap it" you
> decrease unique replica count by 1, which has a nonzero chance of
> data-loss. The Coli Conjecture says that in practice you probably don't
> care about this nonzero chance of data loss if you are running your
> application in CL.ONE, which should be all cases where it matters.
>
>
>
> =Rob
>
>
>

RE: Rack aware question.

Posted by Anubhav Kale <An...@microsoft.com>.
Oh, and the query I ran was “select * from racktest.racktable where id=1”

From: Anubhav Kale [mailto:Anubhav.Kale@microsoft.com]
Sent: Wednesday, March 23, 2016 2:04 PM
To: user@cassandra.apache.org
Subject: RE: Rack aware question.

Thanks.

To test what happens when rack of a node changes in a running cluster without doing a decommission, I did the following.

The cluster looks like below (this was run through Eclipse, therefore the IP address hack)

IP

127.0.0.1

127.0.0.2

127.0.0.3

Rack

R1

R1

R2


A table was created and a row inserted as follows:

Cqlsh 127.0.0.1
>create keyspace racktest with replication = { 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 2 };
>create table racktest.racktable(id int, PRIMARY KEY(id));
>insert into racktest.racktable(id) values(1);

nodetool getendpoints racktest racktable 1

127.0.0.2
127.0.0.3

Nodetool ring > ring_1.txt (attached)

So far so good.

Then I changed the racks to below and restarted DSE with –Dcassandra.ignore_rack=true.
This option from my finding simply avoids the check on startup that compares the rack in system.local with the one in rack-dc.properties.

IP

127.0.0.1

127.0.0.2

127.0.0.3

Rack

R1

R2

R1


nodetool getendpoints racktest racktable 1

127.0.0.2
127.0.0.3

So far so good, cqlsh returns the queries fine.

Nodetool ring > ring_2.txt (attached)

Now comes the interesting part.

I changed the racks to below and restarted DSE.

IP

127.0.0.1

127.0.0.2

127.0.0.3

Rack

R2

R1

R1


nodetool getendpoints racktest racktable 1

127.0.0.1
127.0.0.3

This is very interesting, cqlsh returns the queries fine. With tracing on, it’s clear that the 127.0.0.1 is being asked for data as well.

Nodetool ring > ring_3.txt (attached)

There is no change in token information in ring_* files. The token under question for id=1 (from select token(id) from racktest.racktable) is -4069959284402364209.

So, few questions because things don’t add up:


  1.  How come 127.0.0.1 is shown as an endpoint holding the ID when its token range doesn’t contain it ? Does “nodetool ring” shows all token-ranges for a node or just the primary range ? I am thinking its only primary. Can someone confirm ?
  2.  How come queries contact 127.0.0.1 ?
  3.  Is “getendpoints” acting odd here and the data really is on 127.0.0.2 ? To prove / disprove that, I stopped 127.0.0.2 and ran a query with CONSISTENCY ALL, and it came back just fine meaning 127.0.0.1 indeed hold the data (SS Tables also show it).
  4.  So, does this mean that the data actually gets moved around when racks change ?

Thanks !


From: Robert Coli [mailto:rcoli@eventbrite.com]
Sent: Wednesday, March 23, 2016 11:59 AM
To: user@cassandra.apache.org<ma...@cassandra.apache.org>
Subject: Re: Rack aware question.

On Wed, Mar 23, 2016 at 8:07 AM, Anubhav Kale <An...@microsoft.com>> wrote:
Suppose we change the racks on VMs on a running cluster. (We need to do this while running on Azure, because sometimes when the VM gets moved its rack changes).

In this situation, new writes will be laid out based on new rack info on appropriate replicas. What happens for existing data ? Is that data moved around as well and does it happen if we run repair or on its own ?

First, you should understand this ticket if relying on rack awareness :

https://issues.apache.org/jira/browse/CASSANDRA-3810

Second, in general nodes cannot move between racks.

https://issues.apache.org/jira/browse/CASSANDRA-10242

Has some detailed explanations of what blows up if they do.

Note that if you want to preserve any of the data on the node, you need to :

1) bring it and have it join the ring in its new rack (during which time it will serve incorrect reads due to missing data)
2) stop it
3) run cleanup
4) run repair
5) start it again

Can't really say that I recommend this practice, but it's better than "rebootstrap it" which is the official advice. If you "rebootstrap it" you decrease unique replica count by 1, which has a nonzero chance of data-loss. The Coli Conjecture says that in practice you probably don't care about this nonzero chance of data loss if you are running your application in CL.ONE, which should be all cases where it matters.

=Rob


RE: Rack aware question.

Posted by Anubhav Kale <An...@microsoft.com>.
Thanks.

To test what happens when the rack of a node changes in a running cluster without doing a decommission, I did the following.

The cluster looks like the one below (it was run through Eclipse, hence the loopback IP addresses).

IP          Rack
127.0.0.1   R1
127.0.0.2   R1
127.0.0.3   R2


A table was created and a row inserted as follows:

cqlsh 127.0.0.1
>create keyspace racktest with replication = { 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 2 };
>create table racktest.racktable(id int, PRIMARY KEY(id));
>insert into racktest.racktable(id) values(1);

nodetool getendpoints racktest racktable 1

127.0.0.2
127.0.0.3

Nodetool ring > ring_1.txt (attached)

So far so good.
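
As a sanity check, the rack each node currently reports can also be read from the system tables (a minimal sketch; the rack column exists in both system.local and system.peers):

cqlsh 127.0.0.1
>select rack, data_center from system.local;
>select peer, rack, data_center from system.peers;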

Then I changed the racks as shown below and restarted DSE with -Dcassandra.ignore_rack=true.
From what I can tell, this option simply skips the startup check that compares the rack stored in system.local with the one in cassandra-rackdc.properties.
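
For reference, one way to pass the flag at startup (a sketch assuming a stock cassandra-env.sh; the exact file differs for packaged DSE installs):

JVM_OPTS="$JVM_OPTS -Dcassandra.ignore_rack=true"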

IP          Rack
127.0.0.1   R1
127.0.0.2   R2
127.0.0.3   R1


nodetool getendpoints racktest racktable 1

127.0.0.2
127.0.0.3

So far so good, cqlsh returns the queries fine.

Nodetool ring > ring_2.txt (attached)

Now comes the interesting part.

I changed the racks to below and restarted DSE.

IP          Rack
127.0.0.1   R2
127.0.0.2   R1
127.0.0.3   R1


nodetool getendpoints racktest racktable 1

127.0.0.1
127.0.0.3

This is very interesting: cqlsh returns the query results fine, and with tracing on it’s clear that 127.0.0.1 is being asked for data as well.

Nodetool ring > ring_3.txt (attached)

There is no change in token information across the ring_* files. The token in question for id=1 (from select token(id) from racktest.racktable) is -4069959284402364209.

So, a few questions, because things don’t add up:


  1.  How come 127.0.0.1 is shown as an endpoint holding the ID when its token range doesn’t contain it? Does “nodetool ring” show all token ranges for a node or just the primary range? I am thinking it’s only the primary. Can someone confirm?
  2.  How come queries contact 127.0.0.1?
  3.  Is “getendpoints” acting odd here, with the data really on 127.0.0.2? To prove/disprove that, I stopped 127.0.0.2 and ran a query with CONSISTENCY ALL, and it came back just fine, meaning 127.0.0.1 does indeed hold the data (the SSTables also show it). A sketch of that check follows this list.
  4.  So, does this mean that the data actually gets moved around when racks change?
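
For question 3, the check was roughly the following (a sketch; assumes 127.0.0.2 has already been stopped):

cqlsh 127.0.0.1
>CONSISTENCY ALL;
>select * from racktest.racktable where id=1;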

Thanks !


From: Robert Coli [mailto:rcoli@eventbrite.com]
Sent: Wednesday, March 23, 2016 11:59 AM
To: user@cassandra.apache.org
Subject: Re: Rack aware question.

On Wed, Mar 23, 2016 at 8:07 AM, Anubhav Kale <An...@microsoft.com>> wrote:
Suppose we change the racks on VMs on a running cluster. (We need to do this while running on Azure, because sometimes when the VM gets moved its rack changes).

In this situation, new writes will be laid out based on new rack info on appropriate replicas. What happens for existing data ? Is that data moved around as well and does it happen if we run repair or on its own ?

First, you should understand this ticket if relying on rack awareness :

https://issues.apache.org/jira/browse/CASSANDRA-3810

Second, in general nodes cannot move between racks.

https://issues.apache.org/jira/browse/CASSANDRA-10242

Has some detailed explanations of what blows up if they do.

Note that if you want to preserve any of the data on the node, you need to :

1) bring it and have it join the ring in its new rack (during which time it will serve incorrect reads due to missing data)
2) stop it
3) run cleanup
4) run repair
5) start it again

Can't really say that I recommend this practice, but it's better than "rebootstrap it" which is the official advice. If you "rebootstrap it" you decrease unique replica count by 1, which has a nonzero chance of data-loss. The Coli Conjecture says that in practice you probably don't care about this nonzero chance of data loss if you are running your application in CL.ONE, which should be all cases where it matters.

=Rob


Re: Rack aware question.

Posted by Robert Coli <rc...@eventbrite.com>.
On Wed, Mar 23, 2016 at 8:07 AM, Anubhav Kale <An...@microsoft.com>
wrote:

> Suppose we change the racks on VMs on a running cluster. (We need to do
> this while running on Azure, because sometimes when the VM gets moved its
> rack changes).
>
> In this situation, new writes will be laid out based on new rack info on
> appropriate replicas. What happens for existing data ? Is that data moved
> around as well and does it happen if we run repair or on its own ?
>

First, you should understand this ticket if relying on rack awareness :

https://issues.apache.org/jira/browse/CASSANDRA-3810

Second, in general nodes cannot move between racks.

https://issues.apache.org/jira/browse/CASSANDRA-10242

Has some detailed explanations of what blows up if they do.

Note that if you want to preserve any of the data on the node, you need to :

1) bring it up and have it join the ring in its new rack (during which time it
will serve incorrect reads due to missing data)
2) stop it
3) run cleanup
4) run repair
5) start it again
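
In nodetool terms that is roughly the sketch below, reading "stop it" as "stop serving clients", since cleanup and repair need the daemon running (this is an interpretation, not a tested recipe):

nodetool disablebinary && nodetool disablethrift   # step 2: stop serving client traffic
nodetool cleanup                                   # step 3: drop data this node no longer owns
nodetool repair                                    # step 4: stream in data it now owns
nodetool enablethrift && nodetool enablebinary     # step 5: put it back in rotation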

Can't really say that I recommend this practice, but it's better than
"rebootstrap it" which is the official advice. If you "rebootstrap it" you
decrease unique replica count by 1, which has a nonzero chance of
data-loss. The Coli Conjecture says that in practice you probably don't
care about this nonzero chance of data loss if you are running your
application in CL.ONE, which should be all cases where it matters.

=Rob

Re: Rack aware question.

Posted by Clint Martin <cl...@coolfiretechnologies.com>.
I could be wrong on this since I've never actually attempted what you are
asking. Based on my understanding of how replica assignment is done, I
don't think that just changing the rack on an existing node is a good idea.

Changing racks for a node that already contains data would result in that
data not being readable, since the replica assignment and ownership would
not coincide with the data that physically exists on the node. Repair could
help by streaming data from other replicas. You would also need to run a
cleanup to remove data that the node is no longer a replica for. This
would essentially be the same as bootstrapping a new node after a
catastrophic failure.

If you need to move a node's rack, you should remove it from the cluster
properly and add it again as a new/clean node with the new rack
assignment. That way you are not in a degraded state while the data
replicates.
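
Roughly, that sequence would be (a sketch; assumes the node's data can be wiped and re-streamed):

nodetool decommission        # on the node being moved: stream its data off and leave the ring
<stop the process, wipe the data, commitlog and saved_caches directories>
<set the new rack in cassandra-rackdc.properties>
<start the node again so it bootstraps as a clean node in its new rack>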

Clint
On Mar 23, 2016 11:07 AM, "Anubhav Kale" <An...@microsoft.com> wrote:

> Hello,
>
> Suppose we change the racks on VMs on a running cluster. (We need to do
> this while running on Azure, because sometimes when the VM gets moved its
> rack changes).
>
> In this situation, new writes will be laid out based on new rack info on
> appropriate replicas. What happens for existing data ? Is that data moved
> around as well and does it happen if we run repair or on its own ?
>
>
> Thanks !
>