Posted to user@cassandra.apache.org by Nikolai Grigoriev <ng...@gmail.com> on 2014/11/20 16:52:15 UTC

coordinator selection in remote DC

Hi,

There is something odd I have observed when testing a configuration with
two DCs for the first time. I wanted to do a simple functional test to prove
to myself (and my pessimistic colleagues ;) ) that it works.

I have a test cluster of 6 nodes, 3 in each DC, and a keyspace that is
replicated as follows:

CREATE KEYSPACE xxxxxxx WITH replication = {

  'class': 'NetworkTopologyStrategy',

  'DC2': '3',

  'DC1': '3'

};


I have disabled the traffic compression between DCs to get more accurate
numbers.
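
For reference, I believe the cassandra.yaml knob for this is the internode
compression setting (shown below with the value I think I used - treat the
exact value as an assumption on my side):

internode_compression: none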

I have set up a bunch of IP accounting rules on each node so that it counts
the outgoing traffic from that node to each other node. I had rules for
different ports but, of course, it is mostly about port 7000 (or 7001) when
talking about inter-node traffic. Anyway, I have a table that shows the
traffic from any node to any other node's port 7000.
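
Roughly, the accounting setup looks like this (just a sketch - the peer IPs
and ports are hard-coded for illustration, and the real rules I used differ
in detail):

#!/usr/bin/env python
# Sketch only: print iptables accounting rules. Rules with no -j target
# simply count matching packets/bytes and let them fall through.

PEERS = ["10.3.45.156", "10.3.45.157", "10.3.45.158",
         "10.3.45.159", "10.3.45.160", "10.3.45.161"]
PORTS = [7000, 7001, 9042]  # storage, SSL storage, native protocol

for peer in PEERS:
    for port in PORTS:
        print("iptables -A OUTPUT -d %s -p tcp --dport %d" % (peer, port))

# The byte counters can then be read periodically with:
#   iptables -L OUTPUT -v -x -n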

I ran a test with DCAwareRoundRobinPolicy and the client talking only to the
DC1 nodes. Everything looks fine - the client has sent an identical amount of
data to each of the 3 nodes in DC1. These nodes inside DC1 (I was writing
with LOCAL_ONE consistency) have sent a similar amount of data to each other,
which represents exactly the two extra replicas.
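
In driver terms, the relevant part of the client setup is roughly equivalent
to this (a minimal sketch with the Python driver; the table, columns and
values are made-up placeholders, not my actual test code):

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.policies import DCAwareRoundRobinPolicy
from cassandra.query import SimpleStatement

# Only the DC1 nodes are used as coordinators by the client.
cluster = Cluster(
    contact_points=["10.3.45.156", "10.3.45.157", "10.3.45.158"],
    load_balancing_policy=DCAwareRoundRobinPolicy(local_dc="DC1"))
session = cluster.connect("xxxxxxx")

# LOCAL_ONE: the write is acknowledged by a single replica in DC1, but the
# coordinator still ships it to the replicas in DC2 asynchronously.
insert = SimpleStatement(
    "INSERT INTO test_table (id, payload) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.LOCAL_ONE)
session.execute(insert, (42, "some payload"))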

However, when I look at the traffic from the nodes in DC1 to the nodes in
DC2, the picture is different:

source         destination    port       bytes sent

10.3.45.156    10.3.45.159    dpt:7000   117,273,075
10.3.45.156    10.3.45.160    dpt:7000   228,326,091
10.3.45.156    10.3.45.161    dpt:7000    46,924,339
10.3.45.157    10.3.45.159    dpt:7000   118,978,269
10.3.45.157    10.3.45.160    dpt:7000   230,444,929
10.3.45.157    10.3.45.161    dpt:7000    47,394,179
10.3.45.158    10.3.45.159    dpt:7000   113,969,248
10.3.45.158    10.3.45.160    dpt:7000   225,844,838
10.3.45.158    10.3.45.161    dpt:7000    46,338,939

Nodes 10.3.45.156-158 are in DC1, .159-161 in DC2. As you can see, each node
in DC1 has sent a very different amount of traffic to each of the remote
nodes: roughly 117 MB to .159, 228 MB to .160 and 46 MB to .161. Both DCs
have one rack.

So, here is my question: how does a node select the node in the remote DC to
send the message to? I did a quick sweep through the code and could only find
the sorting by proximity (checking the rack and DC). So, considering that for
each request I fire the targets are all 3 nodes in the remote DC, the list
will contain all 3 nodes in DC2. And, if I understood correctly, the first
node from that list is picked to send the message to.

So, it seems to me that no round-robin-type logic is applied when selecting,
from the list of targets in the remote DC, the node to forward the write to.

If this is true (and the numbers kind of show it is, right?), then perhaps
the list of nodes with equal proximity should be shuffled randomly? Or,
instead of picking the first target, a random one should be picked?
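
In pseudo-code, the behaviour I think I am seeing versus what I would have
expected (just a sketch of my reading of the code, not the actual
implementation - the helper names are made up):

import random

def pick_remote_target_current(remote_replicas, sort_by_proximity):
    # My reading of the code: sort by proximity (rack/DC only), then always
    # take the head of the list, so the same remote node keeps being picked.
    return sort_by_proximity(remote_replicas)[0]

def pick_remote_target_suggested(remote_replicas, sort_by_proximity):
    # What I would have expected: shuffle (or randomly pick among) nodes of
    # equal proximity so the forwarding load spreads across the remote DC.
    return random.choice(sort_by_proximity(remote_replicas))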


-- 
Nikolai Grigoriev

Re: coordinator selection in remote DC

Posted by Nikolai Grigoriev <ng...@gmail.com>.
Hmmm...I am using:

endpoint_snitch: com.datastax.bdp.snitch.DseDelegateSnitch

which is using:

delegated_snitch: org.apache.cassandra.locator.PropertyFileSnitch

(for this specific test cluster)

I did not check the code - is this snitch on by default and, maybe, used as a
wrapper for the configured endpoint_snitch?

It would explain the difference in the inter-DC traffic for sure. Also, it
would not affect the local DC traffic, as all nodes are replicas for the data
anyway.


On Thu, Nov 20, 2014 at 12:03 PM, Tyler Hobbs <ty...@datastax.com> wrote:

> The difference is likely due to the DynamicEndpointSnitch (aka dynamic
> snitch), which picks replicas to send messages to based on recently
> observed latency and self-reported load (accounting for compactions,
> repair, etc).  If you want to confirm this, you can disable the dynamic
> snitch by adding this line to cassandra.yaml: "dynamic_snitch: false".
>
> --
> Tyler Hobbs
> DataStax <http://datastax.com/>
>



-- 
Nikolai Grigoriev
(514) 772-5178

Re: coordinator selection in remote DC

Posted by Tyler Hobbs <ty...@datastax.com>.
>
> I did not check the code - is this snitch on by default and, maybe, used
> as a wrapper for the configured endpoint_snitch?


Yes, the dynamic snitch wraps whatever endpoint_snitch you configure, and
it's on by default.


> As I understand it, the dynamic snitch also has a default slight
> preference for the "primary" replica. This might be implicated here? While
> I don't especially like it and it's probably obsolete in the age of
> speculative retry, I believe its purpose is to increase cache heat on the
> primary replica.
>

It follows the underlying snitch's replica preference.  It only overrides
the underlying snitch's decision if the "badness score" makes those replicas
a significantly worse choice.  It's true that part of the purpose is to
reduce cache duplication across replicas, but speculative retry is orthogonal
to that.  (Speculative retry is simply about reducing 99th percentile
latencies.)
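
In pseudo-code, roughly (a sketch of the idea only, not the actual
DynamicEndpointSnitch code; badness_threshold here stands for
dynamic_snitch_badness_threshold, 0.1 by default):

def order_replicas(replicas, subsnitch_sort, score, badness_threshold=0.1):
    # Start from the ordering preferred by the wrapped (configured) snitch.
    ordered = subsnitch_sort(replicas)
    preferred = ordered[0]

    # Only re-order when the preferred replica's score (lower is better,
    # based on recent latency and reported load) is significantly worse
    # than some other replica's score.
    for replica in ordered[1:]:
        if score(preferred) > (1.0 + badness_threshold) * score(replica):
            return sorted(ordered, key=score)
    return ordered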

Re: coordinator selection in remote DC

Posted by Robert Coli <rc...@eventbrite.com>.
On Thu, Nov 20, 2014 at 9:03 AM, Tyler Hobbs <ty...@datastax.com> wrote:

> The difference is likely due to the DynamicEndpointSnitch (aka dynamic
> snitch), which picks replicas to send messages to based on recently
> observed latency and self-reported load (accounting for compactions,
> repair, etc).  If you want to confirm this, you can disable the dynamic
> snitch by adding this line to cassandra.yaml: "dynamic_snitch: false".
>

As I understand it, the dynamic snitch also has a default slight preference
for the "primary" replica. This might be implicated here? While I don't
especially like it and it's probably obsolete in the age of speculative
retry, I believe its purpose is to increase cache heat on the primary
replica.

=Rob

Re: coordinator selection in remote DC

Posted by Tyler Hobbs <ty...@datastax.com>.
The difference is likely due to the DynamicEndpointSnitch (aka dynamic
snitch), which picks replicas to send messages to based on recently
observed latency and self-reported load (accounting for compactions,
repair, etc).  If you want to confirm this, you can disable the dynamic
snitch by adding this line to cassandra.yaml: "dynamic_snitch: false".


-- 
Tyler Hobbs
DataStax <http://datastax.com/>