Posted to user@cassandra.apache.org by Gianluca Borello <gi...@sysdig.com> on 2016/02/26 03:51:09 UTC

Unexpected high internode network activity

Hello,

We have a Cassandra 2.1.9 cluster on EC2 for one of our live applications.
There's a total of 21 nodes across 3 AWS availability zones, c3.2xlarge
instances.

The configuration is pretty standard: we use the default settings that come
with the DataStax AMI, and the driver in our application is configured to
use lz4 compression. The keyspace where all the activity happens has RF 3,
and we read and write at QUORUM to get strong consistency.

While analyzing our monthly bill, we noticed that the amount of network
traffic related to Cassandra was significantly higher than expected. After
breaking it down by port, it turns out that, over any given time window,
the internode network activity is 6-7 times the traffic on port 9042,
whereas we would expect a factor of roughly 2-3, given the replication
factor and the consistency level of our queries.

For example, this is the network traffic broken down by port and direction
over a few minutes, summed across all nodes:

Port 9042 from client to cluster (write queries): 1 GB
Port 9042 from cluster to client (read queries): 1.5 GB
Port 7000: 35 GB, which must be divided by two because every byte is
counted once as it leaves one node and again as it arrives at another node
of the cluster, so that makes 17.5 GB of generated traffic
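
(For reference, a per-port breakdown like the one above can be captured on
each node with iftop and a bpf filter; a minimal sketch, where eth0 and the
five-minute window are assumptions about our setup:)

$ # native protocol (client) traffic: text mode, byte counters, 300 s window
$ iftop -i eth0 -f "port 9042" -t -B -s 300
$ # internode (storage protocol) traffic
$ iftop -i eth0 -f "port 7000" -t -B -s 300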

The traffic on port 9042 completely matches our expectations: we do about
100k write operations, each writing a 10 KB binary blob, and somewhat more
reads on the same data.

According to our calculations, in the worst case, when the coordinator of
the query is not a replica for the data, this should generate about (1 +
1.5) * 3 = 7.5 GB, and instead we see 17.5 GB, which is quite a lot more.

Also, hinted handoffs are disabled and the nodes are healthy over the
period of observation, and I get the same numbers across pretty much every
time window, even over an entire 24-hour period.

I tried to replicate this problem in a test environment, so I connected a
client to a test cluster made of a bunch of Docker containers (same
parameters; essentially the only difference is the
GossipingPropertyFileSnitch instead of the EC2 one), and I always get what
I expect: the amount of traffic on port 7000 is between 2 and 3 times the
amount of traffic on port 9042, with pretty much the same queries.

Before doing more analysis, I was wondering if someone has an explanation
for this behavior, since perhaps we are missing something obvious here?

Thanks

Re: Unexpected high internode network activity

Posted by Gianluca Borello <gi...@sysdig.com>.
Thank you for your reply.

- Repairs are not running on the cluster; in fact we've been "slacking"
when it comes to repair, mainly because we never manually delete our data
(it's always TTLed) and we haven't had major failures or outages that
required repairing data (I know that's not a good reason anyway)

- We are not using server-to-server encryption

- internode_compression is set to 'all', and the application driver uses lz4

- I just did a "nodetool flush && service cassandra restart" on one node of
the affected cluster and let it run for a few minutes, and these are the
statistics (all the nodes get the same ratio of network activity on port
9042 and port 7000, so pardon my raw estimates below in assuming that the
activity of a single node can reflect the activity of the whole cluster):

9042 traffic: 400 MB (split between 200 MB reads and 200 MB writes)
7000 traffic: 5 GB (counted twice by iftop, so 2.5 GB)

$ nodetool netstats -H
Mode: NORMAL
Not sending any streams.
Read Repair Statistics:
Attempted: 10167
Mismatch (Blocking): 210
Mismatch (Background): 151
Pool Name                    Active   Pending      Completed
Commands                        n/a         0         422986
Responses                       n/a         0         403144

If I do the same on a test cluster (with less activity and fewer nodes but
the same RF and configuration), I get, again for a single node:

9042 traffic: 250 MB (split between 100 MB reads and 150 MB writes)
7000 traffic: 1 GB (counted twice by iftop, so 500 MB)

$ nodetool netstats -H
Mode: NORMAL
Not sending any streams.
Read Repair Statistics:
Attempted: 6668
Mismatch (Blocking): 159
Mismatch (Background): 43
Pool Name                    Active   Pending      Completed
Commands                        n/a         0         125202
Responses                       n/a         0         141708

So, once again, in one cluster the internode activity is ~7 times the port
9042 traffic, whereas in the test cluster it is ~2 times, which is expected.

Thanks


On Fri, Feb 26, 2016 at 10:04 AM, Nate McCall <na...@thelastpickle.com>
wrote:

>
>> Unfortunately, these numbers still don't match at all.
>>
>> And yes, the cluster is in a single DC and since I am using the EC2
>> snitch, replicas are AZ aware.
>>
>>
> Are repairs running on the cluster?
>
> Other thoughts:
> - is internode_compression set to 'all' in cassandra.yaml (should be 'all'
> by default, but worth checking since you are using lz4 on the client)?
> - are you using server-to-server encryption ?
>
> You can compare the output of nodetool netstats on the test cluster with
> the AWS cluster as well to see if anything sticks out.

Re: Unexpected high internode network activity

Posted by Nate McCall <na...@thelastpickle.com>.
>
>
> Unfortunately, these numbers still don't match at all.
>
> And yes, the cluster is in a single DC and since I am using the EC2
> snitch, replicas are AZ aware.
>
>
Are repairs running on the cluster?

Other thoughts:
- is internode_compression set to 'all' in cassandra.yaml (should be 'all'
by default, but worth checking since you are using lz4 on the client)?
- are you using server-to-server encryption?

You can compare the output of nodetool netstats on the test cluster with
the AWS cluster as well to see if anything sticks out.
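
(A quick way to check both settings on a node, a minimal sketch assuming
the stock package layout with cassandra.yaml under /etc/cassandra:)

$ grep -E "^(internode_compression|hinted_handoff_enabled):" /etc/cassandra/cassandra.yaml
$ # internode_encryption: none here means server-to-server encryption is off
$ grep -A 3 "^server_encryption_options:" /etc/cassandra/cassandra.yaml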


-- 
-----------------
Nate McCall
Austin, TX
@zznate

Co-Founder & Sr. Technical Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

Re: Unexpected high internode network activity

Posted by Gianluca Borello <gi...@sysdig.com>.
I understand your point about the billing, but billing here was merely the
trigger that made me start analyzing the traffic in the first place.

At the moment, I'm not considering the numbers on my bill anymore, but
simply the numbers that I am measuring with iftop on each node of the
cluster. If I measure the total traffic on port 7000, I see 35 GB in the
example above, and since each byte is counted twice by iftop (because I'm
running it on every node), the cluster generated 17.5 GB of unique network
activity. That is the number I am trying to explain in relation to the
traffic I'm seeing on port 9042, billing aside.
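
(As a cross-check on that double counting, each byte can also be counted
exactly once by looking only at outbound traffic toward port 7000 on every
node, for instance with iptables byte counters; a sketch, assuming we are
free to add counting-only rules on the nodes:)

$ # a rule with no -j target only accumulates packet/byte counters
$ iptables -I OUTPUT -p tcp --dport 7000
$ # after the observation window, read the exact byte counter and reset it
$ iptables -L OUTPUT -v -n -x | grep "dpt:7000"
$ iptables -Z OUTPUT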

Unfortunately, these numbers still don't match at all.

And yes, the cluster is in a single DC and since I am using the EC2 snitch,
replicas are AZ aware.

Thanks

On Thursday, February 25, 2016, daemeon reiydelle <da...@gmail.com>
wrote:

> Hmm. From the AWS FAQ:
>
> *Q: If I have two instances in different availability zones, how will I be
> charged for regional data transfer?*
>
> Each instance is charged for its data in and data out. Therefore, if data
> is transferred between these two instances, it is charged out for the first
> instance and in for the second instance.
>
>
> I really am not seeing this factored into your numbers fully. If data
> transfer is only twice as much as expected, the above billing would seem to
> put the numbers in line. Since (I assume) you have one copy in EACH AZ (dc
> aware but really dc=az) I am not seeing the bandwidth as that much out of
> line.

Re: Unexpected high internode network activity

Posted by daemeon reiydelle <da...@gmail.com>.
Hmm. From the AWS FAQ:

*Q: If I have two instances in different availability zones, how will I be
charged for regional data transfer?*

Each instance is charged for its data in and data out. Therefore, if data
is transferred between these two instances, it is charged out for the first
instance and in for the second instance.


I really am not seeing this factored into your numbers fully. If data
transfer is only twice as much as expected, the above billing would seem to
put the numbers in line. Since (I assume) you have one copy in EACH AZ (DC
aware, but really DC=AZ), I am not seeing the bandwidth as that much out of
line.



*.......*



Daemeon C.M. Reiydelle
USA (+1) 415.501.0198
London (+44) (0) 20 8144 9872

On Thu, Feb 25, 2016 at 11:00 PM, Gianluca Borello <gi...@sysdig.com>
wrote:

> It is indeed very intriguing and I really hope to learn more from the
> experience of this mailing list. To address your points:
>
> - The theory that full data is coming from replicas during reads is not
> enough to explain the situation. In my scenario, over a time window I had
> 17.5 GB of intra node activity (port 7000) for 1 GB of writes and 1.5 GB of
> reads (measured on port 9042), so even if both reads and writes affected
> all replicas, I would have (1 + 1.5) * 3 = 7.5 GB, still leaving 10 GB on
> port 7000 unaccounted
>
> - We are doing regular backups the standard way, using periodic snapshots
> and synchronizing them to S3. This traffic is not part of the anomalous
> traffic we're seeing above, since this one goes on port 80 and it's clearly
> visible with a separate bpf filter, and its magnitude is far lower than
> that anyway
>
> Thanks

Re: Unexpected high internode network activity

Posted by Gianluca Borello <gi...@sysdig.com>.
It is indeed very intriguing and I really hope to learn more from the
experience of this mailing list. To address your points:

- The theory that full data is coming from replicas during reads is not
enough to explain the situation. In my scenario, over a time window I had
17.5 GB of internode activity (port 7000) for 1 GB of writes and 1.5 GB of
reads (measured on port 9042), so even if both reads and writes hit all
replicas, I would have (1 + 1.5) * 3 = 7.5 GB, still leaving 10 GB on
port 7000 unaccounted for

- We are doing regular backups the standard way, using periodic snapshots
and synchronizing them to S3. This traffic is not part of the anomalous
traffic we're seeing above, since it goes over port 80 and is clearly
visible with a separate bpf filter, and its magnitude is far lower than
that anyway
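
(For completeness, the backup flow is essentially the following; the
snapshot tag, keyspace and bucket names are placeholders:)

$ nodetool snapshot -t nightly our_keyspace
$ # each node then synchronizes its snapshot directories to S3, e.g.:
$ aws s3 sync /var/lib/cassandra/data/our_keyspace/ \
    s3://our-backup-bucket/$(hostname)/ \
    --exclude "*" --include "*/snapshots/nightly/*"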

Thanks

On Thu, Feb 25, 2016 at 9:03 PM, daemeon reiydelle <da...@gmail.com>
wrote:

> Intriguing. It's enough data to look like full data is coming from the
> replicants instead of digests when the read of the copy occurs. Are you
> doing backup/dr? Are directories copied regularly and over the network or ?

Re: Unexpected high internode network activity

Posted by daemeon reiydelle <da...@gmail.com>.
Intriguing. It's enough data to look like full data is coming from the
replicas instead of digests when the read of the copy occurs. Are you
doing backup/DR? Are directories copied regularly over the network, or in
some other way?


*.......*



Daemeon C.M. Reiydelle
USA (+1) 415.501.0198
London (+44) (0) 20 8144 9872

On Thu, Feb 25, 2016 at 8:12 PM, Gianluca Borello <gi...@sysdig.com>
wrote:

> Thank you for your reply.
>
> To answer your points:
>
> - I fully agree on the write volume, in fact my isolated tests confirm
> your estimation
>
> - About the read, I agree as well, but the volume of data is still much
> higher
>
> - I am writing to one single keyspace with RF 3, there's just one keyspace
>
> - I am not using any indexes, the column families are very simple
>
> - I am aware of the double count, in fact, I measured the traffic on port
> 9042 at the client side (so just counted once) and I divided by two the
> traffic on port 7000 as measured on each node (35 GB -> 17.5 GB). All the
> measurements have been done with iftop with proper bpf filters on the
> port and the total traffic matches what I see in cloudwatch (divided by two)
>
> So unfortunately I still don't have any ideas about what's going on and
> why I'm seeing 17 GB of internode traffic instead of ~ 5-6.

Re: Unexpected high internode network activity

Posted by Gianluca Borello <gi...@sysdig.com>.
Thank you for your reply.

To answer your points:

- I fully agree on the write volume; in fact my isolated tests confirm
your estimate

- About the reads, I agree as well, but the volume of data is still much
higher

- I am writing to a single keyspace with RF 3; there's just one keyspace

- I am not using any indexes; the column families are very simple

- I am aware of the double counting; in fact, I measured the traffic on port
9042 at the client side (so it is counted only once) and I divided by two
the traffic on port 7000 as measured on each node (35 GB -> 17.5 GB). All
the measurements have been done with iftop with proper bpf filters on the
port, and the total traffic matches what I see in CloudWatch (divided by two)

So unfortunately I still don't have any idea what's going on and why I'm
seeing 17.5 GB of internode traffic instead of roughly 5-6 GB.
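
(For reference, that rough expectation comes from a calculation along these
lines, plugging in the 1 GB written / 1.5 GB read from the example window;
a sketch, with message framing and digest overhead pushing the result
toward the 5-6 GB mentioned above:)

$ writes_gb=1; reads_gb=1.5
$ # quorum writes: the coordinator forwards each mutation to 2-3 replicas
$ # quorum reads: one full copy plus a small digest from a second replica
$ echo "low estimate:  $(echo "$writes_gb * 2 + $reads_gb" | bc) GB"
$ echo "high estimate: $(echo "$writes_gb * 3 + $reads_gb" | bc) GB"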

On Thursday, February 25, 2016, daemeon reiydelle <da...@gmail.com>
wrote:

> If read & write at quorum then you write 3 copies of the data then return
> to the caller; when reading you read one copy (assume it is not on the
> coordinator), and 1 digest (because read at quorum is 2, not 3).
>
> When you insert, how many keyspaces get written to? (Are you using e.g.
> inverted indices?) That is my guess, that your db has about 1.8 bytes
> written for every byte inserted.
>
> Every byte you write is counted also as a read (system a sends 1gb to
> system b, so system b receives 1gb). You would not be charged if intra AZ,
> but inter AZ and inter DC will get that double count.
>
> So, my guess is reverse indexes, and you forgot to include receive and
> transmit.

Re: Unexpected high internode network activity

Posted by daemeon reiydelle <da...@gmail.com>.
If you read & write at quorum, then a write sends 3 copies of the data
before returning to the caller; a read fetches one full copy (assume it is
not on the coordinator) and 1 digest (because a quorum read is 2 replicas,
not 3).

When you insert, how many keyspaces get written to? (Are you using e.g.
inverted indices?) That is my guess: that your db has about 1.8 bytes
written for every byte inserted.
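
(A quick way to check is to look at the schema from cqlsh; a sketch with
placeholder keyspace/table names:)

$ cqlsh -e "DESCRIBE KEYSPACES"
$ # look for secondary indexes (CREATE INDEX ...) in the table definition
$ cqlsh -e "DESCRIBE TABLE our_keyspace.our_table"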

Every byte you write is counted also as a read (system A sends 1 GB to
system B, so system B receives 1 GB). You would not be charged if intra-AZ,
but inter-AZ and inter-DC traffic will get that double count.

So, my guess is reverse indexes, and you forgot to include receive and
transmit.


*.......*



Daemeon C.M. Reiydelle
USA (+1) 415.501.0198
London (+44) (0) 20 8144 9872

On Thu, Feb 25, 2016 at 6:51 PM, Gianluca Borello <gi...@sysdig.com>
wrote:

> Hello,
>
> We have a Cassandra 2.1.9 cluster on EC2 for one of our live applications.
> There's a total of 21 nodes across 3 AWS availability zones, c3.2xlarge
> instances.
>
> The configuration is pretty standard, we use the default settings that
> come with the datastax AMI and the driver in our application is configured
> to use lz4 compression. The keyspace where all the activity happens has RF
> 3 and we read and write at quorum to get strong consistency.
>
> While analyzing our monthly bill, we noticed that the amount of network
> traffic related to Cassandra was significantly higher than expected. After
> breaking it down by port, it seems like over any given time, the internode
> network activity is 6-7 times higher than the traffic on port 9042, whereas
> we would expect something around 2-3 times, given the replication factor
> and the consistency level of our queries.
>
> For example, this is the network traffic broken down by port and direction
> over a few minutes, measured as sum of each node:
>
> Port 9042 from client to cluster (write queries): 1 GB
> Port 9042 from cluster to client (read queries): 1.5 GB
> Port 7000: 35 GB, which must be divided by two because the traffic is
> always directed to another instance of the cluster, so that makes it 17.5
> GB generated traffic
>
> The traffic on port 9042 completely matches our expectations, we do about
> 100k write operations writing 10KB binary blobs for each query, and a bit
> more reads on the same data.
>
> According to our calculations, in the worst case, when the coordinator of
> the query is not a replica for the data, this should generate about (1 +
> 1.5) * 3 = 7.5 GB, and instead we see 17 GB, which is quite a lot more.
>
> Also, hinted handoffs are disabled and nodes are healthy over the period
> of observation, and I get the same numbers across pretty much every time
> window, even including an entire 24 hours period.
>
> I tried to replicate this problem in a test environment so I connected a
> client to a test cluster done in a bunch of Docker containers (same
> parameters, essentially the only difference is the
> GossipingPropertyFileSnitch instead of the EC2 one) and I always get what I
> expect, the amount of traffic on port 7000 is between 2 and 3 times the
> amount of traffic on port 9042 and the queries are pretty much the same
> ones.
>
> Before doing more analysis, I was wondering if someone has an explanation
> on this problem, since perhaps we are missing something obvious here?
>
> Thanks
>
>
>