Posted to user@cassandra.apache.org by Bill Walters <bi...@gmail.com> on 2019/10/14 23:10:26 UTC

Elevated response times from all nodes in a data center at the same time.

Hi Everyone,

Need some suggestions regarding a peculiar issue we have been facing in
our production cluster for the last couple of days.

Here are our Production environment details.

AWS Regions: us-east-1 and us-west-2. Deployed over 3 availability zones in
each region.
No of Nodes: 24
Data Centers: 4 (6 nodes in each data center, 2 OLTP Data centers for APIs
and 2 OLAP Data centers for Analytics and Batch loads)
Instance Types: r5.8xlarge
Average Node Size: 182 GB
Work Load: Read heavy
Read TPS: 22k
Cassandra version: 3.0.15
Java Version: JDK 8u181.
EBS Volumes: GP2 with 1 TB, 3000 IOPS.

1. We have been running in production for more than a year and our
experience with Cassandra has been great. We have experienced little
hiccups here and there, but nothing severe.

2. But for the past couple of days we have been seeing a behavior where the
p99 latency in our AWS us-east-1 region OLTP data center suddenly starts
rising from 2 ms to 200 ms. It starts with one node, where we see the 99th
percentile read request latency in DataStax OpsCenter start to increase,
and it spreads immediately to all the other nodes in the data center.

3. We do not see any read request timeouts or exceptions in our API
Splunk logs; only the p99 and average latency go up suddenly.

4. We have investigated CPU usage, disk I/O, memory usage, and network
parameters for the nodes during this period, and we do not see any sudden
surge in any of them.

5. We set up a client using WhiteListPolicy to send queries to each of the
6 nodes to understand which one is bad, but we see all of them responding
with very high latency. The issue doesn't happen during our peak traffic
period; it happens sometime in the night.

6. We checked the system.log files on our nodes, took a thread dump, and
checked for any rogue processes running on the nodes that might be stealing
CPU, but we found nothing.

7. We even checked the write requests coming in during this time, and we
do not see any large batch operations happening.

8. Initially we tried restarting the nodes to see if the issue could be
mitigated, but it kept happening, and we had to fail over API traffic to
the us-west-2 region OLTP data center. After a couple of hours we failed
back and everything seemed to be working.

We are baffled by this behavior; the only correlation we find is the "Native
requests pending" count in our task queues when this happens.

Please let us know your suggestions on how to debug this issue. Has anyone
experienced an issue like this before? (We have had issues where one node
starts acting badly due to bad EBS volume I/O read and write times, but all
nodes experiencing an issue at the same time is very peculiar.)

Thank You,
Bill Walters.

Re: Elevated response times from all nodes in a data center at the same time.

Posted by Reid Pinchback <rp...@tripadvisor.com>.
Something else came to mind.

You’re on AWS.  You always have to keep noisy neighbor problems in the back of your mind when you aren’t running on bare metal.

Basically, either your usage pattern during these incidents is unchanged… or it is not unchanged.  If it is unchanged, and the problem happens, then something outside your usage is causing the problem.  Hence the noisy neighbor suspicion.  If the usage has changed during that time period, then all bets are off and you need to tune your cluster to be happy under more than one usage profile.
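One way to sanity-check the noisy neighbor theory during an incident (a sketch; assumes the sysstat tools are installed on the nodes):

- mpstat -P ALL 1   # sustained non-zero %steal while latency is elevated points at hypervisor/neighbor contention
- sar -u 1 30       # same idea over a short window; compare %steal and %iowait against a quiet period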


Re: Elevated response times from all nodes in a data center at the same time.

Posted by Jon Haddad <jo...@jonhaddad.com>.
It's possible the queries you're normally running are served out of page
cache, and during the latency spike you're hitting your disks. If you're
using read ahead you might be hitting a throughput limit on the disks.
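A couple of quick checks for that (a sketch; /dev/xvdf is only an example device name and will differ per node):

- blockdev --getra /dev/xvdf   # readahead in 512-byte sectors; a large value multiplies the bytes read from disk per request
- iostat -xm 1                 # during a spike, compare the data volume's read MB/s against the gp2 throughput cap, not just its IOPS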

I've got some numbers and graphs I can share later when I'm not on my
phone.

Jon


Re: Elevated response times from all nodes in a data center at the same time.

Posted by Alain RODRIGUEZ <ar...@gmail.com>.
Hello Bill,

I think it might be worth it to focus the effort on diagnosing the issue
properly first, so I'll try to guide you through this.

First some comments on your environment:

AWS Regions: us-east-1 and us-west-2. Deployed over 3 availability zones in
> each region.
> No of Nodes: 24
> Data Centers: 4 (6 nodes in each data center, 2 OLTP Data centers for APIs
> and 2 OLAP Data centers for Analytics and Batch loads)
> Instance Types: r5.8xlarge
> Average Node Size: 182 GB
> Work Load: Read heavy
>

When I read this, I think 'Tune the garbage collection properly!'. Do you
see GC pauses being a problem? The easiest way to interpret GC logs is
probably to upload them to https://gceasy.io. Ensure there that the 'GC
throughput' is at least around 95% (ideally 98+%); a throughput of 95 to 98%
already means each node spends 2 to 5% of its time doing stop-the-world GCs.
If that's an issue, we can probably help you set the GC options a bit better
than they are currently. This post would then be a good starting point:
https://thelastpickle.com/blog/2018/04/11/gc-tuning.html
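If uploading full GC logs is a hassle, a quick first pass is to look at the GCInspector lines Cassandra already writes (a sketch; the path assumes a default package install):

- grep GCInspector /var/log/cassandra/system.log | tail -n 20   # pauses longer than a few hundred ms are logged here
- grep -c GCInspector /var/log/cassandra/system.log             # rough count; compare a bad period against a good one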


> Read TPS: 22k
> Cassandra version: 3.0.15
>

Reading this, I'd recommend an upgrade to the latest 3.0.x (3.0.18 at the
time of writing) or, personal preference, 3.11.4. There have been a bunch of
fixes; maybe you are hitting something that was already fixed. Check the
changes there and see if any of them could do some good for your use case:
https://github.com/apache/cassandra/blob/cassandra-3.0.18/CHANGES.txt#L1-L124

Java Version: JDK 8u181.
> EBS Volumes: GP2 with 1 TB, 3000 IOPS.
>

The GP2 IOPS depend on the disk size (GP2 gives a baseline of 3 IOPS per
GiB, which is where your 1 TB / 3000 IOPS comes from). If you find out at
any time that the disks are not keeping up, a good way out could be to
increase the disk size (despite the small dataset) to actually increase the
disk IOPS and throughput. Now, this did not change recently, and it was
working for you until now, so you probably don't have to increase the disk
size right away. Just be aware that 1 TB GP2 volumes are quite slow.

About the issue:

our p99 latency in our AWS us-east-1 region OLTP data center, suddenly
> starts rising from 2 ms to 200 ms. It starts with one node where we see the
> 99th percentile Read Request latency in Datastax Opscenter starts
> increasing. And it spreads immediately to all the other nodes in the data
> center.
>

Well, this sounds quite bad. The first 2 things coming to my mind here are:
- Are you reading tombstones? (Check the logs for tombstone warnings, trace
a few queries.)
- Are you reading a huge partition? (Check the max partition size, compare
it to the mean, and ensure it remains below 1 GB, or ideally even below
100 MB; see the quick check just below.)
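A quick check for the partition size point on a suspect table (a sketch; my_ks and my_table are placeholders):

- nodetool tablestats my_ks.my_table | grep -i -e partition -e tombstones   # compacted partition max/mean bytes and tombstones per slice
- nodetool tablehistograms my_ks my_table                                   # partition size and cell count percentiles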

An inefficient read, for the reasons above or others, would not necessarily
show up in the nodes' resource metrics, but it can definitely destroy
performance for that query and the following ones due to request congestion.

To get a sense of the current tombstone level you can look at:
- the logs (grep tombstone)
- sstablemetadata, which gives you the estimated % of droppable tombstones.
This is an estimate of the space that could be freed; it says nothing about
whether tombstones are actually being read and hurting performance, but it
gives you an idea of how many tombstones the workflow generates
- traced queries: either trace a manual query from cqlsh with 'TRACING ON;'
and then send queries similar to the production ones, or directly use
'nodetool settraceprobability X'. /!\ Ensure X is really low to start with -
like 0.001 or 0.0001 maybe - as we probably don't need many queries to
understand what is happening, and tracing can inflict a big performance
penalty on the Cassandra servers (each traced query induces a bunch of extra
queries to persist the trace in the system_traces keyspace). A sketch of the
settraceprobability route follows.
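Roughly (a sketch; the probability and LIMIT are just example values, and <session_id> must be filled in from the first query):

- nodetool settraceprobability 0.001   # trace ~0.1% of requests coordinated by this node
- cqlsh -e "SELECT session_id, duration, request FROM system_traces.sessions LIMIT 20;"   # after a few minutes, look for slow sessions
- cqlsh -e "SELECT activity, source_elapsed FROM system_traces.events WHERE session_id = <session_id>;"   # per-step timings; tombstone scans show up here
- nodetool settraceprobability 0   # turn tracing back off when done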

We do not see any read request timeouts or exceptions in our API Splunk
> logs; only the p99 and average latency go up suddenly.
>

What value do you use for timeouts? Also, do you see any other
exceptions/timeouts anywhere else than for reads?
What are the results of:

- nodetool tablestats (I think this would gather what you need to check -->
nodetool tablestats | grep -e Keyspace -e Table: -e latency -e partition -e
tombstones)
- watch -d nodetool tpstats (here, look for any pending threads constantly
higher than 0, and any blocked or dropped threads)

We have investigated CPU level usage, Disk I/O, Memory usage and Network
> parameters for the nodes during this period and we are not experiencing any
> sudden surge in these parameters.


If the resources are fine, then there is a bottleneck within Cassandra that
we need to find. The commands above aim at exactly that: finding C*'s
bottleneck, assuming the machines can handle more load.

We setup client using WhiteListPolicy to send queries to each of the 6
> nodes to understand which one is bad, but we see all of them responding
> with very high latency. The issue doesn't happen during our peak traffic
> period; it happens sometime in the night.


 This brings something else to my mind. The fact that latency goes down when
traffic increases can simply mean that the queries sent during the peak are
really efficient: even if some other queries are still slow (even during
peak hours), having that many 'efficient/quick' requests lowers the
average/pXX latencies. Does the max latency change over time?

You can try to get a sense of this with:

- nodetool proxyhistograms
- nodetool tablehistograms <keyspace> <table> # For the table level stats

We checked the system.log files on our nodes, took a thread dump and
> checked for any rogue processes running on the nodes that might be stealing
> CPU, but we found nothing.


From what I read/understand, resources are fine, so I would put these
searches aside for now. For the log files, I like to use:

- tail -fn 100 /var/log/cassandra/system.log # see current logs (if you are
having the issue NOW)
- grep -e "WARN" -e "ERROR" /var/log/cassandra/system.log # to check what
happened and what went wrong

For now I can't think of anything else. I hope some of these ideas will help
you diagnose the problem; once it is diagnosed, we should be able to reason
about how to fix it.

C*heers,
-----------------------
Alain Rodriguez - alain@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com


Re: Elevated response times from all nodes in a data center at the same time.

Posted by Reid Pinchback <rp...@tripadvisor.com>.
I’d look to see if you have compactions fronting the p99’s.  If so, then go back to looking at the I/O.  Disbelieve any metrics not captured at a high resolution for a time window around the compactions, like 100ms.  You could be hitting I/O stalls where reads are blocked by the flushing of writes.  It’s short-lived when it happens, and per-minute metrics won’t provide breadcrumbs.
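A rough way to catch that kind of short-lived stall (a sketch; assumes sysstat on the nodes and second-level sampling during an incident):

- watch -n 1 nodetool compactionstats   # is a compaction running exactly when the p99 spikes?
- iostat -xm 1                          # 1-second samples; read await climbing while write MB/s is high suggests reads queued behind flushed writes
- nodetool tpstats | grep -i flush      # a backed-up MemtableFlushWriter pool is another hint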
