Posted to user@cassandra.apache.org by Oleksandr Shulgin <ol...@zalando.de> on 2018/12/05 17:38:25 UTC

Sporadic high IO bandwidth and Linux OOM killer

Hello,

We are running the following setup on AWS EC2:

Host system (AWS AMI): Ubuntu 14.04.4 LTS,
Linux <hostname> 4.4.0-138-generic #164~14.04.1-Ubuntu SMP Fri Oct 5
08:56:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Cassandra process runs inside a docker container.
Docker image is based on Ubuntu 18.04.1 LTS.

Apache Cassandra 3.0.17, installed from .deb packages.

$ java -version
openjdk version "1.8.0_181"
OpenJDK Runtime Environment (build 1.8.0_181-8u181-b13-1ubuntu0.18.04.1-b13)
OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)

We have a total of 36 nodes.  All are r4.large instances, they have 2 vCPUs
and ~15 GB RAM.
On each instance we have:
- 2TB gp2 SSD EBS volume for data and commit log,
- 8GB gp2 SSD EBS for system (root volume).

Non-default settings in cassandra.yaml:
num_tokens: 16
memtable_flush_writers: 1
concurrent_compactors: 1
snitch: Ec2Snitch

JVM heap/stack size options: -Xms8G -Xmx8G -Xmn800M -Xss256k
Garbage collection: CMS with default settings.

We repair once a week using Cassandra Reaper: parallel, intensity 1, 64
segments per node.  The issue also happens outside of repair time.

The symptoms:
============

Sporadically a node becomes unavailable for a period of time between a few
minutes and a few hours.  According to our analysis, and as pointed out by
the AWS support team, the unavailability is caused by exceptionally high read
bandwidth on the *root* EBS volume.  I repeat, on the root volume, *not* on
the data/commitlog volume.  Basically, the amount of IO exceeds the instance's
bandwidth (~52MB/s) and all other network communication becomes impossible.
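
For reference, something like the following is enough to confirm which device
and which process is doing the reads, when one manages to get a shell in time
(a minimal sketch; it assumes the sysstat package is installed and that the
root volume shows up as xvda, which may differ on your instances):

$ iostat -xm 5                # per-device throughput; watch the xvda (root) line
$ sudo pidstat -d 5           # per-process disk IO, to see that it is the java process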

The root volume contains the operating system, the docker container with
OpenJDK and Cassandra binaries, and the logs.

Most of the time when this happens it is too late to SSH into the
instance to troubleshoot: it becomes completely unavailable within a very
short period of time.
Rebooting the affected instance helps to bring it back to life.

Starting from the middle of last week we have seen this problem repeatedly
1-3 times a day, affecting different instances in a seemingly random
fashion.  Most of the time it affects only one instance, but we've had one
incident when 9 nodes (3 from each of the 3 Availability Zones) were down
at the same time due to this exact issue.

Actually, we've had the same issue on this Cassandra cluster before,
around 3 months ago (beginning to mid-September 2018).  At that time we
were running on m4.xlarge instances (these have 4 vCPUs and 16GB RAM).

As a mitigation measure we migrated away from those to r4.2xlarge.
We then didn't observe any issues for a few weeks, so we scaled down
twice: to r4.xlarge and then to r4.large.  The last migration was
completed before Nov 13th.  No changes to the cluster or application
have happened since that time.

Now, after some weeks the issue appears again...

When we are not fast enough to react and reboot the affected instance, we
can see that ultimately the Linux OOM killer kicks in and kills the java
process running Cassandra.  After that the instance becomes available
almost immediately.  This allows us to rule out other background processes
as potential offenders.

We routinely observe Memory.HeapMemoryUsage.used between 1GB and 6GB
and Memory.NonHeapMemoryUsage.used below 100MB, as reported by JMX (via
Jolokia).  At the same time, Committed_AS on each host is constantly around
11-12GB, as reported by atop(1) and prometheus.
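
In case it helps to reproduce our numbers, here is a sketch of how such values
can be sampled, assuming a Jolokia agent listening on port 8778 and jq being
installed (both are assumptions about the local setup):

$ grep Committed_AS /proc/meminfo
$ curl -s http://localhost:8778/jolokia/read/java.lang:type=Memory | \
    jq '.value.HeapMemoryUsage.used, .value.NonHeapMemoryUsage.used'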

We are running atop with a sampling interval of 60 seconds.  After the fact
we observe that the java process is the one responsible for most of the
disk activity during the unavailability period.  We also see kswapd0 high on
the list from time to time, which always has 0K reads, but non-zero write
bandwidth.  There is no swap space defined on these instances, so it is not
really clear why kswapd appears at the top of the list at all (measurement
error?).

We also attempted to troubleshoot by running jstack, jmap and pmap against
the Cassandra process in the background every few minutes.  The idea was to
compare dumps taken before and during unavailability, but that hasn't led to
any findings so far.  Ultimately we had to stop doing this once we saw
that jmap can also become stuck burning CPU cycles.  The output of jmap is
not that useful anyway, but we fear that jstack might expose the same
behavior.  So we wanted to avoid making the issue worse than it currently
is and disabled this debug sampling.
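
(For the curious: the kind of loop we mean is sketched below; not the exact
script, the interval and output directory are arbitrary, and jmap is left out
for the reason above.)

$ PID=$(pgrep -f CassandraDaemon)
$ while true; do
    ts=$(date +%Y%m%dT%H%M%S)
    pmap -x $PID  > /var/tmp/pmap.$ts
    jstack  $PID  > /var/tmp/jstack.$ts
    sleep 600
  done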

Now to my questions:

1. Is there anything in Cassandra or in the JVM that could explain suddenly
starting to read from the non-data volume at such a high rate, for prolonged
periods of time, as described above?

2. Why does JVM heap utilization never reach the 8GB that we provide to
it?

3. Why is committed virtual memory so much bigger than the sum of heap and
off-heap memory reported by JMX?  To what can this difference be attributed?
I've just visited a node at random and collected the "off heap memory used"
numbers reported by nodetool cfstats, and still I see only 2.6GB in total,
while Committed_AS is ~12GB.  Is there a more direct way to monitor off-heap
memory usage by the JVM?

4. The only Jira issue related to Linux OOM that we've found is this one:
https://issues.apache.org/jira/browse/CASSANDRA-13931  This might be
related to our OOM, but it still doesn't explain the unexpected IO anomalies.

I would really appreciate any hints / pointers / insights!  The more I
learn about this issue, the less I understand it...

Regards,
--
Alex

Re: Sporadic high IO bandwidth and Linux OOM killer

Posted by Oleksandr Shulgin <ol...@zalando.de>.
On Wed, 5 Dec 2018, 19:53 Jonathan Haddad <jon@jonhaddad.com> wrote:

> Seeing high kswapd usage means there's a lot of churn in the page cache.
> It doesn't mean you're using swap, it means the box is spending time
> clearing pages out of the page cache to make room for the stuff you're
> reading now.
>

Jon,

Thanks for your thoughts!

> The machines don't have enough memory - they are way undersized for a
> production workload.
>

Well, they had been doing fine since around February this year. The issue
started to appear out of the blue.

> Things that make it worse:
> * high readahead (use 8kb on ssd)
> * high compression chunk length when reading small rows / partitions.
> Nobody specifies this, 64KB by default is awful.  I almost always switch to
> 4KB-16KB here but on these boxes you're kind of screwed since you're
> already basically out of memory.
>

That's interesting, even though from my understanding Cassandra is mostly
doing sequential IO.  But I'm not sure this is really relevant to the issue
at hand, as the reading is done from the root device.

What could it be reading from there?  After the JVM has started up and the
config file is parsed, I really don't see why it should read anything
additional.  Or am I missing something?

To make it clear: normally the root EBS volume on these nodes is doing at most
10 reads per second.  When the issue starts, reads per second jump to hundreds
within a few minutes (sometimes there's a preceding period of slow build-up,
but in the end it's really exponential).

I'd never put Cassandra in production with less than 30GB ram and 8 cores
> per box.
>

We had to tweak the heap size once we started to run repairs, because the
default heuristic aimed too low for us. Otherwise, as I've said, we've seen
zero problems with our workload.

Cheers,
--
Alex

Re: Sporadic high IO bandwidth and Linux OOM killer

Posted by Jonathan Haddad <jo...@jonhaddad.com>.
Seeing high kswapd usage means there's a lot of churn in the page cache.
It doesn't mean you're using swap, it means the box is spending time
clearing pages out of the page cache to make room for the stuff you're
reading now.  The machines don't have enough memory - they are way
undersized for a production workload.

Things that make it worse:
* high readahead (use 8kb on ssd)
* high compression chunk length when reading small rows / partitions.
Nobody specifies this, 64KB by default is awful.  I almost always switch to
4KB-16KB here but on these boxes you're kind of screwed since you're
already basically out of memory.

I'd never put Cassandra in production with less than 30GB ram and 8 cores
per box.
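
Checking / changing readahead is a one-liner (a sketch; /dev/xvdf is a guess at
your data volume's device name, and blockdev counts in 512-byte sectors, so 16
sectors = 8KB):

$ blockdev --getra /dev/xvdf           # current readahead in 512-byte sectors
$ sudo blockdev --setra 16 /dev/xvdf   # 16 * 512 bytes = 8KB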

On Wed, Dec 5, 2018 at 10:44 AM Jon Meredith <jm...@gmail.com> wrote:

> The kswapd issue is interesting, is it possible you're being affected by
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1518457 - although I
> don't see a fix for Trusty listed on there?
>
> On Wed, Dec 5, 2018 at 11:34 AM Riccardo Ferrari <fe...@gmail.com>
> wrote:
>
>> Hi Alex,
>>
>> I saw that behaviout in the past. I can tell you the kswapd0 usage is
>> connected to the `disk_access_mode` property. On 64bit systems defaults to
>> mmap. That also explains why your virtual memory is so high (it somehow
>> matches the node load, right?). I can not find and good reference however
>> googling for "kswapd0 cassandra" you'll find quite some resources.
>>
>> HTH,
>>
>> On Wed, Dec 5, 2018 at 6:39 PM Oleksandr Shulgin <
>> oleksandr.shulgin@zalando.de> wrote:
>>
>>> [original message quoted in full; trimmed]

-- 
Jon Haddad
http://www.rustyrazorblade.com
twitter: rustyrazorblade

Re: Sporadic high IO bandwidth and Linux OOM killer

Posted by Jon Meredith <jm...@gmail.com>.
The kswapd issue is interesting, is it possible you're being affected by
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1518457 - although I
don't see a fix for Trusty listed on there?

On Wed, Dec 5, 2018 at 11:34 AM Riccardo Ferrari <fe...@gmail.com> wrote:

> Hi Alex,
>
> I saw that behaviour in the past. I can tell you the kswapd0 usage is
> connected to the `disk_access_mode` property. On 64-bit systems it defaults
> to mmap. That also explains why your virtual memory is so high (it somehow
> matches the node load, right?). I cannot find a good reference, however
> googling for "kswapd0 cassandra" you'll find quite some resources.
>
> HTH,
>
> On Wed, Dec 5, 2018 at 6:39 PM Oleksandr Shulgin <
> oleksandr.shulgin@zalando.de> wrote:
>
>> [original message quoted in full; trimmed]

Re: Sporadic high IO bandwidth and Linux OOM killer

Posted by Yuri de Wit <yd...@gmail.com>.
We


On Fri, Dec 28, 2018, 4:23 PM Oleksandr Shulgin <
oleksandr.shulgin@zalando.de> wrote:

> On Fri, Dec 7, 2018 at 12:43 PM Oleksandr Shulgin <
> oleksandr.shulgin@zalando.de> wrote:
>
>>
>> After a fresh JVM start the memory allocation looks roughly like this:
>>
>>              total       used       free     shared    buffers     cached
>> Mem:           14G        14G       173M       1.1M        12M       3.2G
>> -/+ buffers/cache:        11G       3.4G
>> Swap:           0B         0B         0B
>>
>> Then, within a number of days, the allocated disk cache shrinks all the
>> way down to unreasonable numbers like only 150M.  At the same time "free"
>> stays at the original level and "used" grows all the way up to 14G.
>> Shortly after that the node becomes unavailable because of the IO and
>> ultimately after some time the JVM gets killed.
>>
>> Most importantly, the resident size of JVM process stays at around 11-12G
>> all the time, like it was shortly after the start.  How can we find where
>> the rest of the memory gets allocated?  Is it just some sort of malloc
>> fragmentation?
>>
>
> For the ones following along at home, here's what we ended up with so far:
>
> 0. Switched to the next biggest EC2 instance type, r4.xlarge: and the
> symptoms are gone.  Our bill is dominated by the price EBS storage, so this
> is much less than 2x increase in total.
>
> 1. We've noticed that increased memory usage correlates with the number of
> SSTables on disk.  When the number of files on disk decreases, available
> memory increases.  This leads us to think that extra memory allocation is
> indeed due to use of mmap.  Not clear how we could account for that.
>
> 2. Improved our monitoring to include number of files (via total - free
> inodes).
>
> Given the cluster's resource utilization, it still feels like r4.large
> would be a good fit, if only we could figure out those few "missing" GB of
> RAM. ;-)
>
> Cheers!
> --
> Alex
>
>

Re: Sporadic high IO bandwidth and Linux OOM killer

Posted by Jeff Jirsa <jj...@gmail.com>.
I’ve lost some context but there are two direct memory allocations per sstable - compression offsets and the bloom filter. Both of those get built during sstable creation and the bloom filter’s size is aggressively allocated , so you’ll see a big chunk of memory disappear as compaction kicks off based on the estimated number of keys.

Are you sure that’s not what you’re seeing? If it is, dropping bloom filter FP ratio or increasing compression chunk size may help (and probably saves you some disk, you’ll get better ratios but slightly slower by increasing that chunk size)
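
If you go that route it's a per-table change, roughly like this (a sketch; the
keyspace/table names and the values are placeholders to adjust for your tables):

$ cqlsh -e "ALTER TABLE my_ks.my_table
              WITH bloom_filter_fp_chance = 0.1
               AND compression = {'class': 'LZ4Compressor', 'chunk_length_in_kb': 256};"
$ nodetool upgradesstables -a my_ks my_table   # rewrite existing sstables so the new settings apply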

-- 
Jeff Jirsa


> On Dec 28, 2018, at 1:22 PM, Oleksandr Shulgin <ol...@zalando.de> wrote:
> 
>> On Fri, Dec 7, 2018 at 12:43 PM Oleksandr Shulgin <ol...@zalando.de> wrote:
>> 
>> After a fresh JVM start the memory allocation looks roughly like this:
>> 
>>              total       used       free     shared    buffers     cached
>> Mem:           14G        14G       173M       1.1M        12M       3.2G
>> -/+ buffers/cache:        11G       3.4G
>> Swap:           0B         0B         0B
>> 
>> Then, within a number of days, the allocated disk cache shrinks all the way down to unreasonable numbers like only 150M.  At the same time "free" stays at the original level and "used" grows all the way up to 14G.  Shortly after that the node becomes unavailable because of the IO and ultimately after some time the JVM gets killed.
>> 
>> Most importantly, the resident size of JVM process stays at around 11-12G all the time, like it was shortly after the start.  How can we find where the rest of the memory gets allocated?  Is it just some sort of malloc fragmentation?
> 
> For the ones following along at home, here's what we ended up with so far:
> 
> 0. Switched to the next biggest EC2 instance type, r4.xlarge: and the symptoms are gone.  Our bill is dominated by the price EBS storage, so this is much less than 2x increase in total.
> 
> 1. We've noticed that increased memory usage correlates with the number of SSTables on disk.  When the number of files on disk decreases, available memory increases.  This leads us to think that extra memory allocation is indeed due to use of mmap.  Not clear how we could account for that.
> 
> 2. Improved our monitoring to include number of files (via total - free inodes).
> 
> Given the cluster's resource utilization, it still feels like r4.large would be a good fit, if only we could figure out those few "missing" GB of RAM. ;-)
> 
> Cheers!
> --
> Alex
> 

Re: Sporadic high IO bandwidth and Linux OOM killer

Posted by Oleksandr Shulgin <ol...@zalando.de>.
On Fri, Dec 7, 2018 at 12:43 PM Oleksandr Shulgin <
oleksandr.shulgin@zalando.de> wrote:

>
> After a fresh JVM start the memory allocation looks roughly like this:
>
>              total       used       free     shared    buffers     cached
> Mem:           14G        14G       173M       1.1M        12M       3.2G
> -/+ buffers/cache:        11G       3.4G
> Swap:           0B         0B         0B
>
> Then, within a number of days, the allocated disk cache shrinks all the
> way down to unreasonable numbers like only 150M.  At the same time "free"
> stays at the original level and "used" grows all the way up to 14G.
> Shortly after that the node becomes unavailable because of the IO and
> ultimately after some time the JVM gets killed.
>
> Most importantly, the resident size of JVM process stays at around 11-12G
> all the time, like it was shortly after the start.  How can we find where
> the rest of the memory gets allocated?  Is it just some sort of malloc
> fragmentation?
>

For those following along at home, here's what we ended up with so far:

0. Switched to the next biggest EC2 instance type, r4.xlarge: the
symptoms are gone.  Our bill is dominated by the price of EBS storage, so this
is much less than a 2x increase in total.

1. We've noticed that increased memory usage correlates with the number of
SSTables on disk.  When the number of files on disk decreases, available
memory increases.  This leads us to think that the extra memory allocation is
indeed due to the use of mmap.  It is not clear how we could account for that.

2. Improved our monitoring to include the number of files (via total minus
free inodes).

Given the cluster's resource utilization, it still feels like r4.large
would be a good fit, if only we could figure out those few "missing" GB of
RAM. ;-)
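
One way to cross-check the correlation on a node (a sketch; the data directory
path and the process match pattern are assumptions to adjust for your setup):

$ find /var/lib/cassandra/data -name '*-Data.db' | wc -l   # number of sstable data files
$ PID=$(pgrep -f CassandraDaemon)
$ awk '/Data\.db/ {f=1; next} f && /^Size:/ {sum+=$2; f=0}
       END {print sum/1024/1024 " GB of Data.db files mapped"}' /proc/$PID/smaps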

Cheers!
--
Alex

Re: Sporadic high IO bandwidth and Linux OOM killer

Posted by Oleksandr Shulgin <ol...@zalando.de>.
On Thu, Dec 6, 2018 at 3:39 PM Riccardo Ferrari <fe...@gmail.com> wrote:

> To be honest I've never seen the OOM in action on those instances. My Xmx
> was 8GB just like yours, and that makes me think you have some other process
> competing for memory, is there? Do you have any cron, any backup, anything
> that can trick the OOM killer?
>

Riccardo,

As I've mentioned previously, apart from docker running Cassandra on the JVM,
there is a small number of housekeeping processes, namely cron to trigger
log rotation, a log-shipping agent, a node metrics exporter (prometheus) and
some other small things.  None of those come close to Cassandra in their
memory requirements, and they are routinely pretty low in the memory usage
reports from atop and similar tools.  The overhead of these seems to
be minimal.

> My unresponsiveness was seconds long. This is/was bad because the gossip
> protocol was going crazy marking nodes down, with all the consequences this
> can lead to in a distributed system: think about hints, dynamic snitch, and
> whatever depends on node availability ...
> Can you share some numbers about your `tpstats` or system load in general?
>

Here's some pretty typical tpstats output from one of the nodes:

Pool Name                    Active   Pending      Completed   Blocked  All time blocked
MutationStage                     0         0      319319724         0                 0
ViewMutationStage                 0         0              0         0                 0
ReadStage                         0         0       80006984         0                 0
RequestResponseStage              0         0      258548356         0                 0
ReadRepairStage                   0         0        2707455         0                 0
CounterMutationStage              0         0              0         0                 0
MiscStage                         0         0              0         0                 0
CompactionExecutor                1        55        1552918         0                 0
MemtableReclaimMemory             0         0           4042         0                 0
PendingRangeCalculator            0         0            111         0                 0
GossipStage                       0         0        6343859         0                 0
SecondaryIndexManagement          0         0              0         0                 0
HintsDispatcher                   0         0            226         0                 0
MigrationStage                    0         0              0         0                 0
MemtablePostFlush                 0         0           4046         0                 0
ValidationExecutor                1         1           1510         0                 0
Sampler                           0         0              0         0                 0
MemtableFlushWriter               0         0           4042         0                 0
InternalResponseStage             0         0           5890         0                 0
AntiEntropyStage                  0         0           5532         0                 0
CacheCleanupExecutor              0         0              0         0                 0
Repair#250                        1         1              1         0                 0
Native-Transport-Requests         2         0      260447405         0                18

Message type           Dropped
READ                         0
RANGE_SLICE                  0
_TRACE                       0
HINT                         0
MUTATION                     1
COUNTER_MUTATION             0
BATCH_STORE                  0
BATCH_REMOVE                 0
REQUEST_RESPONSE             0
PAGED_RANGE                  0
READ_REPAIR                  0

Speaking of CPU utilization, it is consistently within 30-60% on all nodes
(and even less at night).


> On the tuning side I just went through the following article:
> https://docs.datastax.com/en/dse/5.1/dse-admin/datastax_enterprise/config/configRecommendedSettings.html
>
> No rollbacks, just moving forward! Right now we are upgrading the instance
> size to something more recent than m1.xlarge (for many different reasons,
> including security, ECU and network). Nevertheless, it might be a good idea
> to upgrade to the 3.X branch to leverage the better off-heap memory
> management.
>

One thing we have noticed very recently is that our nodes are indeed
running low on memory.  It even seems now that the IO is a side effect of
the impending OOM, not the other way around, as we had thought initially.

After a fresh JVM start the memory allocation looks roughly like this:

             total       used       free     shared    buffers     cached
Mem:           14G        14G       173M       1.1M        12M       3.2G
-/+ buffers/cache:        11G       3.4G
Swap:           0B         0B         0B

Then, within a number of days, the allocated disk cache shrinks all the way
down to unreasonable numbers like only 150M.  At the same time "free" stays
at the original level and "used" grows all the way up to 14G.  Shortly
after that the node becomes unavailable because of the IO and ultimately
after some time the JVM gets killed.

Most importantly, the resident size of JVM process stays at around 11-12G
all the time, like it was shortly after the start.  How can we find where
the rest of the memory gets allocated?  Is it just some sort of malloc
fragmentation?
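
One thing that at least splits the resident set into anonymous memory (heap,
malloc, direct buffers) versus file-backed memory (mostly the mmapped sstables)
is /proc/<pid>/smaps; a rough sketch of what we have in mind:

$ PID=$(pgrep -f CassandraDaemon)
$ awk '/^Rss:/ {rss+=$2} /^Anonymous:/ {anon+=$2}
       END {printf "rss %.1f GB, anonymous %.1f GB, file-backed %.1f GB\n",
            rss/1048576, anon/1048576, (rss-anon)/1048576}' /proc/$PID/smaps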

As we are running a relatively recent version of JDK, we've tried to use
the option -Djdk.nio.maxCachedBufferSize=262144 on one of the nodes, as
suggested in this issue:
https://issues.apache.org/jira/browse/CASSANDRA-13931
But we didn't see any improvement.  Also, the expectation is that if this were
the issue in the first place, the resident size of the JVM process would grow
at the same rate as the available memory shrinks, correct?
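
For what it's worth, the flag can be double-checked on the running process, and
the NIO direct buffer pool watched over JMX; a sketch, with the Jolokia port
again being an assumption of our setup:

$ ps -o args= -p $(pgrep -f CassandraDaemon) | tr ' ' '\n' | grep maxCachedBufferSize
$ curl -s 'http://localhost:8778/jolokia/read/java.nio:type=BufferPool,name=direct' | \
    jq '.value.MemoryUsed, .value.Count'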

Another thing we haven't found the answer to so far is why, within the JVM,
heap.used (<= 6GB) never reaches heap.committed = 8GB.  Any ideas?

Regards,
--
Alex

Re: Sporadic high IO bandwidth and Linux OOM killer

Posted by Riccardo Ferrari <fe...@gmail.com>.
Hi,

To be honest I've never seen the OOM in action on those instances. My Xmx
was 8GB just like yours, and that makes me think you have some other process
competing for memory, is there? Do you have any cron, any backup, anything
that can trick the OOM killer?

My unresponsiveness was seconds long. This is/was bad because the gossip
protocol was going crazy marking nodes down, with all the consequences this
can lead to in a distributed system: think about hints, dynamic snitch, and
whatever depends on node availability ...
Can you share some numbers about your `tpstats` or system load in general?

On the tuning side I just went through the following article:
https://docs.datastax.com/en/dse/5.1/dse-admin/datastax_enterprise/config/configRecommendedSettings.html

No rollbacks, just moving forward! Right now we are upgrading the instance
size to something more recent than m1.xlarge (for many different reasons,
including security, ECU and network). Nevertheless, it might be a good idea
to upgrade to the 3.X branch to leverage the better off-heap memory
management.

Best,


On Thu, Dec 6, 2018 at 2:33 PM Oleksandr Shulgin <
oleksandr.shulgin@zalando.de> wrote:

> On Thu, Dec 6, 2018 at 11:14 AM Riccardo Ferrari <fe...@gmail.com>
> wrote:
>
>>
>> I had few instances in the past that were showing that unresponsivveness
>> behaviour. Back then I saw with iotop/htop/dstat ... the system was stuck
>> on a single thread processing (full throttle) for seconds. According to
>> iotop that was the kswapd0 process. That system was an ubuntu 16.04
>> actually "Ubuntu 16.04.4 LTS".
>>
>
> Riccardo,
>
> Did you by chance also observe Linux OOM?  How long did the
> unresponsiveness last in your case?
>
> From there I started to dig what kswap process was involved in a system
>> with no swap and found that is used for mmapping. This erratic (allow me to
>> say erratic) behaviour was not showing up when I was on 3.0.6 but started
>> to right after upgrading to 3.0.17.
>>
>> By "load" I refer to the load as reported by the `nodetool status`. On my
>> systems, when disk_access_mode is auto (read mmap), it is the sum of the
>> node load plus the jmv heap size. Of course this is just what I noted on my
>> systems not really sure if that should be the case on yours too.
>>
>
> I've checked and indeed we are using disk_access_mode=auto (well,
> implicitly because it's not even part of config file anymore):
> DiskAccessMode 'auto' determined to be mmap, indexAccessMode is mmap.
>
> I hope someone with more experience than me will add a comment about your
>> settings. Reading the configuration file, writers and compactors should be
>> 2 at minimum. I can confirm when I tried in the past to change the
>> concurrent_compactors to 1 I had really bad things happenings (high system
>> load, high message drop rate, ...)
>>
>
> As I've mentioned, we did not observe any other issues with the current
> setup: system load is reasonable, no dropped messages, no big number of
> hints, request latencies are OK, no big number of pending compactions.
> Also during repair everything looks fine.
>
> I have the "feeling", when running on constrained hardware the underlaying
>> kernel optimization is a must. I agree with Jonathan H. that you should
>> think about increasing the instance size, CPU and memory mathters a lot.
>>
>
> How did you solve your issue in the end?  You didn't rollback to 3.0.6?
> Did you tune kernel parameters?  Which ones?
>
> Thank you!
> --
> Alex
>
>

Re: Sporadic high IO bandwidth and Linux OOM killer

Posted by Oleksandr Shulgin <ol...@zalando.de>.
On Thu, Dec 6, 2018 at 11:14 AM Riccardo Ferrari <fe...@gmail.com> wrote:

>
> I had a few instances in the past that were showing that unresponsiveness
> behaviour. Back then I saw with iotop/htop/dstat ... that the system was stuck
> on a single thread processing (full throttle) for seconds. According to
> iotop that was the kswapd0 process. That system was an Ubuntu 16.04,
> actually "Ubuntu 16.04.4 LTS".
>

Riccardo,

Did you by chance also observe Linux OOM?  How long did the
unresponsiveness last in your case?

> From there I started to dig into why the kswapd process was involved in a
> system with no swap, and found that it is used for mmapping. This erratic
> (allow me to say erratic) behaviour was not showing up when I was on 3.0.6
> but started right after upgrading to 3.0.17.
>
> By "load" I refer to the load as reported by `nodetool status`. On my
> systems, when disk_access_mode is auto (read: mmap), it is the sum of the
> node load plus the JVM heap size. Of course this is just what I noted on my
> systems; not really sure if that should be the case on yours too.
>

I've checked and indeed we are using disk_access_mode=auto (well,
implicitly, because it's not even part of the config file anymore):
DiskAccessMode 'auto' determined to be mmap, indexAccessMode is mmap.
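
For reference, the check itself; and as far as I understand the property can
still be set explicitly in cassandra.yaml even though the shipped file no
longer lists it (the yaml line below is an untested sketch):

$ grep -i DiskAccessMode /var/log/cassandra/system.log
# in cassandra.yaml, to stop mmapping data files while keeping index files mmapped:
disk_access_mode: mmap_index_only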

> I hope someone with more experience than me will add a comment about your
> settings. Reading the configuration file, writers and compactors should be
> 2 at minimum. I can confirm that when I tried in the past to change
> concurrent_compactors to 1 I had really bad things happening (high system
> load, high message drop rate, ...)
>

As I've mentioned, we did not observe any other issues with the current
setup: system load is reasonable, no dropped messages, no large number of
hints, request latencies are OK, no large number of pending compactions.
Everything looks fine during repair as well.

I have the "feeling", when running on constrained hardware the underlaying
> kernel optimization is a must. I agree with Jonathan H. that you should
> think about increasing the instance size, CPU and memory mathters a lot.
>

How did you solve your issue in the end?  You didn't roll back to 3.0.6?
Did you tune kernel parameters?  Which ones?

Thank you!
--
Alex

Re: Sporadic high IO bandwidth and Linux OOM killer

Posted by Riccardo Ferrari <fe...@gmail.com>.
Alex,

I had a few instances in the past that were showing that unresponsiveness
behaviour. Back then I saw with iotop/htop/dstat ... that the system was stuck
on a single thread processing (full throttle) for seconds. According to
iotop that was the kswapd0 process. That system was an Ubuntu 16.04,
actually "Ubuntu 16.04.4 LTS".

From there I started to dig into why the kswapd process was involved in a
system with no swap, and found that it is used for mmapping. This erratic
(allow me to say erratic) behaviour was not showing up when I was on 3.0.6
but started right after upgrading to 3.0.17.

By "load" I refer to the load as reported by `nodetool status`. On my
systems, when disk_access_mode is auto (read: mmap), it is the sum of the
node load plus the JVM heap size. Of course this is just what I noted on my
systems; not really sure if that should be the case on yours too.

I hope someone with more experience than me will add a comment about your
settings. Reading the configuration file, writers and compactors should be
2 at minimum. I can confirm that when I tried in the past to change
concurrent_compactors to 1 I had really bad things happening (high system
load, high message drop rate, ...)

I have the "feeling" that when running on constrained hardware the underlying
kernel optimization is a must. I agree with Jonathan H. that you should
think about increasing the instance size; CPU and memory matter a lot.

Best,


On Wed, Dec 5, 2018 at 10:36 PM Oleksandr Shulgin <
oleksandr.shulgin@zalando.de> wrote:

> On Wed, 5 Dec 2018, 19:34 Riccardo Ferrari <ferrarir@gmail.com wrote:
>
>> Hi Alex,
>>
>> I saw that behaviout in the past.
>>
>
> Riccardo,
>
> Thank you for the reply!
>
> Do you refer to kswapd issue only or have you observed more problems that
> match behavior I have described?
>
> I can tell you the kswapd0 usage is connected to the `disk_access_mode`
>> property. On 64bit systems defaults to mmap.
>>
>
> Hm, that's interesting, I will double-check.
>
> That also explains why your virtual memory is so high (it somehow matches
>> the node load, right?).
>>
>
> Not sure what do you mean by "load" here. We have a bit less than 1.5TB
> per node on average.
>
> Regards,
> --
> Alex
>
>

Re: Sporadic high IO bandwidth and Linux OOM killer

Posted by Oleksandr Shulgin <ol...@zalando.de>.
On Wed, 5 Dec 2018, 19:34 Riccardo Ferrari <ferrarir@gmail.com> wrote:

> Hi Alex,
>
> I saw that behaviour in the past.
>

Riccardo,

Thank you for the reply!

Do you refer to the kswapd issue only, or have you observed more problems that
match the behavior I have described?

> I can tell you the kswapd0 usage is connected to the `disk_access_mode`
> property. On 64-bit systems it defaults to mmap.
>

Hm, that's interesting, I will double-check.

> That also explains why your virtual memory is so high (it somehow matches
> the node load, right?).
>

Not sure what you mean by "load" here. We have a bit less than 1.5TB per
node on average.

Regards,
--
Alex

Re: Sporadic high IO bandwidth and Linux OOM killer

Posted by Riccardo Ferrari <fe...@gmail.com>.
Hi Alex,

I saw that behaviour in the past. I can tell you the kswapd0 usage is
connected to the `disk_access_mode` property. On 64-bit systems it defaults
to mmap. That also explains why your virtual memory is so high (it somehow
matches the node load, right?). I cannot find a good reference, however
googling for "kswapd0 cassandra" you'll find quite some resources.
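
A quick way to eyeball it (just a sketch; the patterns for grep and pgrep
depend on your setup):

$ nodetool status | grep $(hostname -i)        # the Load column
$ ps -o vsz=,rss= -p $(pgrep -f CassandraDaemon) | \
    awk '{printf "VSZ %.1f GB, RSS %.1f GB\n", $1/1048576, $2/1048576}'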

HTH,

On Wed, Dec 5, 2018 at 6:39 PM Oleksandr Shulgin <
oleksandr.shulgin@zalando.de> wrote:

> [original message quoted in full; trimmed]