Posted to user@cassandra.apache.org by Juho Mäkinen <ju...@gmail.com> on 2016/07/20 16:13:18 UTC

My cluster shows high system load without any apparent reason

I recently upgraded our cluster to 2.2.7, and after putting the cluster
under production load the instances started to show high load (as shown by
uptime) without any apparent reason. I'm not quite sure what could be
causing it.

We are running on i2.4xlarge instances, so we have 16 cores, 120 GB of RAM,
and four 800 GB SSDs (set up as an LVM stripe into one big lvol). Running
3.13.0-87-generic on HVM virtualisation. The cluster has 26 TiB of data
stored in two tables.

Symptoms:
 - High load, sometimes up to 30 for a short duration of a few minutes, then
the load drops back to the cluster average of 3-4.
 - Instances might have one compaction running, or none at all.
 - Each node is serving around 250-300 reads per second and around 200
writes per second.
 - Restarting a node fixes the problem for around 18-24 hours.
 - No or very little IO-wait.
 - top shows that around 3-10 threads are running at high CPU, but that
alone should not cause a load of 20-30.
 - Doesn't seem to be GC load: a system starts to show symptoms even though
it has run only one CMS sweep; it's not doing constant stop-the-world GCs.
 - top shows that the C* process uses 100 GB of RSS memory. I assume this is
because Cassandra opens all SSTables with mmap(), so they show up in the
RSS count.

What I've done so far:
 - Rolling restart. Helped for about one day.
 - Tried doing manual GC to the cluster.
 - Increased heap from 8 GiB with CMS to 16 GiB with G1GC.
 - sjk-plus shows a bunch of SharedPool workers. Not sure what to make of
this.
 - Browsed over
https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html but
didn't find anything apparent.

I know that the general symptom of "system shows high load" is not very
informative, but I don't know how to describe what's going on any better.
I'd appreciate any ideas on what to try and how to debug this further.
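
If it helps to narrow this down, I can map the hot threads that top shows to
Java stack traces roughly like this (a sketch; the PID and thread id below
are made up):

    # list the busiest threads of the Cassandra JVM (assuming its PID is 12345)
    top -H -b -n 1 -p 12345 | head -n 20

    # convert a hot thread id (e.g. 67890) to hex and find it in a thread dump
    printf '%x\n' 67890        # -> 10932
    jstack 12345 | grep -A 20 'nid=0x10932'

That should at least tell whether the busy threads really are the SharedPool
workers that sjk-plus reports.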

 - Garo

Re: My cluster shows high system load without any apparent reason

Posted by Ryan Svihla <rs...@foundev.pro>.
You aren't using counters by chance?

regards,

Ryan Svihla

Re: My cluster shows high system load without any apparent reason

Posted by Mark Rose <ma...@markrose.ca>.
Hi Garo,

I haven't had this issue on SSDs, but I have definitely seen it with
spinning drives. I would think that SSDs would have more than enough
bandwidth to keep up with requests, but you may be running into issues
with Cassandra calling fsync on the commitlog.

What are your settings for the following?

commitlog_sync
commitlog_sync_period_in_ms
commitlog_sync_batch_window_in_ms

If you're using periodic, you could try changing
commitlog_sync_period_in_ms to something smaller like 1000 ms and
seeing if the problem is reduced (the theory is that there would be
less pending data to sync). If you are using batch, switch to
periodic. You could try mounting a GP2 volume and putting the commit
log directory on it and see if the problem goes away (say 200 GB for
sufficient IOPS). I'm guessing you don't have much in the way of
unallocated blocks in your LVM vg.
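
For reference, this is roughly what I mean in cassandra.yaml (the 1000 ms
value is just a starting point to experiment with, not a recommendation):

    commitlog_sync: periodic
    commitlog_sync_period_in_ms: 1000

    # or, if you are currently on batch mode, this is the pair that applies:
    # commitlog_sync: batch
    # commitlog_sync_batch_window_in_ms: 2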

Writing to the commit log is single threaded, and if the commit log is
tied up waiting for IO during an fsync, it will block writes to the
node. If the threads are blocked on writing, the nodes will also
stall for reads. The symptoms you are seeing are exactly the same as
I saw with spinning rust. I'm not sure why you didn't see this problem
with EBS.

-Mark


Re: My cluster shows high system load without any apparent reason

Posted by Juho Mäkinen <ju...@gmail.com>.
Hi Mark.

I have an LVM volume which stripes the four ephemeral SSDs in the system,
and we use it for both data and the commit log. I've used a similar setup in
the past (but with EBS) and we didn't see this behavior. Each node gets just
around 250 writes per second. It is possible that the commit log is the
issue here, but could I somehow measure this from the JMX metrics without
restructuring my entire cluster?
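
I was thinking of sampling something like the commit log metrics with
jmxterm (I'm not sure I have the MBean and attribute names exactly right,
this is just what I'd try first):

    java -jar jmxterm-uber.jar -l localhost:7199 -n <<'EOF'
    get -b org.apache.cassandra.metrics:type=CommitLog,name=PendingTasks Value
    get -b org.apache.cassandra.metrics:type=CommitLog,name=WaitingOnCommit Count
    EOF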

Here's a screenshot of the read latencies from our application's point of
view (the application uses the Cassandra cluster for reads). I started a
rolling restart at around 09:30 and you can clearly see how the latency
dropped. http://imgur.com/a/kaPG7


Re: My cluster shows high system load without any apparent reason

Posted by Mark Rose <ma...@markrose.ca>.
Hi Garo,

Did you put the commit log on its own drive? Spiking CPU during stalls
is a symptom of not doing that. The commitlog is very latency
sensitive, even under low load. Do be sure you're using the deadline
or noop scheduler for that reason, too.
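
A quick way to check the scheduler (device names here are just examples; on
EC2 the ephemeral disks typically show up as xvdb, xvdc, ...):

    cat /sys/block/xvdb/queue/scheduler
    # prints something like: noop [deadline] cfq
    echo deadline > /sys/block/xvdb/queue/scheduler   # as root, per device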

-Mark


Re: My cluster shows high system load without any apparent reason

Posted by Juho Mäkinen <ju...@gmail.com>.
>
> Are you using XFS or Ext4 for data?


We are using XFS. Many nodes have a couple of large SSTables (on the order
of 20-50 GiB), but I haven't cross-checked whether the load spikes happen
only on the machines which have these tables.


> As an aside, for the amount of reads/writes you're doing, I've found
> using c3/m3 instances with the commit log on the ephemeral storage and
> data on st1 EBS volumes to be much more cost effective. It's something
> to look into if you haven't already.
>

Thanks for the idea! I previously used c4.4xlarge instances with two 1500 GB
GP2 volumes, but I found that we maxed out their bandwidth too easily, which
is why my newest cluster is based on i2.4xlarge instances.

And to answer Ryan: No, we are not using counters.

I was wondering whether the large amount (100+ GiB) of mmap'ed files could
somehow cause inefficiencies on the kernel side. That's why I started to
read up on kernel huge pages and came up with the idea of disabling the huge
page defrag, but nothing I've found indicates that this is a real problem.
After all, the Linux fs cache is a really old feature, so I expect it to be
pretty bug-free.

I guess I next have to learn how the load value itself is calculated. I know
the basic idea that when the load is below the number of CPUs the system
should still be fine, and that processes waiting on IO are also counted
towards the load. So since I am not seeing any extensive iowait, and my
userland CPU usage is well below what my 16 cores should handle, what else
contributes to the system load? Can I make any educated guess about what the
high load might be telling me, if it's not iowait and it's not purely
userland process CPU usage? This is starting to get really deep really
fast :/
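
One thing I plan to try next: if I understand correctly, the Linux load
average also counts tasks in uninterruptible sleep (state D), not just
runnable ones, so sampling for D-state tasks during a spike might show what
is piling up. Roughly something like:

    # every second, list tasks stuck in uninterruptible sleep (state D)
    while true; do
      date
      ps -eo state,pid,comm | awk '$1 == "D"'
      sleep 1
    done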

 - Garo




Re: My cluster shows high system load without any apparent reason

Posted by Mark Rose <ma...@markrose.ca>.
Hi Garo,

Are you using XFS or Ext4 for data? XFS is much better at deleting
large files, such as may happen after a compaction. If you have 26 TB
in just two tables, I bet you have some massive sstables which may
take a while for Ext4 to delete, which may be causing the stalls. The
underlying block layers will not show high IO-wait. See if the stall
times line up with large compactions in system.log.
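
A quick way to check the correlation (the log path may differ depending on
how you installed Cassandra):

    grep -i 'Compacted' /var/log/cassandra/system.log | tail -n 50

and see whether those timestamps line up with the load spikes.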

If you must use Ext4, another way to avoid issues with massive
sstables is to run more, smaller instances.

As an aside, for the amount of reads/writes you're doing, I've found
using c3/m3 instances with the commit log on the ephemeral storage and
data on st1 EBS volumes to be much more cost effective. It's something
to look into if you haven't already.

-Mark


Re: My cluster shows high system load without any apparent reason

Posted by Julien Anguenot <ju...@anguenot.org>.
Hey Garo, 

I see you are using 2.2.x. Do you have compression enabled on commit logs by any chance? If so, try to disable it and see if that helps.

Right after migrating from 2.1.x to 2.2.x, I remember having the behavior you described with 10k SAS disks when commit log compression was enabled w/ LZ4.  After disabling compression on commit logs the issue was gone on my side.
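
In cassandra.yaml that means leaving the commitlog_compression block
commented out; the enabled form looks roughly like this (from memory, so
double-check against your own yaml):

    commitlog_compression:
        - class_name: LZ4Compressor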

   J. 

--
Julien Anguenot (@anguenot)



Re: My cluster shows high system load without any apparent reason

Posted by Juho Mäkinen <ju...@gmail.com>.
After a few days I've also tried disabling Linux kernel huge page
defragmentation (echo never > /sys/kernel/mm/transparent_hugepage/defrag)
and turning coalescing off (otc_coalescing_strategy: DISABLED), but neither
did any good. I'm using LCS, there are no big GC pauses, and I have set
"concurrent_compactors: 5" (the machines have 16 CPUs), but there are
usually no compactions running when the load spike comes. "nodetool tpstats"
shows no running thread pools except Native-Transport-Requests
(usually 0-4) and sometimes ReadStage (usually 0-1).
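
For the record, I'm double-checking the huge page settings on the kernel
side like this (the value in brackets is the active one):

    cat /sys/kernel/mm/transparent_hugepage/enabled
    cat /sys/kernel/mm/transparent_hugepage/defrag
    # defrag now shows: always madvise [never]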

The symptoms are the same: after about 12-24 hours an increasing number of
nodes start to show short CPU load spikes, and this affects the median read
latencies. I ran dstat when a load spike was already under way (see
screenshot http://i.imgur.com/B0S5Zki.png), but no column other than the
load itself shows any major change, except the system/kernel CPU usage.

All further ideas on how to debug this are greatly appreciated.

