Posted to user@cassandra.apache.org by Will Hayworth <wh...@atlassian.com> on 2016/02/07 00:28:47 UTC

Back to the futex()? :(

*tl;dr: other than CAS operations, what are the potential sources of lock
contention in C*?*

Hi all! :) I'm a novice Cassandra and Linux admin who's been preparing a
small cluster for production, and I've been seeing something weird. For
background: I'm running 3.2.1 on a cluster of 12 EC2 m4.2xlarges (32 GB
RAM, 8 HT cores) backed by 3.5 TB GP2 EBS volumes. Until late yesterday,
that was a cluster of 12 m4.xlarges with 3 TB volumes. I bumped it because,
while backloading historical data, I had been seeing awful throughput (20K
op/s at CL.ONE). I'd read through Al Tobey's *amazing* C* tuning guide
<https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html> once or
twice before, but this time I was careful and fixed a bunch of defaults that
just weren't right in cassandra.yaml, the JVM options, and the block device
parameters. Folks on IRC were super helpful as always (hat tip to Jeff Jirsa
in particular) and pointed out, for example, that I shouldn't be using DTCS
for loading historical data--heh. After changing to LTCS, unbatching my
writes*, reserving a CPU core for interrupts, and fixing the clocksource to
TSC, I finally hit 80K op/s early this morning. Hooray! :)

Now, my question: I'm still seeing a *ton* of blocked processes in the
vmstat output, anywhere from 2 to 9 per 10-second sample period--and this is
before EBS is even being hit! I've been trying in vain to figure out what
this could be--GC seems very quiet, after all. On the advice of Al's page,
I've been running strace and, indeed, I've been seeing *tens of thousands of
futex() calls* in periods of 10 or 20 seconds. What eludes me is *where* this
lock contention is coming from. I'm not using LWTs or performing any CAS
operations that I'm aware of. Assuming this isn't a red herring, what
gives?
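
(For reference, the strace invocation has been roughly the following--nothing
fancy, just attaching to the Cassandra PID and summarizing futex() counts;
assume the pgrep pattern needs adjusting for your own setup:)

    # follow all of Cassandra's threads for ~10s and summarize futex() calls
    sudo timeout 10 strace -f -c -e trace=futex -p "$(pgrep -f CassandraDaemon)"
    # and, alongside it, watch the blocked-process ("b") column
    vmstat 10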

Sorry for the essay--I just wanted to err on the side of more context--and
*thank you* for any advice you'd like to offer,
Will

P.S. More background if you'd like--I'm running on Amazon Linux 2015.09,
using jemalloc 3.6, JDK 1.8.0_65-b17. Here <http://pastebin.com/kuhBmHXG> is
my cassandra.yaml and here <http://pastebin.com/fyXeTfRa> are my JVM args.
I realized I neglected to adjust memtable_flush_writers as I was writing
this--so I'll get on that. Aside from that, I'm not sure what to do.
(Thanks, again, for reading.)
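
(And for completeness, here's roughly how I've been double-checking what I
actually shipped in cassandra.yaml--the path is the Amazon Linux package
layout, so adjust for your install:)

    grep -nE 'memtable_flush_writers|memtable_allocation_type|compaction_throughput' \
        /etc/cassandra/conf/cassandra.yaml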

* They were batched for consistency--I'm hoping to return to using them
when I'm back at normal load, which is tiny compared to backloading, but
the impact on performance was eye-opening.
___________________________________________________________
Will Hayworth
Developer, Engagement Engine
Atlassian

My pronoun is "they". <http://pronoun.is/they>

Re: Back to the futex()? :(

Posted by Will Hayworth <wh...@atlassian.com>.
Thanks for the links, Ben--I'll be sure to give those a read. And yeah, I
followed the CrowdStrike presentation in detail (including the stuff they
called out on Al Tobey's page). Again, the reason for the huge heap is that
otherwise my memtable can't actually fit in memory (no off-heap until 3.4),
but your point about tuning before true prod is well taken. :) (And I read
your post, too--the reason we started with m4.xlarges is in large part
because you all made it work.)

Nate--the RF is taken care of, thanks (otherwise I've seen issues where my
code can't log in to a given node, which makes sense) and, furthermore, I
ran a repair after doing all the initial loading. I'm not doing dynamic
permissions (though I'm hoping to use Vault <https://www.vaultproject.io/> to
generate short-lived user/password combinations soon), so I'll be sure to
adjust permissions_validity_in_ms.

Thank you both so much for your help!

___________________________________________________________
Will Hayworth
Developer, Engagement Engine
Atlassian

My pronoun is "they". <http://pronoun.is/they>




Re: Back to the futex()? :(

Posted by Nate McCall <na...@thelastpickle.com>.
I noticed you have authentication enabled. Make sure you set the following:

- the replication factor for the system_auth keyspace should equal the
number of nodes
- permissions_validity_in_ms is the permissions cache timeout; if you are not
doing dynamic permissions or creating/revoking frequently, turn this WAY up

These may not be the immediate cause, but they are definitely not helping
if left at the defaults.
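
For instance (rough sketch, untested--swap in your actual DC name, node
count, and credentials):

    # bump system_auth replication to match the cluster, then repair it
    cqlsh -u cassandra -p cassandra -e "ALTER KEYSPACE system_auth WITH replication = {'class': 'NetworkTopologyStrategy', 'us-east': 12};"
    nodetool repair system_auth
    # and in cassandra.yaml, crank the permissions cache way up, e.g.:
    #   permissions_validity_in_ms: 3600000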



-- 
-----------------
Nate McCall
Austin, TX
@zznate

Co-Founder & Sr. Technical Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

Re: Back to the futex()? :(

Posted by Ben Bromhead <be...@instaclustr.com>.
<unverified-conjecture>
I'm not surprised that you are seeing some lock contention when you profile
Cassandra, particularly given its SEDA architecture: there is a lot of
waiting that threads end up doing while requests make their way through the
various stages.

See
https://wiki.apache.org/cassandra/ArchitectureInternals
https://issues.apache.org/jira/browse/CASSANDRA-10989
https://blogs.oracle.com/roland/entry/real_time_java_and_futexes
</unverified-conjecture>

So I would say the futex() wait issue is a red herring in this case, given
that it will be inherent in most Cassandra deployments... the caveat is that
you are running 3.2.1, which is a very new version of Cassandra that may
have a new bug, and I'm not sure how many people here have experience with
it--especially given that the new tick-tock approach makes it hard to judge
when a release is ready for prime time.

Otherwise, follow the good folks at CrowdStrike for getting good performance
out of EBS (
http://www.slideshare.net/jimplush/1-million-writes-per-second-on-60-nodes-with-cassandra-and-ebs).
They have done all the hard work for the rest of us.

Reduce your JVM heap size to something closer to 8 GB. Given that your
cluster hasn't seen a production workload, I wouldn't worry about tuning the
heap etc. unless you see GC pressure in the logs. You don't want to spend a
lot of time tuning for backloading when the actual traffic will be / could
be different.
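
Roughly speaking, that's just the stock cassandra-env.sh knobs (file
location varies by package, and the new-gen number is only a starting
point):

    # in cassandra-env.sh: pin the heap rather than letting it auto-size
    MAX_HEAP_SIZE="8G"
    HEAP_NEWSIZE="800M"   # CMS young gen; usual rule of thumb is ~100MB per core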

The performance you are getting is roughly on par with what we have seen in
some early benchmarking of EBS volumes (
https://www.instaclustr.com/2015/10/28/cassandra-on-aws-ebs-infrastructure/),
but with machines half the size. We decided to go down a slightly different
path and use m4.xlarges; we are always playing with different configurations
to see what works best.


--
Ben Bromhead
CTO | Instaclustr <https://www.instaclustr.com/>
+1 650 284 9692
Managed Cassandra / Spark on AWS, Azure and Softlayer

Re: Back to the futex()? :(

Posted by Will Hayworth <wh...@atlassian.com>.
Additionally: this isn't the futex_wait bug (or at least it shouldn't
be?). Amazon
says <https://forums.aws.amazon.com/thread.jspa?messageID=623731> that was
fixed several kernel versions before mine, which
is 4.1.10-17.31.amzn1.x86_64. And the reason my heap is so large is
because, per CASSANDRA-9472, we can't use offheap until 3.4 is released.
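
(For anyone checking their own boxes, the verification was nothing more than
this--the "fixed in" version is whatever that AWS forum thread says, so treat
the comparison as mine, not gospel:)

    uname -r    # 4.1.10-17.31.amzn1.x86_64 here, several versions past the fix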

Will

___________________________________________________________
Will Hayworth
Developer, Engagement Engine
Atlassian

My pronoun is "they". <http://pronoun.is/they>


