Posted to user@cassandra.apache.org by Lorand Kasler <lo...@soundcloud.com> on 2016/01/28 17:11:29 UTC

Read operations freeze for a few seconds while adding a new node

Hi,

We are struggling with a problem: when adding nodes, around 5% of read
operations freeze (i.e. time out after 1 second) for a short window of
10-20 seconds. It might not seem like much, but at around 200k requests per
second that is quite a big disruption.  It is well documented and known
that adding nodes *has* an impact on latency and on request completion,
but is there a way to lessen it?
It is completely okay for write operations to fail or get blocked while
adding nodes, but having the read path impacted this much as well (going
from a 30 millisecond 99th percentile latency to above 1 second) is what
puzzles us.
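
To put those numbers in perspective, here is a rough sketch of the arithmetic. It assumes the 5% applies to the full 200k requests per second, which the post does not state explicitly, and the function name is made up for illustration.

```python
def timed_out_reads(total_rps, affected_percent, window_seconds):
    """Estimate how many requests time out during one freeze window.

    Integer arithmetic is used so the result is exact.
    """
    return total_rps * affected_percent * window_seconds // 100


# Figures from the post: ~200k requests/s, ~5% affected, 10-20 second window.
low = timed_out_reads(200_000, 5, 10)
high = timed_out_reads(200_000, 5, 20)
print(f"roughly {low:,} to {high:,} timed-out requests per freeze")
```

Under that assumption, each freeze costs on the order of a hundred thousand timed-out requests.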

We have a 36 node cluster, every node owning ~120 GB of data. We are using
Cassandra version 2.0.14 with vnodes and we are in the process of
increasing the capacity of the cluster by roughly doubling the number of
nodes.  The nodes have SSDs and a peak IO usage of ~30%.

Apart from the latency metrics, the only anomaly is that FlushWriter tasks
are blocked 18% of the time (based on the tpstats counters), but that
should only block writes and not reads, right?
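
For anyone wanting to reproduce a figure like the 18% above, one possible way to derive it is sketched below. This is an assumption about how the number was computed (the "All time blocked" counter divided by "Completed"), and it assumes the pool in question is FlushWriter and that `nodetool tpstats` prints six whitespace-separated columns per pool. The sample text is invented, not taken from this cluster.

```python
def blocked_ratio(tpstats_text, pool="FlushWriter"):
    """Return all-time-blocked / completed for one pool, or None if not found.

    Assumed column order per row:
    name, active, pending, completed, blocked, all-time-blocked.
    """
    for line in tpstats_text.splitlines():
        fields = line.split()
        if len(fields) == 6 and fields[0] == pool:
            completed = int(fields[3])
            all_time_blocked = int(fields[5])
            return all_time_blocked / completed if completed else 0.0
    return None


# Invented sample output, for illustration only.
sample_tpstats = """\
Pool Name      Active   Pending   Completed   Blocked   All time blocked
ReadStage           2         0     9000000         0                  0
FlushWriter         0         0        1000         0                180
"""

ratio = blocked_ratio(sample_tpstats)
print(f"FlushWriter blocked ratio: {ratio:.0%}")
```

Note that these counters accumulate since the node started, so a ratio like this describes the node's whole uptime rather than the 10-20 second window specifically.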

Thank you

Re: Read operations freeze for a few seconds while adding a new node

Posted by Anuj Wadehra <an...@yahoo.co.in>.

Hi Lorand,
Do you see any different GC pattern during these 20 seconds?
In 2.0.x, memtables create a lot of heap pressure, so in a way reads are not isolated from writes.
Frankly speaking, I would have accepted 20 seconds of slowness, as scaling is a one-time activity. But maybe your business case doesn't make that acceptable.
Such tough requirements often drive improvements.

Thanks
Anuj

Re: Read operations freeze for a few seconds while adding a new node

Posted by Jeff Jirsa <je...@crowdstrike.com>.

Is this during streaming plan setup (that is, does your 10-20 second window of impact fall approximately 30 seconds after you start the node that’s joining the ring), or does it happen for the entire time you’re joining the node to the ring?

If it’s the former, there’s a chance it’s GC related: the streaming plan code used to instantiate ALL of the compression metadata chunks as part of its calculations, which creates a fair amount of garbage and, in turn, some GC activity. https://issues.apache.org/jira/browse/CASSANDRA-10680 was created due to some edge cases (very small compression chunk size + 3 TB of data per node = hundreds of millions of objects), but it’s possible that you’re seeing a less extreme version of that same behavior.
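
One way to check the GC theory is to scan the Cassandra system log for long pauses reported around the time the new node starts. The sketch below is only illustrative: the log line shape in the regular expression is an assumption about what GCInspector prints on 2.0.x, the 500 ms threshold is arbitrary, and the sample lines are invented.

```python
import re

# Assumed line shape (an assumption, not taken from this thread), e.g.:
#   ... GCInspector.java (line 116) GC for ParNew: 812 ms for 1 collections, ...
GC_LINE = re.compile(
    r"GC for (?P<collector>\w+): (?P<ms>\d+) ms for (?P<count>\d+) collections"
)


def long_gc_pauses(log_lines, threshold_ms=500):
    """Return (collector, reported_ms) for each GC report at or above threshold_ms."""
    hits = []
    for line in log_lines:
        match = GC_LINE.search(line)
        if match and int(match.group("ms")) >= threshold_ms:
            hits.append((match.group("collector"), int(match.group("ms"))))
    return hits


# Invented sample lines, for illustration only.
sample_log = [
    "INFO [ScheduledTasks:1] GCInspector.java (line 116) GC for ParNew: 212 ms for 1 collections, 100 used; max is 200",
    "INFO [ScheduledTasks:1] GCInspector.java (line 116) GC for ConcurrentMarkSweep: 4321 ms for 2 collections, 150 used; max is 200",
    "INFO [FlushWriter:3] Memtable.java (line 362) Writing Memtable",
]

for collector, ms in long_gc_pauses(sample_log):
    print(f"{collector}: {ms} ms")
```

If long pauses on the existing nodes line up with the read timeouts, that would support the GC explanation; if the log shows nothing unusual, the cause is more likely elsewhere.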

Re: Read operations freeze for a few seconds while adding a new node

Posted by Jonathan Haddad <jo...@jonhaddad.com>.

If you've got a read-heavy workload, you should check out
http://blakeeggleston.com/cassandra-tuning-the-jvm-for-read-heavy-workloads.html


