Posted to user@cassandra.apache.org by Mason Hale <ma...@onespot.com> on 2010/07/07 18:23:17 UTC

Cluster performance degrades if any single node is slow

We've been experiencing some cluster-wide performance issues if any single
node in the cluster is performing poorly. For example this occurs if
compaction is running on any node in the cluster, or if a new node is being
bootstrapped.

We believe the root cause of this issue is a performance optimization in
Cassandra that requests the "full" data from only a single node in the
cluster, and MD5 checksums of the same data from all other nodes (depending
on the consistency level of the read) for a given request. The net effect of
this optimization is that the read will block until the data is received
from the node that is replying with the full data, even if all other nodes are
responding much more quickly. Thus the entire cluster is only as fast as the
slowest node for some fraction of all requests. The portion of requests sent
to the slow node is a function of the total cluster size, replication factor
and consistency level. For smallish clusters (e.g. 10 or fewer servers)
this performance degradation can be quite pronounced.
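
For illustration, here is a minimal sketch of the comparison I understand
the coordinator to be making (the class and method names below are mine,
not Cassandra's actual internals). The point is that the digests are
useless on their own: nothing can be returned to the client until the
single full-data response arrives.

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only -- not Cassandra's actual read path.
// One replica returns the row bytes; the remaining replicas return only
// an MD5 digest of the same data. The coordinator cannot answer the
// client until the full-data response arrives, no matter how quickly
// the digest-only replicas reply.
public class DigestReadSketch {

    static byte[] md5(byte[] data) throws NoSuchAlgorithmException {
        return MessageDigest.getInstance("MD5").digest(data);
    }

    // Returns the row if every digest matches the full data; a mismatch
    // means the replicas disagree and a repair / full-data retry is needed.
    static byte[] resolve(byte[] fullData, List<byte[]> digestReplies)
            throws NoSuchAlgorithmException {
        byte[] expected = md5(fullData);
        for (byte[] digest : digestReplies) {
            if (!Arrays.equals(expected, digest)) {
                throw new IllegalStateException("digest mismatch between replicas");
            }
        }
        return fullData; // only now can the coordinator reply to the client
    }
}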

CASSANDRA-981 (https://issues.apache.org/jira/browse/CASSANDRA-981)
discusses this issue and proposes the solution of dynamically identifying
slow nodes and automatically treating them as if they were on a remote
network, thus preventing certain performance critical operations (such as
full data requests) from being performed on that node. This seems like a
fine solution.
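
My rough reading of that proposal, as a sketch (my own paraphrase, not the
actual patch): keep a running latency score per endpoint and steer
full-data reads away from the outliers.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical paraphrase of the CASSANDRA-981 idea, not its implementation:
// maintain an exponentially weighted latency score per endpoint and stop
// sending full-data reads to endpoints scoring far above the best one.
public class LatencyScoreSketch {
    private static final double ALPHA = 0.2; // weight given to the newest sample

    private final Map<String, Double> scores = new ConcurrentHashMap<>();

    void record(String endpoint, double latencyMillis) {
        scores.merge(endpoint, latencyMillis,
                (old, fresh) -> (1 - ALPHA) * old + ALPHA * fresh);
    }

    // Treat an endpoint as "slow" if its score is, say, twice the best score.
    boolean isSlow(String endpoint) {
        double best = scores.values().stream()
                .mapToDouble(Double::doubleValue).min().orElse(0.0);
        double score = scores.getOrDefault(endpoint, 0.0);
        return best > 0 && score > 2 * best;
    }
}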

However, a design that requires any read operation to wait on the reply from
a specific single node seems counter to the fundamental design goal of
avoiding any single points of failure. In this case, a single node with
degraded performance (but still online) can dramatically reduce the overall
performance of the cluster. The proposed solution would dynamically detect
this condition and take evasive action when the condition is detected, but
it would require some number of requests to perform poorly before a slow
node is detected. It also smells like a complex solution that could have
some unexpected side-effects and edge-cases.

I wonder if a simpler solution would be more effective here? In the same way
that hinted handoff can now be disabled via configuration, would it be
feasible to optionally turn off this optimization? This way I can make the
trade-off decision between the incremental performance improvement from this
optimization and more reliable cluster-wide performance. Ideally, I would be
able to configure how many nodes should reply with "full data" with each
request. Thus I could increase this from 1 to 2 to avoid cluster-wide
performance degradation if any single node is performing poorly. By being
able to turn off or tune this setting I would also be able to do some a/b
testing to evaluate what performance benefit is being gained by this
optimization.
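
To make the suggestion concrete, here is a rough sketch of the kind of
knob I have in mind (everything below is hypothetical; fullDataReplicas is
an invented setting, and none of this reflects the current StorageProxy):

import java.util.List;

// Hypothetical sketch of the proposed knob: the operator chooses how many
// replicas answer with full data; the rest answer with MD5 digests only.
// With fullDataReplicas = 2, the consistency level can be satisfied without
// waiting on any one specific data replica, so a single slow node no longer
// stalls every read it serves.
public class ReadPlanSketch {

    public static final class ReadPlan {
        final List<String> dataEndpoints;   // ask these for full data
        final List<String> digestEndpoints; // ask these for MD5 digests only

        ReadPlan(List<String> data, List<String> digests) {
            this.dataEndpoints = data;
            this.digestEndpoints = digests;
        }
    }

    // liveReplicas is assumed to already be sorted by proximity/health
    public static ReadPlan plan(List<String> liveReplicas, int fullDataReplicas) {
        int n = Math.min(fullDataReplicas, liveReplicas.size());
        return new ReadPlan(liveReplicas.subList(0, n),
                            liveReplicas.subList(n, liveReplicas.size()));
    }
}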

I'm curious to know if anyone else has run into this issue, and if anyone
else wishes they could turn off or tune this "full data"/md5 performance
optimization?

thanks,
Mason

Re: Cluster performance degrades if any single node is slow

Posted by Jonathan Ellis <jb...@gmail.com>.
On Wed, Jul 7, 2010 at 11:50 AM, Mason Hale <ma...@onespot.com> wrote:
> I'm curious what performance benefit is actually being gained from this
> optimization.

It's really pretty easy to saturate gigabit ethernet with a Cassandra
cluster.  Multiplying traffic by roughly RF is definitely a loss.
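
As a rough back-of-envelope (assumed numbers, not measurements): at 20,000
reads/s of 2 KB rows with RF = 3, a single full copy per read is about 40
MB/s of response traffic, while full responses from all three replicas
would be roughly 120 MB/s, which is already at the practical ceiling of
gigabit ethernet. The two extra 16-byte MD5 digests per read, by contrast,
are noise.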

> Sorry to keep beating this horse, but we're regularly being alerted to
> performance issues any time a mini-compaction occurs on any node in our
> cluster. I'm fishing for a quick and easy way to prevent this from
> happening.

I believe Brandon is in the process of suggesting to Scott that he enable
compaction de-prioritization
(https://issues.apache.org/jira/browse/CASSANDRA-1181).

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com

Re: Cluster performance degrades if any single node is slow

Posted by Stu Hood <st...@rackspace.com>.
Could we conditionally use an MD5 request only if a node was in a different zone/datacenter according to the replication strategy? Presumably the bandwidth usage within a datacenter isn't a concern as much as latency.
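
Roughly along these lines, as a sketch only (the datacenter strings would
come from the snitch or replication strategy; this is not real Cassandra
code):

// Sketch of the idea above, not Cassandra's actual read path. Replicas in
// the coordinator's own datacenter return full data, since local bandwidth
// is cheap; replicas in any other datacenter return MD5 digests only, to
// keep the cross-DC link free for latency-sensitive traffic.
public class DigestPolicySketch {
    static boolean digestOnly(String coordinatorDc, String replicaDc) {
        return !coordinatorDc.equals(replicaDc);
    }
}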

-----Original Message-----
From: "Mason Hale" <ma...@onespot.com>
Sent: Wednesday, July 7, 2010 11:50am
To: user@cassandra.apache.org
Subject: Re: Cluster performance degrades if any single node is slow

On Wed, Jul 7, 2010 at 11:33 AM, Jonathan Ellis <jb...@gmail.com> wrote:

> Having a few requests time out while the service detects badness is
> typical in this kind of system.  I don't think writing a completely
> separate StorageProxy + supporting classes to allow avoiding this in
> exchange for RF times the network bandwidth is a good idea.


My suggestion of turning off (or tuning) the full-data/md5 optimization
assumes that exposing this as a configuration option would be less work and
less complicated than dynamically detecting and routing around slow nodes.
From your reply, it sounds as if this assumption does not hold. I assumed
(without looking at the code) that the existing StorageProxy could expose
this as a configuration option without requiring a lot of additional work
and certainly without requiring an entirely separate set of supporting
classes.

I'm curious what performance benefit is actually being gained from this
optimization. Has this benefit been tested and measured? Since the benefit
would depend greatly on the size of the data being requested, the performance
improvement for smallish requests would be negligible, correct?
Given that the reliability downside is rather severe, this feels like a
trade-off a system administrator would like to be able to make.

Sorry to keep beating this horse, but we're regularly being alerted to
performance issues any time a mini-compaction occurs on any node in our
cluster. I'm fishing for a quick and easy way to prevent this from
happening.

thanks,
Mason



Re: Cluster performance degrades if any single node is slow

Posted by Mason Hale <ma...@onespot.com>.
On Wed, Jul 7, 2010 at 11:33 AM, Jonathan Ellis <jb...@gmail.com> wrote:

> Having a few requests time out while the service detects badness is
> typical in this kind of system.  I don't think writing a completely
> separate StorageProxy + supporting classes to allow avoiding this in
> exchange for RF times the network bandwidth is a good idea.


My suggestion of turning off (or tuning) the full-data/md5 optimization
assumes that exposing this as a configuration option would be less work and
less complicated than dynamically detecting and routing around slow nodes.
From your reply, it sounds as if this assumption does not hold. I assumed
(without looking at the code) that the existing StorageProxy could expose
this as a configuration option without requiring a lot of additional work
and certainly without requiring an entirely separate set of supporting
classes.

I'm curious what performance benefit is actually being gained from this
optimization. Has this benefit been tested and measured? Since the benefit
would depend greatly on the size of the data being requested, the performance
improvement for smallish requests would be negligible, correct?
Given that the reliability downside is rather severe, this feels like a
trade-off a system administrator would like to be able to make.
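
To put rough numbers on that (my own assumptions, not measurements): for a
100-byte column an MD5 digest response saves well under 100 bytes per extra
replica, while for a 1 MB row it saves nearly the full megabyte, so the win
seems real for large reads and close to noise for small ones.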

Sorry to keep beating this horse, but we're regularly being alerted to
performance issues any time a mini-compaction occurs on any node in our
cluster. I'm fishing for a quick and easy way to prevent this from
happening.

thanks,
Mason

Re: Cluster performance degrades if any single node is slow

Posted by Jonathan Ellis <jb...@gmail.com>.
Having a few requests time out while the service detects badness is
typical in this kind of system.  I don't think writing a completely
separate StorageProxy + supporting classes to allow avoiding this in
exchange for RF times the network bandwidth is a good idea.

On Wed, Jul 7, 2010 at 11:23 AM, Mason Hale <ma...@onespot.com> wrote:
> We've been experiencing some cluster-wide performance issues if any single
> node in the cluster is performing poorly. For example this occurs if
> compaction is running on any node in the cluster, or if a new node is being
> bootstrapped.
> We believe the root cause of this issue is a performance optimization in
> Cassandra that requests the "full" data from only a single node in the
> cluster, and MD5 checksums of the same data from all other nodes (depending
> on the consistency level of the read) for a given request. The net effect of
> this optimization is that the read will block until the data is received
> from the node that is replying with the full data, even if all other nodes are
> responding much more quickly. Thus the entire cluster is only as fast as the
> slowest node for some fraction of all requests. The portion of requests sent
> to the slow node is a function of the total cluster size, replication factor
> and consistency level. For smallish clusters (e.g. 10 or fewer servers)
> this performance degradation can be quite pronounced.
> CASSANDRA-981 (https://issues.apache.org/jira/browse/CASSANDRA-981)
> discusses this issue and proposes the solution of dynamically identifying
> slow nodes and automatically treating them as if they were on a remote
> network, thus preventing certain performance critical operations (such as
> full data requests) from being performed on that node. This seems like a
> fine solution.
> However, a design that requires any read operation to wait on the reply from
> a specific single node seems counter to the fundamental design goal of
> avoiding any single points of failure. In this case, a single node with
> degraded performance (but still online) can dramatically reduce the overall
> performance of the cluster. The proposed solution would dynamically detect
> this condition and take evasive action when the condition is detected, but
> it would require some number of requests to perform poorly before a slow
> node is detected. It also smells like a complex solution that could have
> some unexpected side-effects and edge-cases.
> I wonder if a simpler solution would be more effective here? In the same way
> that hinted handoff can now be disabled via configuration, would it be
> feasible to optionally turn off this optimization? This way I can make the
> trade-off decision between the incremental performance improvement from this
> optimization and more reliable cluster-wide performance. Ideally, I would be
> able to configure how many nodes should reply with "full data" with each
> request. Thus I could increase this from 1 to 2 to avoid cluster-wide
> performance degradation if any single node is performing poorly. By being
> able to turn off or tune this setting I would also be able to do some a/b
> testing to evaluate what performance benefit is being gained by this
> optimization.
> I'm curious to know if anyone else has run into this issue, and if anyone
> else wishes they could turn off or tune this "full data"/md5 performance
> optimization?
> thanks,
> Mason
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com