Posted to user@cassandra.apache.org by Soerian Lieve <sl...@liveramp.com> on 2015/07/17 23:08:26 UTC

Unbalanced disk load

Hi,

I am currently benchmarking Cassandra with three machines, and on each
machine I am seeing an unbalanced distribution of data among the data
directories (one per disk).
I am concerned that this affects my write performance. Is there anything
I can do to make the distribution more even? Would RAID0 be my best
option?

Details:
3 machines, each with 24 cores, 64GB of RAM, and seven 500GB SSDs.
Commitlog is on a separate disk, and cassandra.yaml is configured according
to DataStax's guide on cassandra.yaml.
Total size of data is about 2TB, 14B records, all unique. Replication
factor of 1.
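
For reference, a JBOD layout like this is configured in cassandra.yaml with
one entry per disk under data_file_directories. A minimal sketch, assuming
hypothetical mount points for the seven SSDs and the commit log device:

# cassandra.yaml excerpt -- one data directory per SSD (example paths)
data_file_directories:
    - /data/1
    - /data/2
    - /data/3
    - /data/4
    - /data/5
    - /data/6
    - /data/7
# commit log on its own device
commitlog_directory: /commitlog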

Thanks,
Soerian

Re: Unbalanced disk load

Posted by Anuj Wadehra <an...@yahoo.co.in>.
Moreover, if you are using SSDs, keeping the data directories and commitlog on separate disks won't provide much benefit; the main reason to separate them is to avoid seek contention between sequential commitlog writes and data reads, which is a spinning-disk problem that SSDs don't have.


As Nate said, relying on RAID with RF=1 is not a good design. Cassandra replicas provide greater fault tolerance and HA because they live on different nodes.


Thanks

Anuj





Re: Unbalanced disk load

Posted by Nate McCall <na...@thelastpickle.com>.
>
> I am currently benchmarking Cassandra with three machines, and on each
> machine I am seeing an unbalanced distribution of data among the data
> directories (one per disk).
> I am concerned that this affects my write performance. Is there anything
> I can do to make the distribution more even? Would RAID0 be my best
> option?
>

Using LeveledCompactionStrategy should provide a much better balance.

However, depending on your use case, this may not be the right choice for
your workload, in which case RAID0 with a single data_dir will be the best
option.
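
A hedged sketch of what that looks like in CQL; the keyspace and table
names here are made up, and 160 MB is the usual default target sstable
size for LCS:

-- switch an existing table to leveled compaction (hypothetical names)
ALTER TABLE my_keyspace.my_table
  WITH compaction = {'class': 'LeveledCompactionStrategy',
                     'sstable_size_in_mb': 160};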

> Total size of data is about 2TB, 14B records, all unique. Replication
factor of 1.

RF=1 means *no* redundancy, which is a bad idea to run in production (and
sort of defeats the purpose of a system like Cassandra). It is also not
going to give an accurate picture for a load test, as it eliminates a lot
of the cross-node traffic you would see with a higher replication factor.
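
For a three-node benchmark cluster, a keyspace with real redundancy would
look something like this (the keyspace name is made up; SimpleStrategy is
fine for a single-DC test, while production clusters usually use
NetworkTopologyStrategy):

CREATE KEYSPACE benchmark
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
-- To raise RF on an existing keyspace instead, ALTER KEYSPACE with the new
-- replication map and then run "nodetool repair" so existing data is
-- streamed to the new replicas.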


--
-----------------
Nate McCall
Austin, TX
@zznate

Co-Founder & Sr. Technical Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

Re: Unbalanced disk load

Posted by Robert Coli <rc...@eventbrite.com>.
On Sat, Jul 18, 2015 at 10:09 AM, J. Ryan Earl <os...@jryanearl.us> wrote:

> Even with https://issues.apache.org/jira/browse/CASSANDRA-7386 data
> balancing across JBOD setups is pretty horrible.  Having used JBOD for
> about 2 years from 1.2.x and up, it is my opinion JBOD on Cassandra is
> nascent at best and far from mature.  For a variety of reasons, JBOD should
> perform better when IO and data are balanced across multiple devices, due
> to things like Linux device queues, striping overhead, access contention,
> and so forth.  However, data and access patterns simply are not balanced in
> Cassandra JBOD setups.
>

I have heard this, especially with STS (SizeTieredCompactionStrategy), over
the years from many different reporters.

JBOD seems to be best (only?) for the case where you have actual physical
disks you can replace and want nodes to continue to serve reads in the
short time until you do, and where you don't mind the cost of imbalance.

=Rob

Re: Unbalanced disk load

Posted by "J. Ryan Earl" <os...@jryanearl.us>.
Even with https://issues.apache.org/jira/browse/CASSANDRA-7386 data
balancing across JBOD setups is pretty horrible.  Having used JBOD for
about 2 years from 1.2.x and up, it is my opinion JBOD on Cassandra is
nascent at best and far from mature.  For a variety of reasons, JBOD should
perform better when IO and data are balanced across multiple devices, due
to things like Linux device queues, striping overhead, access contention,
and so forth.  However, data and access patterns simply are not balanced in
Cassandra JBOD setups.

Here's an example of what we see on one of our nodes:

Filesystem            Size  Used Avail Use% Mounted on
/dev/sdd1             1.1T  202G  915G  19% /data/2
/dev/sde1             1.1T  136G  982G  13% /data/3
/dev/sdi1             1.1T  217G  901G  20% /data/7
/dev/sdc1             1.1T  402G  715G  36% /data/1
/dev/sdh1             1.1T  187G  931G  17% /data/6
/dev/sdf1             1.1T  201G  917G  18% /data/4
/dev/sdg1             1.1T  154G  963G  14% /data/5

Essentially, for a storage engine to make good use of JBOD the way HDFS or
Ceph does, it needs to be designed from the ground up for JBOD.  In
Cassandra, a single sstable cannot be split at the storage engine level
across members of the JBOD.  In our case, we have a single sstable file
that is bigger than all the data files combined on the other disks.
Looking at the disk using 402G, we see:

274G vcells_polished_data-vcells_polished-jb-38985-Data.db

A single sstable is using 274G.  In addition to the data usage imbalance,
we see hot spots as well.  With static fields in particular, and CFs that
don't change much, you'll end up with CFs that compact into a small number
of large sstables.  With most of the data for a CF being in one sstable and
on one data volume, that single data volume then becomes a hotspot for
reads on that CF.  Cassandra tries to minimize the number of sstables a row
is written across, but after some compaction on CFs that are rarely
updated, most of the data for a CF can end up in a single sstable, and
sstables aren't split across data volumes.  Thus a single volume becomes a
hot-spot for access to that CF in a JBOD setup, as Cassandra does not
effectively distribute data across individual volumes in all circumstances.
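
A quick way to see this on a node, sketched under the assumption that the
volumes are mounted as in the df listing above, is to list the largest
sstable data files per JBOD volume:

# print the three largest sstable data files on each volume (GNU find)
for d in /data/[1-7]; do
    echo "== $d =="
    find "$d" -name '*-Data.db' -printf '%s\t%p\n' | sort -rn | head -3
done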

There may be tuning which would help this, but it's specific to JBOD and
not something you would have to worry about in a single-data-volume setup,
i.e. RAID0.  With RAID0 the downside, of course, is that losing a single
member disk of the RAID0 takes the node down.  The upside is that you don't
have to worry about the imbalance of both I/O and data footprint across
individual volumes.

Unlike HDFS, Ceph, and RAID for that matter, where you're dealing with
fixed maximum-size blocks/stripes that are then distributed at a granular
level across the JBOD volumes, Cassandra is dealing with uncapped,
low-granularity, variable-sized sstable data files which it attempts to
distribute across JBOD volumes, making JBOD far from ideal.  Frankly, it's
hard for me to imagine any columnar data store doing JBOD well.

On Fri, Jul 17, 2015 at 4:08 PM, Soerian Lieve <sl...@liveramp.com> wrote:

> Hi,
>
> I am currently benchmarking Cassandra with three machines, and on each
> machine I am seeing an unbalanced distribution of data among the data
> directories (one per disk).
> I am concerned that this affects my write performance. Is there anything
> I can do to make the distribution more even? Would RAID0 be my best
> option?
>
> Details:
> 3 machines, each with 24 cores, 64GB of RAM, and seven 500GB SSDs.
> Commitlog is on a separate disk, and cassandra.yaml is configured
> according to DataStax's guide on cassandra.yaml.
> Total size of data is about 2TB, 14B records, all unique. Replication
> factor of 1.
>
> Thanks,
> Soerian
>