Posted to user@cassandra.apache.org by Eric Rosenberry <ep...@gmail.com> on 2010/03/10 09:00:23 UTC

Effective allocation of multiple disks

Based on the documentation, it is clear that with Cassandra you want to have
one disk for commitlog, and one disk for data.

My question is: If you think your workload is going to require more io
performance to the data disks than a single disk can handle, how would you
recommend effectively utilizing additional disks?

It would seem a number of vendors sell 1U boxes with four 3.5 inch disks.
 If we use one for commitlog, is there a way to have Cassandra itself
equally split data across the three remaining disks?  Or is this something
that needs to be handled by the hardware level, or operating system/file
system level?

Options include a hardware RAID controller in a RAID 0 stripe (this is more
$$$ and for what gain?), or utilizing a volume manager like LVM.

Along those same lines, if you do implement some type of striping, what RAID
stripe size is recommended?  (I think Todd Burruss asked this earlier but I
did not see a response)

Thanks for any input!

-Eric

Re: Effective allocation of multiple disks

Posted by Jonathan Ellis <jb...@gmail.com>.

Thanks for testing that, added a note to
http://wiki.apache.org/cassandra/CassandraHardware on stripe size.

On Wed, Mar 10, 2010 at 11:03 AM, B. Todd Burruss <bb...@real.com> wrote:
> with the file sizes we're talking about with cassandra and other database
> products, the stripe size doesn't seem to matter.  i suppose there may be a
> modicum of overhead with a small stripe size, but i'm not sure.  mine is set
> to 128k, which produced the same results as 16k and 256k.
>
> i will say the number of drives within the RAID 0 setup does seem to matter.
>  more you have the more parallelism you can get with a good RAID controller.
>
> Eric Rosenberry wrote:
>>
>> Based on the documentation, it is clear that with Cassandra you want to
>> have one disk for commitlog, and one disk for data.
>>
>> My question is: If you think your workload is going to require more io
>> performance to the data disks than a single disk can handle, how would you
>> recommend effectively utilizing additional disks?
>>
>> It would seem a number of vendors sell 1U boxes with four 3.5 inch disks.
>>  If we use one for commitlog, is there a way to have Cassandra itself
>> equally split data across the three remaining disks?  Or is this something
>> that needs to be handled by the hardware level, or operating system/file
>> system level?
>>
>> Options include a hardware RAID controller in a RAID 0 stripe (this is
>> more $$$ and for what gain?), or utilizing a volume manager like LVM.
>>
>> Along those same lines, if you do implement some type of striping, what
>> RAID stripe size is recommended?  (I think Todd Burruss asked this earlier
>> but I did not see a response)
>>
>> Thanks for any input!
>>
>> -Eric
>

Re: Effective allocation of multiple disks

Posted by "B. Todd Burruss" <bb...@real.com>.

with the file sizes we're talking about with cassandra and other 
database products, the stripe size doesn't seem to matter.  i suppose 
there may be a modicum of overhead with a small stripe size, but i'm not 
sure.  mine is set to 128k, which produced the same results as 16k and 256k.

i will say the number of drives within the RAID 0 setup does seem to 
matter.  the more you have, the more parallelism you can get with a good 
RAID controller.
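
To illustrate the point above, here is a minimal Python sketch of how a
RAID 0 layout could map logical offsets onto disks.  The function names, the
three-disk setup and the 64 MB read size are assumptions made up for this
example; real controllers may lay data out differently.  It only shows that a
large sequential read keeps every disk busy at any of the stripe sizes
mentioned, so the number of disks is what changes the available parallelism.

```python
KB = 1024
MB = 1024 * KB


def raid0_locate(offset, stripe_size, num_disks):
    """Map a logical byte offset to (disk index, byte offset on that disk)."""
    stripe_index = offset // stripe_size
    disk = stripe_index % num_disks
    disk_offset = (stripe_index // num_disks) * stripe_size + offset % stripe_size
    return disk, disk_offset


def disks_touched(start, length, stripe_size, num_disks):
    """Return the set of disks that a contiguous read of `length` bytes hits."""
    first = start // stripe_size
    last = (start + length - 1) // stripe_size
    return {s % num_disks for s in range(first, last + 1)}


# Hypothetical workload: one 64 MB sequential read on a 3-disk stripe.
for stripe in (16 * KB, 128 * KB, 256 * KB):
    busy = len(disks_touched(0, 64 * MB, stripe, 3))
    print(stripe // KB, "KB stripe:", busy, "disks busy")
```

With files this large, every stripe size tried in the loop spreads the read
over all three disks.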

Eric Rosenberry wrote:
> Based on the documentation, it is clear that with Cassandra you want 
> to have one disk for commitlog, and one disk for data.
>
> My question is: If you think your workload is going to require more io 
> performance to the data disks than a single disk can handle, how would 
> you recommend effectively utilizing additional disks?
>
> It would seem a number of vendors sell 1U boxes with four 3.5 inch 
> disks.  If we use one for commitlog, is there a way to have Cassandra 
> itself equally split data across the three remaining disks?  Or is 
> this something that needs to be handled by the hardware level, or 
> operating system/file system level?
>
> Options include a hardware RAID controller in a RAID 0 stripe (this is 
> more $$$ and for what gain?), or utilizing a volume manager like LVM.
>
> Along those same lines, if you do implement some type of striping, 
> what RAID stripe size is recommended?  (I think Todd Burruss asked 
> this earlier but I did not see a response)
>
> Thanks for any input!
>
> -Eric

Re: Effective allocation of multiple disks

Posted by Anthony Molinaro <an...@alumni.caltech.edu>.

I'm still wondering what happens when you have something like two 500GB disks
with two sstables of 250GB each, one on each disk, and then a major compaction
occurs.  Will it still compact and probably fill up a disk (especially with
the 2x overhead of compaction mentioned either here or on the wiki)?
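
The arithmetic behind that question can be sketched in Python.  This is only
an illustration under stated assumptions: the merged output is taken to be as
large as all inputs combined, it must land in a single directory, and the
inputs stay on disk until the compaction finishes.  The function name is
hypothetical and is not part of Cassandra.

```python
GB = 10**9


def disks_that_fit(capacities, used, input_sizes):
    """Return indexes of disks with enough free space for the merged output.

    Assumes the worst case: output size equals the sum of the inputs, and
    the inputs are not deleted until the output is fully written.
    """
    output_size = sum(input_sizes)
    return [
        i
        for i, (cap, in_use) in enumerate(zip(capacities, used))
        if cap - in_use >= output_size
    ]


# The scenario above: two 500GB disks, one 250GB sstable on each.
capacities = [500 * GB, 500 * GB]
used = [250 * GB, 250 * GB]
sstables = [250 * GB, 250 * GB]
print(disks_that_fit(capacities, used, sstables))
```

Under these assumptions the list comes back empty: each disk has 250GB free,
but the combined output needs 500GB in one place.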

Seems like you could easily get into a situation where you can't
fix it without something like a volume manager, or a complete shutdown to
move the data to a bigger disk.

I guess one way might be to treat each disk as a separate node (i.e., give
it some fraction of the keyspace based on its disk space), then when you
add a directory to the config you would have to load balance, but only
within that node.  I'm sure that complicates ring maintenance, but maybe
it's a better experience, as the multiple data directories should all fill
uniformly?
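
As a rough sketch of that idea (not something Cassandra does, as far as this
thread says), a node's slice of the keyspace could be divided among its disks
in proportion to their capacity.  The function name and the 0..1000 range are
placeholders chosen for readability.

```python
def split_range(start, end, capacities):
    """Split [start, end) into contiguous sub-ranges sized by disk capacity."""
    total = sum(capacities)
    span = end - start
    bounds = [start]
    running = 0
    for cap in capacities:
        running += cap
        # Integer arithmetic keeps the last bound exactly equal to `end`.
        bounds.append(start + span * running // total)
    return list(zip(bounds[:-1], bounds[1:]))


# Hypothetical node owning the range 0..1000 with a 250GB and a 750GB disk.
print(split_range(0, 1000, [250, 750]))
```

Adding a disk would mean recomputing the bounds and moving data between the
local directories only, which is the "load balance within that node" step.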

Just some other thoughts.

-Anthony

On Thu, Mar 11, 2010 at 12:45:14PM -0600, Jonathan Ellis wrote:
> Except that for a major compaction the whole thing gets put in one
> directory.  That's the problem w/ the JBOD approach.
> 
> On Thu, Mar 11, 2010 at 12:01 PM, Eric Evans <ee...@rackspace.com> wrote:
> > On Wed, 2010-03-10 at 23:20 -0600, Jonathan Ellis wrote:
> >> On Wed, Mar 10, 2010 at 9:31 PM, Anthony Molinaro
> >> <an...@alumni.caltech.edu> wrote:
> >> > I would almost
> >> > recommend just keeping things simple and removing multiple data
> >> directories
> >> > from the config altogether and just documenting that you should plan
> >> on using
> >> > OS level mechanisms for growing diskspace and io.
> >>
> >> I think that is a pretty sane suggestion actually.
> >
> > Or maybe leave the code as is and just document the situation more
> > clearly? If you're adding more disks to increase storage capacity and
> > you don't strictly need the extra IO, then multiple data directories
> > might be preferable to other forms of aggregation (it's certainly
> > simpler than say a volume manager).
> >
> > --
> > Eric Evans
> > eevans@rackspace.com
> >
> >

-- 
------------------------------------------------------------------------
Anthony Molinaro                           <an...@alumni.caltech.edu>

Re: Effective allocation of multiple disks

Posted by Ryan King <ry...@twitter.com>.

We're going to use software RAID.

-ryan

On Fri, Mar 12, 2010 at 9:24 AM, Eric Rosenberry <ep...@gmail.com> wrote:
> Ryan-
> Are you going to use software or hardware based RAID 0?
>
> Does anyone on the list have any data to compare the performance of hardware
> RAID 0 vs. software LVM RAID 0?
> I would think software RAID 0 would be fine since there is no actual
> computation being done...
> Thanks!
>
> -Eric
>
> On Thu, Mar 11, 2010 at 1:16 PM, Ryan King <ry...@twitter.com> wrote:
>>
>> Even without major compaction, you can get significant imbalances in
>> how much data is on each disk which will bottleneck your IO
>> throughput. We're running JBOD right now, but going to switch to RAID
>> 0 soon.
>>
>> -ryan
>
>

Re: Effective allocation of multiple disks

Posted by Eric Rosenberry <ep...@gmail.com>.

Ryan-

Are you going to use software or hardware based RAID 0?

Does anyone on the list have any data to compare the performance of hardware
RAID 0 vs. software LVM RAID 0?

I would think software RAID 0 would be fine since there is no actual
computation being done...

Thanks!

-Eric

On Thu, Mar 11, 2010 at 1:16 PM, Ryan King <ry...@twitter.com> wrote:
>
>
> Even without major compaction, you can get significant imbalances in
> how much data is on each disk which will bottleneck your IO
> throughput. We're running JBOD right now, but going to switch to RAID
> 0 soon.
>
> -ryan
>

Re: Effective allocation of multiple disks

Posted by Ryan King <ry...@twitter.com>.

On Thu, Mar 11, 2010 at 10:45 AM, Jonathan Ellis <jb...@gmail.com> wrote:
> Except that for a major compaction the whole thing gets put in one
> directory.  That's the problem w/ the JBOD approach.

Even without major compaction, you can get significant imbalances in
how much data is on each disk which will bottleneck your IO
throughput. We're running JBOD right now, but going to switch to RAID
0 soon.
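
A back-of-the-envelope Python sketch of that bottleneck, assuming (this is an
assumption, not a measurement from our cluster) that read load on each disk
is proportional to the data stored on it and that every disk tops out at the
same rate.  The function name and the numbers are made up for illustration.

```python
def effective_throughput(data_per_disk, per_disk_mb_s):
    """Estimate aggregate MB/s when the fullest disk saturates first.

    If load follows data placement, the busiest disk serves
    max(data) / sum(data) of all requests, so the whole set can go no
    faster than per_disk_mb_s divided by that share.
    """
    return per_disk_mb_s * sum(data_per_disk) / max(data_per_disk)


# Three disks at a hypothetical 100 MB/s each, data sizes in GB.
print(effective_throughput([100, 100, 100], 100))  # evenly filled
print(effective_throughput([400, 50, 50], 100))    # badly skewed
```

With an even spread the three disks behave like 300 MB/s; with the skewed
layout the estimate drops to 125 MB/s, barely better than a single disk.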

-ryan

Re: Effective allocation of multiple disks

Posted by Jonathan Ellis <jb...@gmail.com>.

Except that for a major compaction the whole thing gets put in one
directory.  That's the problem w/ the JBOD approach.

On Thu, Mar 11, 2010 at 12:01 PM, Eric Evans <ee...@rackspace.com> wrote:
> On Wed, 2010-03-10 at 23:20 -0600, Jonathan Ellis wrote:
>> On Wed, Mar 10, 2010 at 9:31 PM, Anthony Molinaro
>> <an...@alumni.caltech.edu> wrote:
>> > I would almost
>> > recommend just keeping things simple and removing multiple data
>> directories
>> > from the config altogether and just documenting that you should plan
>> on using
>> > OS level mechanisms for growing diskspace and io.
>>
>> I think that is a pretty sane suggestion actually.
>
> Or maybe leave the code as is and just document the situation more
> clearly? If you're adding more disks to increase storage capacity and
> you don't strictly need the extra IO, then multiple data directories
> might be preferable to other forms of aggregation (it's certainly
> simpler than say a volume manager).
>
> --
> Eric Evans
> eevans@rackspace.com
>
>

Re: Effective allocation of multiple disks

Posted by Ted Zlatanov <tz...@lifelogs.com>.

On Thu, 11 Mar 2010 12:01:27 -0600 Eric Evans <ee...@rackspace.com> wrote: 

EE> On Wed, 2010-03-10 at 23:20 -0600, Jonathan Ellis wrote:
>> On Wed, Mar 10, 2010 at 9:31 PM, Anthony Molinaro
>> <an...@alumni.caltech.edu> wrote:
>> > I would almost recommend just keeping things simple and removing
>> > multiple data directories from the config altogether and just
>> > documenting that you should plan on using OS level mechanisms for
>> > growing diskspace and io.
>> 
>> I think that is a pretty sane suggestion actually. 

EE> Or maybe leave the code as is and just document the situation more
EE> clearly? If you're adding more disks to increase storage capacity
EE> and you don't strictly need the extra IO, then multiple data
EE> directories might be preferable to other forms of aggregation (it's
EE> certainly simpler than say a volume manager).

Could Cassandra use a block device as raw storage?  You avoid the
filesystem overhead and it lets the sysadmin determine the best kind of
device (RAID or not underneath) to allocate.

Ted

Re: Effective allocation of multiple disks

Posted by Eric Evans <ee...@rackspace.com>.

On Wed, 2010-03-10 at 23:20 -0600, Jonathan Ellis wrote:
> On Wed, Mar 10, 2010 at 9:31 PM, Anthony Molinaro
> <an...@alumni.caltech.edu> wrote:
> > I would almost
> > recommend just keeping things simple and removing multiple data
> directories
> > from the config altogether and just documenting that you should plan
> on using
> > OS level mechanisms for growing diskspace and io.
> 
> I think that is a pretty sane suggestion actually. 

Or maybe leave the code as is and just document the situation more
clearly? If you're adding more disks to increase storage capacity and
you don't strictly need the extra IO, then multiple data directories
might be preferable to other forms of aggregation (it's certainly
simpler than say a volume manager).

-- 
Eric Evans
eevans@rackspace.com

Re: Effective allocation of multiple disks

Posted by Jonathan Ellis <jb...@gmail.com>.

On Wed, Mar 10, 2010 at 9:31 PM, Anthony Molinaro
<an...@alumni.caltech.edu> wrote:
> I would almost
> recommend just keeping things simple and removing multiple data directories
> from the config altogether and just documenting that you should plan on using
> OS level mechanisms for growing diskspace and io.

I think that is a pretty sane suggestion actually.

-Jonathan

Re: Effective allocation of multiple disks

Posted by Anthony Molinaro <an...@alumni.caltech.edu>.

Except major compactions are not that rare if you have a cluster which you
need to add capacity to.  Anytime you add nodes with bootstrap, it is
recommended that you run cleanup on the nodes you removed data from (this is
also useful to see how much space you are now using).  Cleanup does a major
compaction, and if you happen to have one disk larger than the others, most of
the data ends up there.  (This happened to me when I added some EBS volumes to
some EC2 nodes: I distributed the sstables and everything was cool, I had more
io, things were great, then I needed to add another node, did a cleanup, and
boom, everything was on one disk and io sucked again.)  I also don't quite
know what happens when a major compaction occurs which would combine sstables
and fill up the largest disk?

However, after discussion I completely understand why things were done this
way: it's difficult to manage the space, and really it should be relegated to
the disk subsystem of the OS (i.e., RAID0, JBOD, LVM, etc.).  I would almost
recommend just keeping things simple and removing multiple data directories
from the config altogether and just documenting that you should plan on using
OS level mechanisms for growing diskspace and io.

-Anthony

On Wed, Mar 10, 2010 at 04:43:36PM -0600, Stu Hood wrote:
> Yea, I suppose major compactions are the wildcard here. Nonetheless, the situation where you only have 1 SSTable should be very rare.
> 
> I'll open a ticket though, because we really ought to be able to utilize those disks more thoroughly, and I have some ideas there.
> 
> 
> -----Original Message-----
> From: "Anthony Molinaro" <an...@alumni.caltech.edu>
> Sent: Wednesday, March 10, 2010 3:38pm
> To: cassandra-user@incubator.apache.org
> Subject: Re: Effective allocation of multiple disks
> 
> This is incorrect, as discussed a few weeks ago.  I have a setup with multiple
> disks, and as soon as compaction occurs all the data ends up on one disk.  If
> you need the additional io, you will want raid0.  But simply listing multiple
> DataFileDirectories will not work.
> 
> -Anthony
> 
> On Wed, Mar 10, 2010 at 02:08:13AM -0600, Stu Hood wrote:
> > You can list multiple DataFileDirectories, and Cassandra will scatter files across all of them. Use 1 disk for the commitlog, and 3 disks for data directories.
> > 
> > See http://wiki.apache.org/cassandra/CassandraHardware#Disk
> > 
> > Thanks,
> > Stu
> > 
> > -----Original Message-----
> > From: "Eric Rosenberry" <ep...@gmail.com>
> > Sent: Wednesday, March 10, 2010 2:00am
> > To: cassandra-user@incubator.apache.org
> > Subject: Effective allocation of multiple disks
> > 
> > Based on the documentation, it is clear that with Cassandra you want to have
> > one disk for commitlog, and one disk for data.
> > 
> > My question is: If you think your workload is going to require more io
> > performance to the data disks than a single disk can handle, how would you
> > recommend effectively utilizing additional disks?
> > 
> > It would seem a number of vendors sell 1U boxes with four 3.5 inch disks.
> >  If we use one for commitlog, is there a way to have Cassandra itself
> > equally split data across the three remaining disks?  Or is this something
> > that needs to be handled by the hardware level, or operating system/file
> > system level?
> > 
> > Options include a hardware RAID controller in a RAID 0 stripe (this is more
> > $$$ and for what gain?), or utilizing a volume manager like LVM.
> > 
> > Along those same lines, if you do implement some type of striping, what RAID
> > stripe size is recommended?  (I think Todd Burruss asked this earlier but I
> > did not see a response)
> > 
> > Thanks for any input!
> > 
> > -Eric
> > 
> > 
> 
> -- 
> ------------------------------------------------------------------------
> Anthony Molinaro                           <an...@alumni.caltech.edu>
> 
> 

-- 
------------------------------------------------------------------------
Anthony Molinaro                           <an...@alumni.caltech.edu>

Re: Effective allocation of multiple disks

Posted by Stu Hood <st...@rackspace.com>.

Yea, I suppose major compactions are the wildcard here. Nonetheless, the situation where you only have 1 SSTable should be very rare.

I'll open a ticket though, because we really ought to be able to utilize those disks more thoroughly, and I have some ideas there.


-----Original Message-----
From: "Anthony Molinaro" <an...@alumni.caltech.edu>
Sent: Wednesday, March 10, 2010 3:38pm
To: cassandra-user@incubator.apache.org
Subject: Re: Effective allocation of multiple disks

This is incorrect, as discussed a few weeks ago.  I have a setup with multiple
disks, and as soon as compaction occurs all the data ends up on one disk.  If
you need the additional io, you will want raid0.  But simply listing multiple
DataFileDirectories will not work.

-Anthony

On Wed, Mar 10, 2010 at 02:08:13AM -0600, Stu Hood wrote:
> You can list multiple DataFileDirectories, and Cassandra will scatter files across all of them. Use 1 disk for the commitlog, and 3 disks for data directories.
> 
> See http://wiki.apache.org/cassandra/CassandraHardware#Disk
> 
> Thanks,
> Stu
> 
> -----Original Message-----
> From: "Eric Rosenberry" <ep...@gmail.com>
> Sent: Wednesday, March 10, 2010 2:00am
> To: cassandra-user@incubator.apache.org
> Subject: Effective allocation of multiple disks
> 
> Based on the documentation, it is clear that with Cassandra you want to have
> one disk for commitlog, and one disk for data.
> 
> My question is: If you think your workload is going to require more io
> performance to the data disks than a single disk can handle, how would you
> recommend effectively utilizing additional disks?
> 
> It would seem a number of vendors sell 1U boxes with four 3.5 inch disks.
>  If we use one for commitlog, is there a way to have Cassandra itself
> equally split data across the three remaining disks?  Or is this something
> that needs to be handled by the hardware level, or operating system/file
> system level?
> 
> Options include a hardware RAID controller in a RAID 0 stripe (this is more
> $$$ and for what gain?), or utilizing a volume manager like LVM.
> 
> Along those same lines, if you do implement some type of striping, what RAID
> stripe size is recommended?  (I think Todd Burruss asked this earlier but I
> did not see a response)
> 
> Thanks for any input!
> 
> -Eric
> 
> 

-- 
------------------------------------------------------------------------
Anthony Molinaro                           <an...@alumni.caltech.edu>

Re: Effective allocation of multiple disks

Posted by Anthony Molinaro <an...@alumni.caltech.edu>.

This is incorrect, as discussed a few weeks ago.  I have a setup with multiple
disks, and as soon as compaction occurs all the data ends up on one disk.  If
you need the additional io, you will want raid0.  But simply listing multiple
DataFileDirectories will not work.

-Anthony

On Wed, Mar 10, 2010 at 02:08:13AM -0600, Stu Hood wrote:
> You can list multiple DataFileDirectories, and Cassandra will scatter files across all of them. Use 1 disk for the commitlog, and 3 disks for data directories.
> 
> See http://wiki.apache.org/cassandra/CassandraHardware#Disk
> 
> Thanks,
> Stu
> 
> -----Original Message-----
> From: "Eric Rosenberry" <ep...@gmail.com>
> Sent: Wednesday, March 10, 2010 2:00am
> To: cassandra-user@incubator.apache.org
> Subject: Effective allocation of multiple disks
> 
> Based on the documentation, it is clear that with Cassandra you want to have
> one disk for commitlog, and one disk for data.
> 
> My question is: If you think your workload is going to require more io
> performance to the data disks than a single disk can handle, how would you
> recommend effectively utilizing additional disks?
> 
> It would seem a number of vendors sell 1U boxes with four 3.5 inch disks.
>  If we use one for commitlog, is there a way to have Cassandra itself
> equally split data across the three remaining disks?  Or is this something
> that needs to be handled by the hardware level, or operating system/file
> system level?
> 
> Options include a hardware RAID controller in a RAID 0 stripe (this is more
> $$$ and for what gain?), or utilizing a volume manager like LVM.
> 
> Along those same lines, if you do implement some type of striping, what RAID
> stripe size is recommended?  (I think Todd Burruss asked this earlier but I
> did not see a response)
> 
> Thanks for any input!
> 
> -Eric
> 
> 

-- 
------------------------------------------------------------------------
Anthony Molinaro                           <an...@alumni.caltech.edu>

Re: Effective allocation of multiple disks

Posted by Eric Rosenberry <ep...@gmail.com>.

Ahh, thanks!  I had read that, but I had assumed the reference to "use one
or more devices for DataFileDirectories" was referring to somehow making
multiple physical devices into one logical device via some underlying RAID
system.

So then as far as free space on the disks goes, I have seen references to
keeping utilization below 50% to handle compaction.  Would it not be true to
say that you only need as much free space as it takes to hold another copy of
the largest data file you have?  (i.e. perhaps less than 50% of the disk)

Due to the compaction space requirement, would it be more efficient to do
RAID 0 somewhere under the hood?

Just simply being able to specify multiple DataFileDirectories does
indeed sound appealing...

Thanks.

-Eric

On Wed, Mar 10, 2010 at 12:08 AM, Stu Hood <st...@rackspace.com> wrote:

> You can list multiple DataFileDirectories, and Cassandra will scatter files
> across all of them. Use 1 disk for the commitlog, and 3 disks for data
> directories.
>
> See http://wiki.apache.org/cassandra/CassandraHardware#Disk
>
> Thanks,
> Stu
>
> -----Original Message-----
> From: "Eric Rosenberry" <ep...@gmail.com>
> Sent: Wednesday, March 10, 2010 2:00am
> To: cassandra-user@incubator.apache.org
> Subject: Effective allocation of multiple disks
>
> Based on the documentation, it is clear that with Cassandra you want to
> have
> one disk for commitlog, and one disk for data.
>
> My question is: If you think your workload is going to require more io
> performance to the data disks than a single disk can handle, how would you
> recommend effectively utilizing additional disks?
>
> It would seem a number of vendors sell 1U boxes with four 3.5 inch disks.
>  If we use one for commitlog, is there a way to have Cassandra itself
> equally split data across the three remaining disks?  Or is this something
> that needs to be handled by the hardware level, or operating system/file
> system level?
>
> Options include a hardware RAID controller in a RAID 0 stripe (this is more
> $$$ and for what gain?), or utilizing a volume manager like LVM.
>
> Along those same lines, if you do implement some type of striping, what
> RAID
> stripe size is recommended?  (I think Todd Burruss asked this earlier but I
> did not see a response)
>
> Thanks for any input!
>
> -Eric
>
>
>

RE: Effective allocation of multiple disks

Posted by Stu Hood <st...@rackspace.com>.

You can list multiple DataFileDirectories, and Cassandra will scatter files across all of them. Use 1 disk for the commitlog, and 3 disks for data directories.

See http://wiki.apache.org/cassandra/CassandraHardware#Disk

Thanks,
Stu

-----Original Message-----
From: "Eric Rosenberry" <ep...@gmail.com>
Sent: Wednesday, March 10, 2010 2:00am
To: cassandra-user@incubator.apache.org
Subject: Effective allocation of multiple disks

Based on the documentation, it is clear that with Cassandra you want to have
one disk for commitlog, and one disk for data.

My question is: If you think your workload is going to require more io
performance to the data disks than a single disk can handle, how would you
recommend effectively utilizing additional disks?

It would seem a number of vendors sell 1U boxes with four 3.5 inch disks.
 If we use one for commitlog, is there a way to have Cassandra itself
equally split data across the three remaining disks?  Or is this something
that needs to be handled by the hardware level, or operating system/file
system level?

Options include a hardware RAID controller in a RAID 0 stripe (this is more
$$$ and for what gain?), or utilizing a volume manager like LVM.

Along those same lines, if you do implement some type of striping, what RAID
stripe size is recommended?  (I think Todd Burruss asked this earlier but I
did not see a response)

Thanks for any input!

-Eric