Posted to user@cassandra.apache.org by Casey Deccio <ca...@deccio.net> on 2012/09/14 21:39:34 UTC

Disk configuration in new cluster node

I'm building a new "cluster" (to replace the broken setup I've written
about in previous posts) that will consist of only two nodes.  I understand
that I'll be sacrificing high availability of writes if one of the nodes
goes down, and I'm okay with that.  I'm more interested in maintaining high
consistency and high read availability.  So I've decided to use a
write-level consistency of ALL and read-level consistency of ONE.
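
For illustration only, here is how those consistency levels can be set per session in a recent cqlsh (2012-era clients would set them through their Thrift/CQL driver instead); the keyspace and table names are placeholders and assume an existing schema:

    -- cqlsh sketch; ks.events is a hypothetical table
    CONSISTENCY ALL;   -- every replica must acknowledge the following write
    INSERT INTO ks.events (id, payload) VALUES (42, 'example');
    CONSISTENCY ONE;   -- a single replica reply satisfies the following read
    SELECT payload FROM ks.events WHERE id = 42;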

My first question is about the drives in this setup.  If I initially set up
the system with, say, 4 drives for data and 1 drive for commitlog, and
later I decide to add more capacity to the node by adding more drives for
data (adding the new data directory entries in cassandra.yaml), will the
node balance out the load on the drives, or is it agnostic to usage of
drives underlying data directories?
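
For reference, the cassandra.yaml entries in question look roughly like this (the paths are placeholders, not recommendations); each extra drive gets its own entry under data_file_directories:

    # cassandra.yaml excerpt -- illustrative paths only
    data_file_directories:
        - /mnt/data1/cassandra
        - /mnt/data2/cassandra
        - /mnt/data3/cassandra
        - /mnt/data4/cassandra
    commitlog_directory: /mnt/commitlog/cassandra

(As noted later in the thread, pre-1.2 Cassandra does not rebalance existing sstables across a newly added directory.)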

My second question has to do with RAID striping.  Would it be more useful
to stripe the disk with the commitlog or the disks with the data?  Of
course, with a single striped volume for data directories, it would be more
difficult to add capacity to the node later, as I've suggested above.

Casey

Re: Disk configuration in new cluster node

Posted by Aaron Turner <sy...@gmail.com>.
On Fri, Sep 21, 2012 at 2:05 AM, aaron morton <aa...@thelastpickle.com> wrote:
>> Would it help if I partitioned the computing resources of my physical
>> machines into VMs?
>
> No.
> Just like cutting a cake into smaller pieces does not mean you can eat more
> without getting fat.
>
> In the general case (regular HDDs, 1 GbE, 8 to 16 virtual cores and 8GB
> to 16GB of RAM), you can expect to comfortably run up to 400GB of data
> (maybe 500GB). That is replicated storage, so 400 / 3 ≈ 133GB if you
> replicate data 3 times.

Remember also that these numbers reflect the total size of your sstables.
This is both good and bad:

1. Good, because if you use compression you can store more data.  I'm
doing time series data for network statistics and I'm seeing extremely
good compression numbers (better than 10:1); a sketch of enabling
compression follows this list.

2. Bad, because if you're doing a lot of deletes, the old data +
tombstones count against you until they're actually purged from disk.
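
As a rough sketch of the first point, compression is configured per column family; in CQL 3 it looks something like this (the keyspace/table names and chunk size are placeholders, and the option names shown are the 1.x-era ones):

    -- CQL 3 sketch; metrics.raw_counters is a hypothetical table
    ALTER TABLE metrics.raw_counters
      WITH compression = { 'sstable_compression' : 'SnappyCompressor',
                           'chunk_length_kb' : 64 };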

This can create rather interesting disk usage situations where my
"rolling 48 hours" of current data CF takes significantly more disk
space than my historical CF, which currently stores over 4 months'
worth of data.  I'm thinking about repairing the rolling-48-hours CF
more often and reducing the gc_grace time so that compaction has a
better chance of removing stale data from disk.
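
A hedged sketch of that plan (keyspace/CF names are placeholders; gc_grace_seconds has to stay longer than the interval between successful repairs, or deleted data can come back):

    -- CQL 3: lower the tombstone grace period on the rolling CF
    -- (the default is 864000 seconds, i.e. 10 days)
    ALTER TABLE metrics.rolling_48h WITH gc_grace_seconds = 86400;

    # shell: then repair that CF more often than gc_grace_seconds, e.g. daily
    nodetool repair -pr metrics rolling_48h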


-- 
Aaron Turner
http://synfin.net/         Twitter: @synfinatic
http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix & Windows
Those who would give up essential Liberty, to purchase a little temporary
Safety, deserve neither Liberty nor Safety.
    -- Benjamin Franklin
"carpe diem quam minimum credula postero"

Re: Disk configuration in new cluster node

Posted by aaron morton <aa...@thelastpickle.com>.
> Would it help if I partitioned the computing resources of my physical machines into VMs? 
No. 
Just like cutting a cake into smaller pieces does not mean you can eat more without getting fat.

In the general case (regular HDDs, 1 GbE, 8 to 16 virtual cores and 8GB to 16GB of RAM), you can expect to comfortably run up to 400GB of data (maybe 500GB). That is replicated storage, so 400 / 3 ≈ 133GB if you replicate data 3 times. 
  
Hope that helps. 

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 19/09/2012, at 3:42 PM, Віталій Тимчишин <ti...@gmail.com> wrote:

> Network also matters. It would take a lot of time to send 6TB over a 1Gb link, even fully saturating it. IMHO you can try 10Gb, but you will need to raise your streaming/compaction limits a lot.
> You will also need to ensure that your compaction can keep up. It is often done in one thread, and I am not sure whether that will be enough for you. As for parallel compaction, I don't know its exact limitations or whether it will work in your case.
> 
> 2012/9/18 Casey Deccio <ca...@deccio.net>
> On Tue, Sep 18, 2012 at 1:54 AM, aaron morton <aa...@thelastpickle.com> wrote:
>> each with several disks having large capacity, totaling 10 - 12 TB.  Is this (another) bad idea?
> 
> Yes. Very bad. 
> If you had 6TB on an average system with spinning disks, you would measure the duration of repairs and compactions in days. 
> 
> If you want to store 12 TB of data you will need more machines. 
>  
> 
> Would it help if I partitioned the computing resources of my physical machines into VMs?  For example, I put four VMs on each of three physical machines, each with a dedicated 2TB drive.  I can now have four tokens in the ring and an RF of 3.  And of course, I can arrange them in a way that makes the most sense.  Is this getting any better, or am I missing the point?
> 
> Casey
> 
> 
> 
> -- 
> Best regards,
>  Vitalii Tymchyshyn


Re: Disk configuration in new cluster node

Posted by Віталій Тимчишин <ti...@gmail.com>.
Network also matters. It would take a lot of time to send 6TB over a 1Gb
link, even fully saturating it. IMHO you can try 10Gb, but you will need
to raise your streaming/compaction limits a lot.
You will also need to ensure that your compaction can keep up. It is often
done in one thread, and I am not sure whether that will be enough for you.
As for parallel compaction, I don't know its exact limitations or whether
it will work in your case.
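
For what it's worth, the knobs involved live in cassandra.yaml, and the two throughput caps can also be changed on a running node with nodetool; the values below are only placeholders to show the shape, not recommendations:

    # cassandra.yaml excerpt -- illustrative values only
    compaction_throughput_mb_per_sec: 64              # 0 disables compaction throttling
    stream_throughput_outbound_megabits_per_sec: 400
    concurrent_compactors: 4                          # more compaction threads if it falls behind

    # shell: adjust the caps on a live node
    nodetool setcompactionthroughput 64
    nodetool setstreamthroughput 400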

2012/9/18 Casey Deccio <ca...@deccio.net>

> On Tue, Sep 18, 2012 at 1:54 AM, aaron morton <aa...@thelastpickle.com> wrote:
>
>> each with several disks having large capacity, totaling 10 - 12 TB.  Is
>> this (another) bad idea?
>>
>> Yes. Very bad.
>> If you had 6TB on an average system with spinning disks, you would
>> measure the duration of repairs and compactions in days.
>>
>> If you want to store 12 TB of data you will need more machines.
>>
>>
>
> Would it help if I partitioned the computing resources of my physical
> machines into VMs?  For example, I put four VMs on each of three physical
> machines, each with a dedicated 2TB drive.  I can now have four tokens in
> the ring and an RF of 3.  And of course, I can arrange them in a way that
> makes the most sense.  Is this getting any better, or am I missing the
> point?
>
> Casey
>



-- 
Best regards,
 Vitalii Tymchyshyn

Re: Disk configuration in new cluster node

Posted by Casey Deccio <ca...@deccio.net>.
On Tue, Sep 18, 2012 at 1:54 AM, aaron morton <aa...@thelastpickle.com> wrote:

> each with several disks having large capacity, totaling 10 - 12 TB.  Is
> this (another) bad idea?
>
> Yes. Very bad.
> If you had 6TB on an average system with spinning disks, you would
> measure the duration of repairs and compactions in days.
>
> If you want to store 12 TB of data you will need more machines.
>
>

Would it help if I partitioned the computing resources of my physical
machines into VMs?  For example, I put four VMs on each of three physical
machines, each with a dedicated 2TB drive.  I can now have four tokens in
the ring and an RF of 3.  And of course, I can arrange them in a way that
makes the most sense.  Is this getting any better, or am I missing the
point?

Casey

Re: Disk configuration in new cluster node

Posted by aaron morton <aa...@thelastpickle.com>.
> Given the advice to use a single RAID 0 volume, I think that's what I'll do.  By system mirror, you are referring to the volume on which the OS is installed? 
Yes. 
I was thinking about a simple RAID 1 OS volume and RAID 0 data volume setup, with the commit log on the OS volume so it does not compete with Cassandra for IOPS.  
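
As an illustration of that layout (device names are placeholders, and a hardware controller would do the same job as mdadm):

    # RAID 0 stripe across the data disks -- /dev/sdc..sdf are placeholders
    mdadm --create /dev/md1 --level=0 --raid-devices=4 /dev/sdc /dev/sdd /dev/sde /dev/sdf
    mkfs.ext4 /dev/md1
    mount /dev/md1 /var/lib/cassandra/data

    # cassandra.yaml then keeps the commit log on the mirrored OS volume,
    # since only .../data is mounted on the stripe:
    #   data_file_directories:
    #       - /var/lib/cassandra/data
    #   commitlog_directory: /var/lib/cassandra/commitlog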
 
> sense in my case to build or maintain a large cluster.  I wanted to run a two-node setup (RF=1, RCL=ONE, WCL=ALL),
You would be taking on the operational concerns of running Cassandra without any of the payoff of having a cluster. 

> each with several disks having large capacity, totaling 10 - 12 TB.  Is this (another) bad idea?
Yes. Very bad. 
If you had 6TB on an average system with spinning disks, you would measure the duration of repairs and compactions in days. 

If you want to store 12 TB of data you will need more machines. 
 
Hope that helps. 

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 18/09/2012, at 3:53 AM, Casey Deccio <ca...@deccio.net> wrote:

> On Mon, Sep 17, 2012 at 1:19 AM, aaron morton <aa...@thelastpickle.com> wrote:
>>  4 drives for data and 1 drive for commitlog, 
> How are you configuring the drives? It's normally best to present one big data volume, e.g. using RAID 0, and put the commit log on, say, the system mirror.
> 
> 
> Given the advice to use a single RAID 0 volume, I think that's what I'll do.  By system mirror, you are referring to the volume on which the OS is installed?  Should the volume with the commit log also have multiple disks in a RAID 0 volume?  Alternatively, would a RAID 1 setup be reasonable for the system volume/OS, so the system itself can be resilient to disk failure, or would that kill commit performance?
> 
> Any preference to hardware RAID 0 vs. using something like mdadm?
> 
> A word of warning. If you put more than 300GB to 400GB per node you may end up experiencing some issues, such as repair, compaction or disaster recovery taking a long time. These are simply soft limits that provide a good rule of thumb for HDD-based systems with 1 GigE networking.
> 
> Hmm.  My hope was to be able to run a minimal number of nodes and maximize their capacity because it doesn't make sense in my case to build or maintain a large cluster.  I wanted to run a two-node setup (RF=1, RCL=ONE, WCL=ALL), each with several disks having large capacity, totaling 10 - 12 TB.  Is this (another) bad idea?
> 
> Casey


Re: Disk configuration in new cluster node

Posted by Casey Deccio <ca...@deccio.net>.
On Mon, Sep 17, 2012 at 1:19 AM, aaron morton <aa...@thelastpickle.com> wrote:

>  4 drives for data and 1 drive for commitlog,
>
> How are you configuring the drives? It's normally best to present one big
> data volume, e.g. using RAID 0, and put the commit log on, say, the system
> mirror.
>
>
Given the advice to use a single RAID 0 volume, I think that's what I'll
do.  By system mirror, you are referring to the volume on which the OS is
installed?  Should the volume with the commit log also have multiple disks
in a RAID 0 volume?  Alternatively, would a RAID 1 setup be reasonable for
the system volume/OS, so the system itself can be resilient to disk
failure, or would that kill commit performance?

Any preference to hardware RAID 0 vs. using something like mdadm?

> A word of warning. If you put more than 300GB to 400GB per node you may
> end up experiencing some issues, such as repair, compaction or disaster
> recovery taking a long time. These are simply soft limits that provide a
> good rule of thumb for HDD-based systems with 1 GigE networking.
>

Hmm.  My hope was to be able to run a minimal number of nodes and maximize
their capacity because it doesn't make sense in my case to build or
maintain a large cluster.  I wanted to run a two-node setup (RF=1, RCL=ONE,
WCL=ALL), each with several disks having large capacity, totaling 10 - 12
TB.  Is this (another) bad idea?

Casey

Re: Disk configuration in new cluster node

Posted by Robin Verlangen <ro...@us2.nl>.
" A word of warning. If you put more than 300GB to 400GB per node you may
end experience some issues  ... "

I think this is probably the "solution" to your multiple-disk problem. You
could easily use one single disk to store the data on, and one disk for the
commitlog. No issues with JBOD, RAID or whatever. If you want to improve
throughput you might consider a RAID 0 setup.

Best regards,

Robin Verlangen
Software engineer
W http://www.robinverlangen.nl
E robin@us2.nl

Disclaimer: The information contained in this message and attachments is
intended solely for the attention and use of the named addressee and may be
confidential. If you are not the intended recipient, you are reminded that
the information remains the property of the sender. You must not use,
disclose, distribute, copy, print or rely on this e-mail. If you have
received this message in error, please contact the sender immediately and
irrevocably delete this message and any copies.



2012/9/17 aaron morton <aa...@thelastpickle.com>

>  4 drives for data and 1 drive for commitlog,
>
> How are you configuring the drives? It's normally best to present one big
> data volume, e.g. using RAID 0, and put the commit log on, say, the system
> mirror.
>
> will the node balance out the load on the drives, or is it agnostic to
> usage of drives underlying data directories?
>
> It will not.
> There is a feature coming in v1.2 to add better support for JBOD
> configurations.
>
> A word of warning. If you put more than 300GB to 400GB per node you may
> end up experiencing some issues, such as repair, compaction or disaster
> recovery taking a long time. These are simply soft limits that provide a
> good rule of thumb for HDD-based systems with 1 GigE networking.
>
> Hope that helps.
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 15/09/2012, at 7:39 AM, Casey Deccio <ca...@deccio.net> wrote:
>
> I'm building a new "cluster" (to replace the broken setup I've written
> about in previous posts) that will consist of only two nodes.  I understand
> that I'll be sacrificing high availability of writes if one of the nodes
> goes down, and I'm okay with that.  I'm more interested in maintaining high
> consistency and high read availability.  So I've decided to use a
> write-level consistency of ALL and read-level consistency of ONE.
>
> My first question is about the drives in this setup.  If I initially set
> up the system with, say, 4 drives for data and 1 drive for commitlog, and
> later I decide to add more capacity to the node by adding more drives for
> data (adding the new data directory entries in cassandra.yaml), will the
> node balance out the load on the drives, or is it agnostic to usage of
> drives underlying data directories?
>
> My second question has to do with RAID striping.  Would it be more useful
> to stripe the disk with the commitlog or the disks with the data?  Of
> course, with a single striped volume for data directories, it would be more
> difficult to add capacity to the node later, as I've suggested above.
>
> Casey
>
>
>

Re: Disk configuration in new cluster node

Posted by aaron morton <aa...@thelastpickle.com>.
>  4 drives for data and 1 drive for commitlog, 
How are you configuring the drives? It's normally best to present one big data volume, e.g. using RAID 0, and put the commit log on, say, the system mirror.

> will the node balance out the load on the drives, or is it agnostic to usage of drives underlying data directories?
It will not. 
There is a feature coming in v1.2 to add better support for JBOD configurations. 
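
If that refers to the 1.2 JBOD work, one piece of it that shows up in cassandra.yaml is a disk failure policy; a rough sketch (1.2+ only):

    # cassandra.yaml (1.2+) -- how a node reacts when one JBOD disk fails
    disk_failure_policy: best_effort    # alternatives: stop, ignore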

A word of warning. If you put more than 300GB to 400GB per node you may end up experiencing some issues, such as repair, compaction or disaster recovery taking a long time. These are simply soft limits that provide a good rule of thumb for HDD-based systems with 1 GigE networking.

Hope that helps. 
-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 15/09/2012, at 7:39 AM, Casey Deccio <ca...@deccio.net> wrote:

> I'm building a new "cluster" (to replace the broken setup I've written about in previous posts) that will consist of only two nodes.  I understand that I'll be sacrificing high availability of writes if one of the nodes goes down, and I'm okay with that.  I'm more interested in maintaining high consistency and high read availability.  So I've decided to use a write-level consistency of ALL and read-level consistency of ONE.
> 
> My first question is about the drives in this setup.  If I initially set up the system with, say, 4 drives for data and 1 drive for commitlog, and later I decide to add more capacity to the node by adding more drives for data (adding the new data directory entries in cassandra.yaml), will the node balance out the load on the drives, or is it agnostic to usage of drives underlying data directories?
> 
> My second question has to do with RAID striping.  Would it be more useful to stripe the disk with the commitlog or the disks with the data?  Of course, with a single striped volume for data directories, it would be more difficult to add capacity to the node later, as I've suggested above.
> 
> Casey