Posted to user@cassandra.apache.org by "Jacob, Arun" <Ar...@disney.com> on 2011/08/24 20:54:14 UTC

Cassandra Node Requirements

I'm trying to settle on a node configuration for Cassandra. From what I've been able to gather from reading around:


 1.  we need to cap data size at 50% of total node storage capacity for compaction
 2.  with RF=3, that means I effectively have only 1/6th of total raw storage capacity.
 3.  SSDs are preferred, but of course reduce storage capacity.
 4.  using standard storage means you bump up your RAM to keep as much in memory as possible.

Right now we are looking at storage requirements of 42-60TB, assuming a baseline of 3TB/day and expiring data after 14-20 days (depending on use case). Based on the above, I would assume we need 252-360TB total storage max.
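For concreteness, here is that arithmetic as a small Python sketch (illustrative only; the 3TB/day baseline, RF=3, and 2x compaction headroom are the assumptions stated above):

    # Back-of-the-envelope raw capacity. Assumptions: 3 TB/day ingest,
    # 14-20 day retention, RF=3, 2x headroom for compaction.
    def raw_storage_tb(daily_tb=3.0, retention_days=14, rf=3, headroom=2.0):
        unique = daily_tb * retention_days      # live data before replication
        return unique * rf * headroom           # raw disk to provision

    for days in (14, 20):
        print(days, "days ->", raw_storage_tb(retention_days=days), "TB raw")
    # 14 days -> 252.0 TB raw, 20 days -> 360.0 TB raw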

My questions:

 1.  is 8TB (meaning 1.33TB of effective storage/node) a reasonable per-node storage size for Cassandra? I don't want to use SSDs due to their reduced storage capacity -- I don't want to buy hundreds of nodes just to make up for it. Given that I will be using standard drives, what is a reasonable/effective per-node storage capacity?
 2.  other than splitting the commit log onto a separate drive, is there any other drive allocation I should be doing?
 3.  Assuming I'm not using SSDs, what would be a good memory size for a node? I've heard anything from 32-48 GB, but need more guidance.

Anything else that anyone has run into? What are common configurations being used by others?

Thanks in advance,

-- Arun



Re: Cassandra Node Requirements

Posted by Philippe <wa...@gmail.com>.
>
> Sort of.  There's some fine print -- for instance, the 50% number only
> applies if you're manually forcing major compactions, which is not
> recommended -- but a bigger thing to know is that 1.0 will introduce
> "leveled compaction" [1], inspired by leveldb.  The free space
> requirement will then be a small number of megabytes.
>
> [1] https://issues.apache.org/jira/browse/CASSANDRA-1608

And in the meantime, plan for more storage: as I and others have reported in
other threads, repairs have caused disks to fill up (in my case, I think
it's because I had multiple repairs running at the same time).

Re: Cassandra Node Requirements

Posted by Jonathan Ellis <jb...@gmail.com>.
On Wed, Aug 24, 2011 at 1:54 PM, Jacob, Arun <Ar...@disney.com> wrote:
> we need to cap data size at 50% of total node storage capacity for
> compaction

Sort of.  There's some fine print -- for instance, the 50% number only
applies if you're manually forcing major compactions, which is not
recommended -- but a bigger thing to know is that 1.0 will introduce
"leveled compaction" [1], inspired by leveldb.  The free space
requirement will then be a small number of megabytes.

[1] https://issues.apache.org/jira/browse/CASSANDRA-1608
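To make that difference concrete, a rough sketch of the transient space each strategy needs (Python, illustrative only; the 5 MB sstable size and ~10 sstables per leveled compaction run are assumptions about the early leveled-compaction defaults, not confirmed numbers):

    # Worst-case extra disk needed while a compaction runs (ballpark only).
    def major_compaction_headroom_gb(largest_cf_gb):
        # a forced major compaction rewrites the whole column family
        return largest_cf_gb

    def leveled_compaction_headroom_gb(sstable_mb=5, sstables_per_run=10):
        # leveled compaction works on a handful of small, fixed-size sstables
        return sstable_mb * sstables_per_run / 1024.0

    print(major_compaction_headroom_gb(4000))    # ~4000 GB for a 4 TB CF
    print(leveled_compaction_headroom_gb())      # ~0.05 GB, i.e. tens of MB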

> SSDs are preferred, but of course reduce storage capacity.
> using standard storage means you bump up your RAM to keep as much in memory
> as possible.

You want to keep "as much in ram as possible" either way; whether you
need SSDs or not depends on whether that's adequate for your "hot"
working set.

> is 8TB (meaning 1.33TB of effective storage/node) a reasonable per-node
> storage size for Cassandra?

That's fine, but keep in mind that repairing that much data if you
lose a node could take a while.  Other things being equal, I'd prefer
more nodes with less capacity.
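For a rough sense of "a while": just re-streaming a half-full 8 TB node takes hours, before any validation or compaction work (Python sketch; the 1 Gbps link, 80% efficiency, and ~4 TB on-disk figure are assumptions):

    # Lower bound on rebuild time: time to stream the data back over the network.
    def restream_hours(data_tb, link_gbps=1.0, efficiency=0.8):
        bits = data_tb * 1e12 * 8
        return bits / (link_gbps * 1e9 * efficiency) / 3600

    print(round(restream_hours(4.0), 1))   # ~11.1 hours for ~4 TB of on-disk data

Real repairs also run validation compactions and compete with live traffic, so expect longer.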

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com

Re: Cassandra Node Requirements

Posted by "Jacob, Arun" <Ar...@disney.com>.
Thanks for the links and the answers. The vagueness of my initial questions reflects the fact that I'm trying to configure for a general case — I will clarify below:

I need to account for a variety of use cases.
(1) They will be both read and write heavy. I was assuming that SSDs would be a good fit for the heavy read load, but with the amount of data I need to store, SSDs aren't economical.
(2) I should have clarified: the main use case has 95% of writes going to a single column family. The other CFs will be much smaller than the primary CF, which is what the sizes below are based on. Given that, are my assumptions about storage correct for that use case?
(3) In that use case, the majority of reads will come from the most recently inserted 7% of the data. In other use cases, reads will be random. Another use case uses Solandra, and I am assuming it results in random reads.

Assuming 250-360TB of storage needed for the primary use case, I'm still trying to determine how many nodes I need to stand up to service that much data. What is a reasonable amount of storage per node? You mentioned a memory-to-storage ratio: I'm assuming that ratio trends higher the more random your reads are. Could you provide an example ratio for a heavy read use case?
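As a rough sanity check on that ratio with the numbers in this thread (Python sketch, illustrative; the ~4 TB on-disk per 8 TB-raw node and the 7% hot fraction come from the figures above, everything else is an assumption):

    # How much of the "hot" recent data fits in RAM/page cache per node.
    def hot_set_gb(on_disk_tb_per_node, hot_fraction=0.07):
        return on_disk_tb_per_node * 1024 * hot_fraction

    hot = hot_set_gb(4.0)                  # ~287 GB of recent data per node
    for ram_gb in (32, 48, 96):
        print("%d GB RAM covers ~%.0f%% of the hot set" % (ram_gb, 100 * ram_gb / hot))

Even 96 GB covers only about a third of the hot set at 8 TB raw per node, which is one argument for more, smaller nodes in the read-heavy case.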

Thanks,

-- Arun

From: Edward Capriolo <ed...@gmail.com>
Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Date: Wed, 24 Aug 2011 14:54:56 -0700
To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Subject: Re: Cassandra Node Requirements




Re: Cassandra Node Requirements

Posted by Edward Capriolo <ed...@gmail.com>.
On Wed, Aug 24, 2011 at 2:54 PM, Jacob, Arun <Ar...@disney.com> wrote:

> I'm trying to settle on a node configuration for Cassandra. From what I've
> been able to gather from reading around:
>
>
>    1. we need to cap data size at 50% of total node storage capacity for
>    compaction
>    2. with RF=3, that means I effectively have only 1/6th of total raw
>    storage capacity.
>    3. SSDs are preferred, but of course reduce storage capacity.
>    4. using standard storage means you bump up your RAM to keep as much in
>    memory as possible.
>
> Right now we are looking at storage requirements of 42-60TB, assuming a
> baseline of 3TB/day and expiring data after 14-20 days (depending on use
> case). Based on the above, I would assume we need 252-360TB total storage
> max.
>
> My questions:
>
>    1. is 8TB (meaning 1.33TB of effective storage/node) a reasonable
>    per-node storage size for Cassandra? I don't want to use SSDs due to
>    their reduced storage capacity -- I don't want to buy hundreds of nodes
>    just to make up for it. Given that I will be using standard drives, what
>    is a reasonable/effective per-node storage capacity?
>    2. other than splitting the commit log onto a separate drive, is there
>    any other drive allocation I should be doing?
>    3. Assuming I'm not using SSDs, what would be a good memory size for a
>    node? I've heard anything from 32-48 GB, but need more guidance.
>
>
> Anything else that anyone has run into? What are common configurations
> being used by others?
>
> Thanks in advance,
>
> -- Arun
>
>
>

I would suggest checking out:
http://wiki.apache.org/cassandra/CassandraHardware
http://wiki.apache.org/cassandra/LargeDataSetConsiderations
http://www.slideshare.net/edwardcapriolo/real-world-capacity

1. we need to cap data size at 50% of total node storage capacity for
compaction

False. You need 50% of the capacity of your largest column family free, plus
some room for other overhead. This changes all your numbers.
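In other words, how much headroom you need depends on how skewed your column families are; a quick sketch (Python, illustrative only):

    # Fraction of a node's disk to keep free if the rule is "largest CF's size".
    def headroom_fraction(largest_cf_share):
        # data D plus free space share*D => free fraction = share / (1 + share)
        return largest_cf_share / (1.0 + largest_cf_share)

    print(round(headroom_fraction(0.95), 2))  # 0.49: one dominant CF, close to the 50% rule
    print(round(headroom_fraction(0.25), 2))  # 0.2: four equal CFs, far less headroom

With one column family holding ~95% of the data (the primary use case above), this works out close to the original 50% rule; the saving shows up when data is spread across several CFs of similar size.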

3. SSDs are preferred, but of course reduce storage capacity.

Avoid generalizations. Many use cases may get little benefit from SSDs.

4. using standard storage means you bump up your RAM to keep as much in
memory as possible.

In most cases you want to maintain some RAM-to-disk ratio. SSD setups still
likely need sizable RAM.

Your three questions are hard to answer because the hardware you need is
workload dependent. It really depends on your active set: what percent of the
data is active at any given time. It also depends on your latency
requirements. If you are modeling something like the Wayback Machine, that has
a different usage profile than a stock ticker application, which is again
different from the usage patterns of an email system.

Generally people come to Cassandra because they are looking for low-latency
access to read and write data. That is hard to achieve with 8TB of disk per
node. The bloom filters and index files are themselves substantial with 8TB of
data. You will also need a large amount of RAM per node to minimize disk seeks
(or something like an SSD RAID-0 -- does that sound like a bad idea to you? It
does to me :)).
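For a feel of the per-row overhead, the standard Bloom filter sizing formula (Python sketch; the row count and the 1% false-positive rate are assumptions, and Cassandra's actual defaults at the time may differ):

    import math

    # Memory for a Bloom filter over `rows` keys at false-positive rate `fp`.
    def bloom_filter_gb(rows, fp=0.01):
        bits = -rows * math.log(fp) / (math.log(2) ** 2)
        return bits / 8 / 1e9

    print(round(bloom_filter_gb(4e9), 1))   # ~4.8 GB for 4 billion rows

And that is before the index samples, so small rows plus 8TB of disk translate into a lot of memory overhead.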

The only way to answer the question of how much hardware you need is with
load testing. The Yahoo Cloud Serving Benchmark (YCSB) can help you fill up a
node and test it with different load patterns to see how it performs.