Posted to commits@cassandra.apache.org by Apache Wiki <wi...@apache.org> on 2010/01/13 16:42:24 UTC

[Cassandra Wiki] Update of "CassandraHardware" by JonathanEllis

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for change notification.

The "CassandraHardware" page has been changed by JonathanEllis.
http://wiki.apache.org/cassandra/CassandraHardware?action=diff&rev1=4&rev2=5

--------------------------------------------------

Cassandra assumes that all nodes have equal capacity. Violating this assumption will lead to poor performance. Rather than keeping your hardware price fixed and adding increasingly powerful machines as Moore's law kicks in, keep capacity (relatively) fixed and add increasingly inexpensive ones. (This is easier if you start with relatively powerful machines.)

=== Memory ===
- The most recently written data resides in memory tables (aka [[MemtableThresholds|memtables]]), but older data that has been flushed to disk can be kept in the OS's file-system cache. In other words, ''the more memory, the better'', with 1GB being the minimum recommended in a virtualized environment. With dedicated hardware there is no reason to use less than 4GB, and at the high end, you see clusters with 16 or 32 GB.
+ The most recently written data resides in memory tables (aka [[MemtableThresholds|memtables]]), but older data that has been flushed to disk can be kept in the OS's file-system cache. In other words, ''the more memory, the better'', with 1GB being the minimum we typically recommend in a virtualized environment. With dedicated hardware there is no reason to use less than 4GB, and at the high end, you see clusters with 16 or 32 GB (to handle data sets of multiple TB per machine).

=== CPU ===
Many workloads will actually be CPU-bound in Cassandra before being memory-bound. Cassandra is highly concurrent and will make good use of however many cores you can give it. For high-end clusters, quad- or 8-core boxes are good. If you're running on virtualized machines, consider using a provider such as Rackspace Cloud Servers that allows CPU bursting.

=== Disk ===
- The short answer here is, ''at least 2 disks'', one to keep your `CommitLogDirectory` on, the other to use in `DataFileDirectories`. The exact answer though depends a lot on your usage so it's important to understand what is going on here.
+ The short answer here is that ideally you will have at least 2 disks, one to keep your `CommitLogDirectory` on, the other to use in `DataFileDirectories`. The exact answer though depends a lot on your usage so it's important to understand what is going on here.

Cassandra persists data to disk for two very different purposes. The first is when a new write is made, so that it can be replayed after a crash or system shutdown. The second is when thresholds are exceeded and memtables are flushed to disk as SSTables.

- Commit logs receive every write made to a Cassandra node and have the potential to block client operations, but they are only ever read on node start-up. SSTables writes on the other hand occur asynchronously, but are read to satisfy client look-ups. SSTables are also periodically merged and rewritten in a process called ''compaction''. Another important distinction is that commit logs are purged after the corresponding data has been flushed to disk as an SSTable, so `CommitLogDirectory` only holds uncommitted data while the directories in `DataFileDirectories` store all of the data written to a node.
+ Commit logs receive every write made to a Cassandra node and have the potential to block client operations, but they are only ever read on node start-up. SSTable (data file) writes, on the other hand, occur asynchronously, but SSTables are read to satisfy client look-ups. SSTables are also periodically merged and rewritten in a process called ''compaction''. Another important difference between commit logs and SSTables is that commit logs are purged after the corresponding data has been flushed to disk as an SSTable, so `CommitLogDirectory` only holds uncommitted data while the directories in `DataFileDirectories` store all of the data written to a node.
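
The write path described above can be sketched as a toy in-memory model. This is only an illustration, not Cassandra's implementation: the `ToyNode` class, its method names, and the entry-count threshold are all hypothetical, and the flush here runs synchronously for simplicity whereas real SSTable writes are asynchronous.

```python
class ToyNode:
    """Toy model of the commit log / memtable / SSTable write path.

    Hypothetical illustration only; names and threshold are assumptions.
    """

    def __init__(self, memtable_threshold=3):
        self.memtable_threshold = memtable_threshold  # assumed: flush after N keys
        self.commit_log = []  # stands in for CommitLogDirectory
        self.memtable = {}    # most recently written data, held in memory
        self.sstables = []    # stands in for DataFileDirectories

    def write(self, key, value):
        # Every write is appended to the commit log so it can be replayed
        # after a crash, then applied to the memtable.
        self.commit_log.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_threshold:
            self.flush()

    def flush(self):
        # The memtable is written out as a sorted SSTable...
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}
        # ...after which the corresponding commit log entries are purged.
        self.commit_log = []

    def read(self, key):
        # Reads consult the memtable first, then SSTables from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):
            if key in sstable:
                return sstable[key]
        return None

    def compact(self):
        # Compaction merges all SSTables into one; newer values win.
        merged = {}
        for sstable in self.sstables:  # oldest first
            merged.update(sstable)
        self.sstables = [dict(sorted(merged.items()))]
```

Note how the commit log in this model only ever holds writes that have not yet been flushed, while the SSTable list accumulates everything written to the node.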

- So to summarize, use a different device for your `CommitLogDirectory`; it needn't be large, but it should be fast enough to receive all of your writes. Then, use one or more devices for `DataFileDirectories` and make sure they are both large enough to house all of your data, and fast enough to satisfy your reads and to keep up with flushing and compaction.
+ So to summarize, if you use a different device for your `CommitLogDirectory`, it needn't be large, but it should be fast enough to receive all of your writes. Then, use one or more devices for `DataFileDirectories` and make sure they are both large enough to house all of your data, and fast enough to satisfy your reads and to keep up with flushing and compaction.
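
As a rough sanity check of the layout summarized above, the following sketch compares the device IDs that the operating system reports for each directory. The helper name `on_separate_devices` is hypothetical, and the check is only a necessary condition: two partitions on the same physical disk report different device IDs, so a `True` result does not prove the directories sit on separate spindles.

```python
import os


def on_separate_devices(commit_log_dir, data_dirs):
    """Return True if commit_log_dir is on a different device than every data dir.

    Hypothetical helper: compares os.stat().st_dev values, which identify the
    filesystem device, not necessarily the physical disk.
    """
    commit_dev = os.stat(commit_log_dir).st_dev
    return all(os.stat(d).st_dev != commit_dev for d in data_dirs)
```

You would call it with the paths you configured for `CommitLogDirectory` and `DataFileDirectories`; a `False` result means at least one data directory shares a device with the commit log.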