Posted to user@cassandra.apache.org by AJ <aj...@dude.podzone.net> on 2011/06/08 09:29:51 UTC

Misc Performance Questions

Is there a performance hit when dropping a CF?  What if it contains .5 
TB of data?  If not, is there a quick and painless way to drop a large 
amount of data w/minimal perf hit?

Is there a performance hit running multiple keyspaces on a cluster 
versus only one keyspace given a constant total data size?  Is there 
some quantity limit?

Using a Random Partitioner, but with a RF = 1, will the rows still be 
spread-out evenly on the cluster or will there be an affinity to a 
single node (like the one receiving the data from the client)?

I see a lot of mention of using RAID-0, but not RAID-5/6.  Why?  Even 
though Cass can tolerate a down node due to data loss, it would still be 
more efficient to just rebuild a bad hdd live, right?

Maybe perf related:  Will there be a problem having multiple keyspaces 
on a cluster all with different replication factors, from 1-3?

Thanks!

Re: Misc Performance Questions

Posted by Richard Low <rl...@acunu.com>.

On Wed, Jun 8, 2011 at 12:30 PM, AJ <aj...@dude.podzone.net> wrote:

>> There is however a difference in running multiple column families
>> versus putting everything in the same column family and separating
>> them with e.g. a key prefix.  E.g. if you have a large data set and a
>> small one, it will be quicker to query the small one if it is in its
>> own column family.
>>
>
> I assumed that a read would be O(1) for any size CF since Cass is
> implemented with hashmaps.  Do you know why size matters?  (forgive the pun)
>

You may not notice a difference, but it can happen.

For a query, each SSTable is queried.  If there is more data then
there are (most likely) more SSTables to query, slowing it down.  For
point queries, this isn't so bad because the Bloom filters will help,
but for range queries you will notice a big difference, because more
seeks are needed to skip over unwanted data.

It will also help buffer caching to separate them - the small SSTables
are more likely to remain in cache.
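
To make the point/range difference concrete, here is a rough Python
sketch.  It is a toy model, not Cassandra's read path: the BloomFilter
and SSTable classes, the filter size, and the hash scheme are all
assumptions made up for illustration.  It only counts how many SSTables
a read has to touch.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter; bit count and hash scheme are illustrative only."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


class SSTable:
    """Hypothetical stand-in for one immutable, sorted on-disk file."""

    def __init__(self, rows):
        self.rows = dict(sorted(rows.items()))
        self.bloom = BloomFilter()
        for key in self.rows:
            self.bloom.add(key)


def point_read(sstables, key):
    """Return (values, sstables_touched) for a single key."""
    values, touched = [], 0
    for table in sstables:
        if not table.bloom.might_contain(key):
            continue  # skipped without touching the file
        touched += 1
        if key in table.rows:
            values.append(table.rows[key])
    return values, touched


def range_read(sstables, start, end):
    """Return (values, sstables_touched) for keys in [start, end]."""
    values, touched = [], 0
    for table in sstables:
        touched += 1  # a Bloom filter cannot rule out a key range
        values.extend(v for k, v in table.rows.items() if start <= k <= end)
    return values, touched
```

With ten large SSTables and one small one sharing a column family,
point_read for a key from the small data set skips nearly all of the
large files, while range_read has to visit all eleven.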

-- 
Richard Low
Acunu | http://www.acunu.com | @acunu

Re: Misc Performance Questions

Posted by AJ <aj...@dude.podzone.net>.

Thank you Richard!

On 6/8/2011 2:57 AM, Richard Low wrote:
<snip>
> There is however a difference in running multiple column families
> versus putting everything in the same column family and separating
> them with e.g. a key prefix.  E.g. if you have a large data set and a
> small one, it will be quicker to query the small one if it is in its
> own column family.
>

I assumed that a read would be O(1) for any size CF since Cass is 
implemented with hashmaps.  Do you know why size matters?  (forgive the pun)

Re: Misc Performance Questions

Posted by Richard Low <rl...@acunu.com>.

Hi AJ,

On Wed, Jun 8, 2011 at 9:29 AM, AJ <aj...@dude.podzone.net> wrote:

> Is there a performance hit when dropping a CF?  What if it contains .5 TB of
> data?  If not, is there a quick and painless way to drop a large amount of
> data w/minimal perf hit?

Dropping a CF is quick - it snapshots the files (which creates hard
links) and removes the CF definition.  To actually delete the data,
remove the snapshot files from your data directory.
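
As a sketch of why this is cheap, the Python below imitates the
hard-link step with the standard library.  It is not Cassandra code:
the directory layout and the file name MyCF-1-Data.db are hypothetical.
A hard link is just a second name for the same data on disk, so nothing
is copied, and the space is only freed once the last name is removed.

```python
import os
import tempfile


def snapshot_then_drop(data_dir, filename):
    """Hard-link a file into snapshots/ and then remove the live name."""
    snap_dir = os.path.join(data_dir, "snapshots")
    os.makedirs(snap_dir, exist_ok=True)
    live = os.path.join(data_dir, filename)
    snap = os.path.join(snap_dir, filename)
    os.link(live, snap)  # second name for the same data; no bytes copied
    os.remove(live)      # drops one name; the data is still on disk
    return snap


def demo():
    """Return (live_exists, snapshot_size, snapshot_exists_after_cleanup)."""
    with tempfile.TemporaryDirectory() as data_dir:
        name = "MyCF-1-Data.db"  # hypothetical SSTable file name
        live = os.path.join(data_dir, name)
        with open(live, "wb") as f:
            f.write(b"x" * 4096)
        snap = snapshot_then_drop(data_dir, name)
        size_in_snapshot = os.path.getsize(snap)
        os.remove(snap)  # this is the step that actually frees the space
        return os.path.exists(live), size_in_snapshot, os.path.exists(snap)
```

Calling demo() shows the live file gone right after the drop while the
full 4096 bytes are still reachable through the snapshot name, until
that name is removed as well.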

> Is there a performance hit running multiple keyspaces on a cluster versus
> only one keyspace given a constant total data size?  Is there some quantity
> limit?

There is a tiny amount of memory used per keyspace, but unless you
have very many keyspaces you won't notice any impact of running
multiple keyspaces.

There is however a difference in running multiple column families
versus putting everything in the same column family and separating
them with e.g. a key prefix.  E.g. if you have a large data set and a
small one, it will be quicker to query the small one if it is in its
own column family.

> Using a Random Partitioner, but with a RF = 1, will the rows still be
> spread-out evenly on the cluster or will there be an affinity to a single
> node (like the one receiving the data from the client)?

The rows will be spread out the same way - RF=1 doesn't affect the
load balancing.
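
A simplified Python sketch of the idea follows.  It assumes an
MD5-based token, evenly spaced node tokens, and replicas placed on the
next nodes around the ring; the function names are made up for
illustration and the token arithmetic is not Cassandra's exact
implementation.  The point is that the first replica is chosen from the
hash of the row key alone, so neither RF nor the node the client
happened to contact enters into it.

```python
import bisect
import hashlib
from collections import Counter

RING_SIZE = 2 ** 127  # assumed token space for this sketch


def token_for(key):
    """Map a row key to a ring position via MD5 (simplified)."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest, "big") % RING_SIZE


def replicas_for(key, node_tokens, rf):
    """Return indices of the rf nodes holding key; node_tokens is sorted."""
    # First node whose token is >= the key's token, wrapping past the end.
    first = bisect.bisect_left(node_tokens, token_for(key)) % len(node_tokens)
    return [(first + i) % len(node_tokens) for i in range(rf)]


def load_per_node(num_nodes, num_keys, rf):
    """Count how many row copies each node ends up holding."""
    node_tokens = [i * RING_SIZE // num_nodes for i in range(num_nodes)]
    counts = Counter()
    for n in range(num_keys):
        for node in replicas_for(f"row{n}", node_tokens, rf):
            counts[node] += 1
    return counts
```

With four nodes and a few thousand keys, load_per_node(4, 4000, 1)
gives each node roughly a quarter of the rows; raising rf multiplies
every node's share rather than shifting rows toward one node.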

> I see a lot of mention of using RAID-0, but not RAID-5/6.  Why?  Even though
> Cass can tolerate a down node due to data loss, it would still be more
> efficient to just rebuild a bad hdd live, right?

There's a trade-off - RAID-0 will give better performance, but
rebuilds are over a network.  With RF > 1, RAID-0 is enough that
you're unlikely to lose data, but as you say, replacing a failed
node will be slower.

> Maybe perf related:  Will there be a problem having multiple keyspaces on a
> cluster all with different replication factors, from 1-3?

No.

Richard.

-- 
Richard Low
Acunu | http://www.acunu.com | @acunu