Posted to user@cassandra.apache.org by Trevor Francis <tr...@tgrahamcapital.com> on 2012/04/18 18:33:47 UTC

Single Vs. Multiple Keyspaces

We are launching a data-intensive application that will store upwards of 50 million 150-byte records per day per user. We have chosen Cassandra as our database technology and Flume to load the data from log files into the database.
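
For a sense of scale, the figures above can be turned into rough per-user numbers. This is only a back-of-the-envelope sketch: it counts raw record payload and ignores replication, indexes, and on-disk overhead, and the variable names are illustrative.

```python
# Rough per-user sizing from the figures quoted in the post.
RECORDS_PER_DAY = 50_000_000   # records per day per user (from the post)
RECORD_BYTES = 150             # bytes per record (from the post)
SECONDS_PER_DAY = 24 * 60 * 60

# Raw payload only: no replication factor, no storage overhead.
bytes_per_day = RECORDS_PER_DAY * RECORD_BYTES
gb_per_day = bytes_per_day / 1e9

# Average write rate if the records arrive evenly over the day.
records_per_second = RECORDS_PER_DAY / SECONDS_PER_DAY

print(f"{gb_per_day:.1f} GB/day, about {records_per_second:.0f} records/s on average")
```

That works out to roughly 7.5 GB of raw data per user per day, or a little under 600 writes per second on average, before any bursts.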

Each user is given their own server instance, but the schema of the data for each user will be the same.

We will be performing real-time analysis on this information as part of our application and are considering the advantages and disadvantages of all users sharing the same keyspace. All data will be treated the same as far as replication factor goes; the only difference is that we won't display one user's info to another user. Users will be compartmentalized, and one user's data will never affect or be compared against another user's.

Think of it this way: each user has their own Apache server, which spits out 50 million records per day, and each user will only analyze the data for their particular server, not anyone else's. The log formats are exactly the same.

My experience lies in relational databases, not key-value stores like Cassandra. In the MySQL world we would put each user in their own database to avoid locking contention and to make queries faster.

If we don't post info into different keyspaces, I assume we will have to add an additional field to our records to identify the user that owns each record. How would a single large keyspace affect things like query speed?

Trevor Francis

Re: Single Vs. Multiple Keyspaces

Posted by aaron morton <aa...@thelastpickle.com>.

I would suggest you build one cluster, using all your nodes, and create one keyspace for all users.

There are lots of reasons; here are a few:

* Many nodes in a single cluster spread the load and give you fault tolerance.
* Read and write requests can be distributed across a many-node cluster.
* Cassandra caches and OS-level file caches will be shared.
* Cassandra does not suffer from locking and contention during reads and writes.
* You can prefix row keys to create "virtual keyspaces".
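
As a minimal sketch of the row-key prefix idea, assuming a ":" separator and string user IDs (the helper names, the separator, and the key layout are all hypothetical and not part of any Cassandra API):

```python
from typing import Tuple

# Assumption: user IDs never contain this separator character.
SEPARATOR = ":"


def make_row_key(user_id: str, record_key: str) -> str:
    """Prefix a record key with its owner's ID so each user gets a 'virtual keyspace'."""
    if SEPARATOR in user_id:
        raise ValueError("user_id must not contain the separator")
    return f"{user_id}{SEPARATOR}{record_key}"


def split_row_key(row_key: str) -> Tuple[str, str]:
    """Recover (user_id, record_key) from a prefixed row key."""
    user_id, record_key = row_key.split(SEPARATOR, 1)
    return user_id, record_key


def belongs_to(row_key: str, user_id: str) -> bool:
    """Check ownership; the separator is included so 'user4' does not match 'user42'."""
    return row_key.startswith(user_id + SEPARATOR)


key = make_row_key("user42", "2012-04-18")
print(key)                       # user42:2012-04-18
print(split_row_key(key))        # ('user42', '2012-04-18')
print(belongs_to(key, "user4"))  # False
```

Here the record key is just a date string for illustration; the application would build every read and write key through helpers like these, so one user's queries never touch another user's rows.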

Hope that helps. 

Aaron

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com
