You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@cassandra.apache.org by Oleg Ivanov <na...@gmail.com> on 2010/04/28 16:32:52 UTC

Correct data model for Cassandra

Hello,

our company has a huge table in a relational database which keeps statistics
of some financional operations.
It looks like the following:
SERVER_ID - server, which served the transaction
ACCOUNT_FROM - account1
ACCOUNT_TO - account2
HOUR - time range for this statistics row (from 0 minutes to 59 minutes of
the hour)
SERVICE_ID - a service
GATEWAY_ID - a gateway
TRANSACTIONS_COUNT - transactions count per the hour
AMOUNT -  $ per the hour

There are different reports, which use data from the table - examples are:
show the amount per account, show the transactions count per server and so
on.. Only TRANSACTIONS_COUNT and AMOUNT columns is actually "data columns",
others act like a "filter" to group the results.
The table is huge and some queries run more than 2 minutes - this is
unacceptable, we need almost real time responses. Moreover, the cost of
database servers' support grows and grows. So there is an idea to move
statistical data of this kind to the Cassandra.

The question is - what is the correct data model for the Cassandra?

I'm thinking of creating a SuperColumn consisted of all columns described
above and map an auto generated key (ROW_ID) to it. Then create all indexes
like SERVER_ID -> ROW_ID, ACCOUNT_FROM -> ROW_ID and so on. But it looks a
little bit weird.

How to do things right? )

Thanks in advance,
Alex

Re: Correct data model for Cassandra

Posted by Oleg Ivanov <na...@gmail.com>.

Thanks Ellis,
so the common scenario is to store data in one CF and any index (inverted?)
in another CF?




2010/4/30 Jonathan Ellis <jb...@gmail.com>

> the correct data model is one where you can pull the data you want out
> as a slice of a row, or (sometimes) as a slice of sequential rows.
> usually this involves writing the same data to multiple columnfamilies
> at insertion time, so when you do queries you don't need to do joins.
>
>

Re: Correct data model for Cassandra

Posted by Jonathan Ellis <jb...@gmail.com>.

the correct data model is one where you can pull the data you want out
as a slice of a row, or (sometimes) as a slice of sequential rows.
usually this involves writing the same data to multiple columnfamilies
at insertion time, so when you do queries you don't need to do joins.

On Wed, Apr 28, 2010 at 9:32 AM, Oleg Ivanov <na...@gmail.com> wrote:
> Hello,
>
> our company has a huge table in a relational database which keeps statistics
> of some financional operations.
> It looks like the following:
> SERVER_ID - server, which served the transaction
> ACCOUNT_FROM - account1
> ACCOUNT_TO - account2
> HOUR - time range for this statistics row (from 0 minutes to 59 minutes of
> the hour)
> SERVICE_ID - a service
> GATEWAY_ID - a gateway
> TRANSACTIONS_COUNT - transactions count per the hour
> AMOUNT -  $ per the hour
>
> There are different reports, which use data from the table - examples are:
> show the amount per account, show the transactions count per server and so
> on.. Only TRANSACTIONS_COUNT and AMOUNT columns is actually "data columns",
> others act like a "filter" to group the results.
> The table is huge and some queries run more than 2 minutes - this is
> unacceptable, we need almost real time responses. Moreover, the cost of
> database servers' support grows and grows. So there is an idea to move
> statistical data of this kind to the Cassandra.
>
> The question is - what is the correct data model for the Cassandra?
>
> I'm thinking of creating a SuperColumn consisted of all columns described
> above and map an auto generated key (ROW_ID) to it. Then create all indexes
> like SERVER_ID -> ROW_ID, ACCOUNT_FROM -> ROW_ID and so on. But it looks a
> little bit weird.
>
> How to do things right? )
>
> Thanks in advance,
> Alex
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com