Posted to user@cassandra.apache.org by Arun Cherian <ar...@gmail.com> on 2010/12/10 22:42:53 UTC

Cassandra for Ad-hoc Aggregation and formula calculation

Hi,

I have been reading up on Cassandra for the past few weeks and I am
highly impressed by the features it offers. At work, we are starting
work on a product that will handle several million CDRs (Call Data
Records, which can basically be thought of as .CSV files) per day. We
will have to store the data and perform aggregations and calculations
on it. A few veteran RDBMS admin friends (we are a small .NET shop; we
don't have any in-house DB talent) recommended Infobright and NoSQL to
us, hence my search. I was wondering if Cassandra is a good fit for:

1. Storing several million data records per day (each record will be a
few KB in size) without any data loss.
2. Aggregation of certain fields in the stored records, like averages
across a time period.
3. Using certain existing fields to calculate new values on the fly
and storing them too.
4. We were wondering if pre-aggregation is a good choice (calculating
aggregates per 1 min, 5 min, 15 min, etc. ahead of time), but in case
we need ad-hoc aggregation, does Cassandra support that over this
amount of data?

Thanks,
Arun

Re: Cassandra for Ad-hoc Aggregation and formula calculation

Posted by Aaron Morton <aa...@thelastpickle.com>.
Nice email Dan.

I would also add that if you are still in the initial stages, take a look at Hadoop + Pig. If your source data is write-once, read-many, it may be a better fit, but then you would also need to calculate the aggregates and store them somewhere.
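
To make the batch option concrete, here is a very rough Hadoop Streaming style sketch in Python (the CDR field positions are assumptions, purely for illustration):

    #!/usr/bin/env python
    # Minimal Hadoop Streaming sketch: pre-compute the average call duration
    # per hour from raw CDR lines in CSV form. The field positions (start
    # time in column 0 as "YYYY-MM-DD HH:MM:SS", duration in seconds in
    # column 2) are assumptions for illustration only.
    import csv
    import sys

    def mapper(stdin=sys.stdin, stdout=sys.stdout):
        for row in csv.reader(stdin):
            try:
                hour = row[0][:13]            # "YYYY-MM-DD HH" bucket
                duration = float(row[2])
            except (IndexError, ValueError):
                continue                      # skip malformed lines
            stdout.write("%s\t%s\n" % (hour, duration))

    def reducer(stdin=sys.stdin, stdout=sys.stdout):
        # Input arrives grouped by key (the hour) thanks to the shuffle.
        current, total, count = None, 0.0, 0
        for line in stdin:
            hour, duration = line.rstrip("\n").split("\t")
            if hour != current and current is not None:
                stdout.write("%s\t%.2f\n" % (current, total / count))
                total, count = 0.0, 0
            current = hour
            total += float(duration)
            count += 1
        if current is not None:
            stdout.write("%s\t%.2f\n" % (current, total / count))

    if __name__ == "__main__":
        mapper() if "map" in sys.argv[1:] else reducer()

The (hour, average) pairs it emits can then be stored in Cassandra, or wherever the reports end up living.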

So Cassandra *may* be just what you want. The ability to keep large amounts of data online with high performance, and to remove other servers from your stack, is a definite plus.

Aaron 

RE: Cassandra for Ad-hoc Aggregation and formula calculation

Posted by Dan Hendry <da...@gmail.com>.
Perhaps other, more experienced and reputable contributors to this list can comment, but to be frank: Cassandra is probably not for you (at least for now). I personally feel Cassandra is one of the stronger NoSQL options out there and has the potential to become the de facto standard, but it's not quite there yet and does not inherently meet your requirements.

To give you some background, I started experimenting with Cassandra as a personal project to avoid having to look through days' worth of server logs (and because I thought it was cool). The project ballooned and has become my organization's primary metrics and analytics platform, which currently processes 200 million+ events/records per day. I doubt any traditional database solution could have performed as well as Cassandra, but the development and operations process has not been without severe growing pains.

> 1. Storing several million data records per day (each record will be a
> few KB in size) without any data loss.

Absolutely, no problems on this front. A cluster of moderately beefy servers will handle this with no complaints. As long as you are careful to avoid hotspots in your data distribution, Cassandra truly is damn near linearly scalable with hardware. 
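
To make the hotspot point a bit more concrete (purely illustrative, not a description of any particular schema): if every write for a given hour lands in a single hour-keyed row, that row's replicas take all of the hour's write load. Spreading each hour across a handful of shard rows, as below, keeps the load even; readers fetch all the shards and merge.

    import random

    NUM_SHARDS = 16  # illustrative; tune to cluster size and write volume

    def duration_row_key(call_start):
        """Row key for a hypothetical CallDurationByHour column family.

        Writes for a given hour are spread over NUM_SHARDS rows so that no
        single row (and its replicas) takes the whole hour's write load.
        """
        hour_bucket = call_start.strftime("%Y%m%d%H")
        return "%s:%02d" % (hour_bucket, random.randrange(NUM_SHARDS))

    def all_shard_keys(hour_bucket):
        """The keys a reader slices and merges to rebuild one hour."""
        return ["%s:%02d" % (hour_bucket, s) for s in range(NUM_SHARDS)]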

> 2. Aggregation of certain fields in the stored records, like averages
> across a time period.

Cassandra cannot do this on its own (by design, and for good reason). There have been efforts to add support for higher-level data-processing languages (such as Pig and Hive), but they are not 'out of the box' solutions and, in my experience, are difficult to get working properly. I ended up writing my own data processing/report generation framework that works ridiculously well for my particular case. In relation to your requirements, calculating averages across fields would probably have to be implemented manually (and executed as a periodic, automated task). Although non-trivial, this isn't quite as bad as you might think.
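
For illustration only (the keyspace and column family names are made up, and this uses the pycassa Thrift client rather than anything Cassandra ships with), such a periodic task could look roughly like this, run from cron or a scheduler:

    # Rough sketch of a periodic aggregation task: read one hour's worth of
    # raw call durations and write the average back into a summary row.
    # Keyspace/column family names and the pycassa client are assumptions.
    import pycassa

    pool = pycassa.ConnectionPool('CDRKeyspace', server_list=['localhost:9160'])
    raw = pycassa.ColumnFamily(pool, 'CallDurationByHour')  # row = hour, column = TimeUUID -> duration
    summary = pycassa.ColumnFamily(pool, 'HourlySummary')   # row = day, column = hour -> average

    def aggregate_hour(day_key, hour_key, page_size=10000):
        """Average every duration column in one hour row and store the result."""
        total, count, start = 0.0, 0, ''
        while True:
            try:
                cols = raw.get(hour_key, column_start=start, column_count=page_size)
            except pycassa.NotFoundException:
                break                        # empty or missing row
            names = list(cols.keys())
            for name in names:
                if name == start:            # column_start is inclusive; skip the overlap
                    continue
                total += float(cols[name])
                count += 1
            if len(names) < page_size:
                break
            start = names[-1]                # resume from the last column seen
        if count:
            summary.insert(day_key, {hour_key: str(total / count)})

Reports then read the tiny summary row instead of re-scanning the raw data.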

> 3. Using certain existing fields to calculate new values on the fly
> and storing them too.

Not quite sure what you are asking here. To go back to the last point: to calculate anything new, you are probably going to have to load all the records on which that calculation depends into a separate process/server. Generally, I would say Cassandra isn't particularly good at 'on the fly' data aggregation tasks (certainly not to the extent an SQL database is). To be fair, that's also not what it is explicitly designed for or advertised to do well.

> 4. We were wondering if pre-aggregation is a good choice (calculating
> aggregates per 1 min, 5 min, 15 min, etc. ahead of time), but in case
> we need ad-hoc aggregation, does Cassandra support that over this
> amount of data?

Cassandra is GREAT for storing, retrieving and post-processing anything that can be pre-computed. If you have been doing any amount of reading, you will likely have heard that in SQL you model your data, while in Cassandra (and most other NoSQL databases) you model your queries (sorry for ripping off whoever said this originally). If there is one concept I have learned about Cassandra, it is to pre-compute (or asynchronously compute) anything you possibly can, and not to be afraid of writing a ridiculous amount of data to Cassandra.

In terms of ad-hoc aggregation, there is no nice, simple scripting language for Cassandra data processing (e.g. SQL). That said, you can do most things pretty quickly with a bit of code. Consider that loading a few hundred to a few thousand records (< 3k) can be pretty quick (< 100 ms, often < 10 ms, particularly if they are cached). Our organization basically uses the following approach: use Cassandra for generating continuous, 10-second-accuracy time series reports, but MySQL and a production DB replica for any ad-hoc single-value report the boss wants NOW.
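
As a toy illustration of the pre-aggregation you asked about in point 4 (the names, the pycassa client and a TimeUUID column comparator are assumptions, not a description of my setup): write each CDR into one row per granularity at ingest time, and the 'ad-hoc' average for any 1, 5 or 15 minute window becomes a single small row slice.

    # Sketch of a write path that "models the queries": each incoming CDR is
    # written once per time granularity we might want to average over, so each
    # granularity later becomes a cheap single-row slice.
    import uuid
    import pycassa

    pool = pycassa.ConnectionPool('CDRKeyspace', server_list=['localhost:9160'])
    by_bucket = pycassa.ColumnFamily(pool, 'DurationByBucket')

    GRANULARITIES = (60, 300, 900)  # 1 min, 5 min and 15 min buckets, in seconds

    def record_call(epoch_seconds, duration_seconds):
        col_name = uuid.uuid1()                    # TimeUUID column name
        for g in GRANULARITIES:
            bucket_start = int(epoch_seconds) // g * g
            row_key = '%d:%d' % (g, bucket_start)  # e.g. "300:1292018400"
            by_bucket.insert(row_key, {col_name: str(duration_seconds)})

    def average_for_bucket(granularity, bucket_start):
        # Simplified: a real reader would page through very busy buckets.
        try:
            cols = by_bucket.get('%d:%d' % (granularity, bucket_start),
                                 column_count=100000)
        except pycassa.NotFoundException:
            return None
        values = [float(v) for v in cols.values()]
        return sum(values) / len(values)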


Based on what you have described, it sounds like you are thinking about your problem from a SQL-like point of view: store the data once, then query/filter/aggregate it in multiple different ways to obtain useful information. If possible, try to leverage the power of Cassandra and store the data in efficient, per-query pre-optimized forms. For example, I can imagine average call duration being an important parameter in a system analyzing call data records. Instead of storing all the information about a call in one place, store the call durations in a separate column family, with one row per hour and one column per call: the column name is a TimeUUID and the value is a single integer, the call duration. My metrics system does something similar to this and loads batches of 15,000 records (a column slice) in < 200 ms. By parallelizing across 10 threads loading from different rows, I can compute the average, standard deviation and a factor roughly meaning 'how close to Gaussian' for 1 million records in < 5 seconds.
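
Something along these lines, as a sketch only (the layout and names are made up, and real code would page through wide rows rather than pulling one big slice):

    # Stripped-down sketch of that read path: pull the duration columns for
    # several hour rows in parallel threads and compute the mean and standard
    # deviation client-side.
    import math
    import threading
    import pycassa

    pool = pycassa.ConnectionPool('CDRKeyspace', server_list=['localhost:9160'])
    durations = pycassa.ColumnFamily(pool, 'CallDurationByHour')

    def load_row(row_key, out, lock):
        try:
            cols = durations.get(row_key, column_count=100000)
        except pycassa.NotFoundException:
            return
        values = [float(v) for v in cols.values()]
        with lock:                 # the connection pool is thread-safe; the list is not
            out.extend(values)

    def mean_and_stddev(row_keys):
        out, lock = [], threading.Lock()
        threads = [threading.Thread(target=load_row, args=(k, out, lock))
                   for k in row_keys]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        if not out:
            return None, None
        mean = sum(out) / len(out)
        variance = sum((x - mean) ** 2 for x in out) / len(out)
        return mean, math.sqrt(variance)

    # e.g. mean_and_stddev(['2010121000', '2010121001', '2010121002'])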

To reiterate, Cassandra is not the solution if you are looking for 'Database: I command thee to give me the average of field x.' That said, I have found its overall data-processing capabilities to be reasonably impressive.

Dan
