Posted to user@cassandra.apache.org by G P <gi...@hotmail.com> on 2016/06/20 16:12:07 UTC

Estimating partition size for C*2.X and C*3.X and Time Series Data Modelling.

Hello,

I'm currently enrolled in a master's degree and my thesis project involves the use of Big Data tools in the context of Smart Grid applications. I explored several storage solutions and found Cassandra to be a good fit for my problem.
The data is mostly time series data, incoming from multiple PLCs, currently captured and stored by proprietary SCADA software connected to an MSSQL server. Reading about the C* storage engine and how time series should be modelled, it is clear that I will have to use some form of time bucketing to split the data into multiple partitions.
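For example, the bucketing I have in mind would derive a bucket string from each sample's timestamp and make it part of the partition key (the `week_bucket` helper and the ISO-week scheme are just one illustrative choice, not something from the course):

```python
from datetime import datetime

def week_bucket(ts: datetime) -> str:
    """Derive a partition-bucket suffix from a sample timestamp.

    Bucketing by ISO year/week keeps one week of samples per partition;
    the bucket string would become part of the partition key, e.g.
    ('Building1#2016-W25', <timestamp>).
    """
    iso = ts.isocalendar()  # (ISO year, ISO week, ISO weekday)
    return f"{iso[0]}-W{iso[1]:02d}"

# A 10-second sample taken on 2016-06-20 falls in ISO week 25 of 2016.
print(week_bucket(datetime(2016, 6, 20, 16, 12, 7)))  # 2016-W25
```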

Here is the issue: in the MSSQL server, each PLC has very wide tables (5 at the moment for one building), with around 36 columns of data collected every 10 seconds. Queries read as many as 15 columns at a time, with time ranges varying between one hour and a whole month. A direct mapping of the MSSQL tables to C* is not recommended because of the way C*2.X stores its data.

I took the DS220: Data Modelling course, which showcases two formulas for estimating partition size based on the table design.

In plain text, the two formulas are:

    Nv = Nr x (Nc - Npk - Ns) + Ns

    Ps = sizeOf(partition key cols) + sizeOf(static cols)
         + Nr x (sizeOf(clustering cols) + sizeOf(regular cols))
         + 8 x Nv

where Nr is the number of rows, Nc the total number of columns, Npk the number of partition key columns and Ns the number of static columns.
Note: this Ps formula does not account for column name length, TTLs, counter columns, or additional overhead.
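As a sanity check on my arithmetic, here is a direct transcription of the two formulas into Python (the sizeOf sums are passed in as precomputed byte counts; the function names are my own):

```python
def n_values(n_rows, n_cols, n_pk, n_static):
    """Nv = Nr x (Nc - Npk - Ns) + Ns: total number of cell values."""
    return n_rows * (n_cols - n_pk - n_static) + n_static

def partition_size(pk_bytes, static_bytes, row_bytes, n_rows, nv):
    """Ps: partition key and static column bytes, plus the clustering and
    regular column bytes of every row, plus an 8-byte timestamp per value."""
    return pk_bytes + static_bytes + n_rows * row_bytes + 8 * nv

# Toy check: 1000 rows, 4 columns (1 partition key, 1 clustering, 2 regular
# 8-byte doubles), a 10-byte text key, no statics.
nv = n_values(n_rows=1000, n_cols=4, n_pk=1, n_static=0)
print(nv, partition_size(10, 0, 8 + 2 * 8, 1000, nv))  # 3000 48010
```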

If my calculations are correct, with a table such as the one below and a time resolution of 10 seconds, the Ps (partition size) would be just shy of the often-recommended 10 MB if I partitioned it weekly.

CREATE TABLE TEST (
    BuildingAnalyzer text,
    Time timestamp,
    P1 double,
    P2 double,
    P3 double,
    Acte1 int,
    Acte2 int,
    Acte3 int,
    PRIMARY KEY (BuildingAnalyzer, Time)
);
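Applying the 2.X formulas to this table with one week of 10-second samples per partition, using the fixed CQL type sizes (timestamp and double: 8 bytes, int: 4 bytes; the partition key length is an assumed value):

```python
# One week of 10-second samples per partition:
n_rows = 7 * 24 * 60 * 60 // 10          # 60480 rows

# TEST table: Nc = 8 columns, Npk = 1, Ns = 0 -> 7 values per row.
nv = n_rows * (8 - 1 - 0)                # Nv = 423360

# Bytes per row: Time (timestamp, 8) + P1..P3 (double, 3 x 8)
# + Acte1..Acte3 (int, 3 x 4).
row_bytes = 8 + 3 * 8 + 3 * 4            # 44

pk_bytes = 16                            # assumed length of the text key
ps = pk_bytes + n_rows * row_bytes + 8 * nv
print(ps / 2**20)                        # ~5.8 MiB, under the 10 MB guideline
```

Note that the 8 x Nv timestamp term alone (about 3.4 MB here) is more than half of the estimate.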

However, as of C*3.0, a major refactor of the storage engine improved storage efficiency. From what I could gather in [1], clustering columns and column names are no longer repeated for each value in a row and, among other things, the timestamps used for conflict resolution (the 8 × Nv term of the 2nd formula) can be stored only once per row when they share the same value, and are encoded as varints.
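To get a feel for the difference, here is a rough re-estimate of the same weekly partition under the 3.0 layout, assuming one shared varint-encoded timestamp per row instead of 8 bytes per value (the 2-byte per-row figure is my guess, and row/header overheads are ignored entirely):

```python
n_rows = 7 * 24 * 60 * 60 // 10     # 60480 rows per weekly partition
row_bytes = 8 + 3 * 8 + 3 * 4       # clustering + regular column data (44)

# C* 3.0 (assumed): one timestamp per row, varint-encoded as a small
# delta, so guess ~2 bytes per row instead of 8 bytes per value.
ts_bytes_per_row = 2
ps_30 = n_rows * (row_bytes + ts_bytes_per_row)

# The 2.X estimate for the same data, for comparison (7 values per row):
ps_20 = n_rows * row_bytes + 8 * n_rows * 7
print(ps_30 / 2**20, ps_20 / 2**20)   # roughly 2.7 MiB vs 5.8 MiB
```

Even with such crude assumptions the data itself dominates, which is why I am unsure the old formulas still say anything useful.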

I also read [2], which explains the storage format in intricate detail, but it adds too much complexity for a simple estimation formula.

Is there any way to estimate the partition size of a table with formulas similar to the ones above?
Should I just model my tables the way metric collection systems usually do, with generic "parametername" and "value" columns?
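For comparison, a back-of-the-envelope 2.X estimate for that narrow model, with one partition per (analyzer, parameter name) per week (again my own arithmetic with fixed type sizes, and an assumed key length):

```python
n_rows = 7 * 24 * 60 * 60 // 10    # 60480 samples per parameter per week

# Narrow table (assumed): ((BuildingAnalyzer, parametername), Time) -> value
# Nc = 4, Npk = 2, Ns = 0, so 2 values per row (Time + value).
nv = n_rows * (4 - 2 - 0)
row_bytes = 8 + 8                  # timestamp + double value
pk_bytes = 24                      # assumed analyzer + parameter name text
ps = pk_bytes + n_rows * row_bytes + 8 * nv
print(ps / 2**20)                  # ~1.8 MiB per parameter per week
```

Each partition shrinks, but a 15-column query would then have to read 15 partitions instead of one.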


[1]    http://www.datastax.com/2015/12/storage-engine-30

[2]    http://thelastpickle.com/blog/2016/03/04/introductiont-to-the-apache-cassandra-3-storage-engine.html

Sorry for the long wall of text,
Best regards,
Gil Pinheiro.