You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@cassandra.apache.org by "Benjamin Lerer (JIRA)" <ji...@apache.org> on 2016/01/12 10:08:40 UTC
[jira] [Commented] (CASSANDRA-9778) CQL support for time series
aggregation
[ https://issues.apache.org/jira/browse/CASSANDRA-9778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15093569#comment-15093569 ]
Benjamin Lerer commented on CASSANDRA-9778:
-------------------------------------------
While working on CASSANDRA-10707, I started to think about that windowed aggregates problem.
I really think that it is a usefull functionality that we should have. I have used it for years, with MySQL, for analysing data.
Regarding the implementation, even if in general I also prefer to follow the SQL syntax, I do not believe it will be a good fit for Cassandra.
If we have a table like:
{code}
CREATE TABLE trades
{
symbol text,
date date,
time time,
priceMantissa int,
priceExponent tinyint,
volume int,
PRIMARY KEY ((symbol, date), time)
};
{code}
The trades will be inserted with an increasing time and sorted in the same order, which is really the use case targeted by this ticket and by CASSANDRA-10707. As we can have to process a large amount of data, we want to try to limit ourself to the cases where we can build the groups on the flight (which is not a requirement in the SQL world).
If we want to get the number of trades per minutes with the SQL syntax we will have to write:
{{SELECT hour(time), minute(time), count((*)) FROM Trades WHERE symbol = 'AAPL' AND date = '2016-01-11' GROUP BY hour(time), minute(time);}}
which is fine. The problem is that if the user invert by mistake the functions like that:
{{SELECT hour(time), minute(time), count((*)) FROM Trades WHERE symbol = 'AAPL' AND date = '2016-01-11' GROUP BY minute(time), hour(time);}}
the query will return weird results if it is a normal SELECT and will be pretty inefficient within a MV.
The only way to prevent that would be to check the function order and make sure that we do not allow to skip functions (e.g. {{ GROUP BY hour(time), second(time)}}).
In my opinion a function like {{floor(<columnName>, <time range>)}} will be much better as it does not allow for this type of mistakes and is much more flexible (you can create 5 minutes buckets if you want to).
{{SELECT floor(time, m), count((*)) FROM Trades WHERE symbol = 'AAPL' AND date = '2016-01-11' GROUP BY floor(time, m);}}
> CQL support for time series aggregation
> ---------------------------------------
>
> Key: CASSANDRA-9778
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9778
> Project: Cassandra
> Issue Type: New Feature
> Components: CQL
> Reporter: Jonathan Ellis
> Assignee: Benjamin Lerer
> Fix For: 3.x
>
>
> Along with MV (CASSANDRA-6477), time series aggregation or "rollups" are a common design pattern in cassandra applications. I'd like to add CQL support for this along these lines:
> {code}
> CREATE MATERIALIZED VIEW stocks_by_hour AS
> SELECT exchange, day, day_time(1h) AS hour, symbol, avg(price), sum(volume)
> FROM stocks
> GROUP BY exchange, day, symbol, hour
> PRIMARY KEY ((exchange, day), hour, symbol);
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)