You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@cassandra.apache.org by "Benjamin Lerer (JIRA)" <ji...@apache.org> on 2016/01/12 10:08:40 UTC
[jira] [Commented] (CASSANDRA-9778) CQL support for time series aggregation

    [ https://issues.apache.org/jira/browse/CASSANDRA-9778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15093569#comment-15093569 ] 

Benjamin Lerer commented on CASSANDRA-9778:
-------------------------------------------

While working on CASSANDRA-10707, I started to think about that windowed aggregates problem.

I really think that it is a usefull functionality that we should have. I have used it for years, with MySQL, for analysing data.

Regarding the implementation, even if in general I also prefer to follow the SQL syntax, I do not believe it will be a good fit for Cassandra.

If we have a table like:
{code}
CREATE TABLE trades
{
    symbol text,
    date date,
    time time,
    priceMantissa int,
    priceExponent tinyint,
    volume int,
    PRIMARY KEY ((symbol, date), time)
};
{code}
The trades will be inserted with an increasing time and sorted in the same order, which is really the use case targeted by this ticket and by CASSANDRA-10707. As we can have to process a large amount of data, we want to try to limit ourself to the cases where we can build the groups on the flight (which is not a requirement in the SQL world).

If we want to get the number of trades per minutes with the SQL syntax we will have to write:
{{SELECT hour(time), minute(time), count((*)) FROM Trades WHERE symbol = 'AAPL' AND date = '2016-01-11' GROUP BY hour(time), minute(time);}}
which is fine. The problem is that if the user invert by mistake the functions like that:
{{SELECT hour(time), minute(time), count((*)) FROM Trades WHERE symbol = 'AAPL' AND date = '2016-01-11' GROUP BY minute(time), hour(time);}}
the query will return weird results if it is a normal SELECT and will be pretty inefficient within a MV.
The only way to prevent that would be to check the function order and make sure that we do not allow to skip functions (e.g. {{ GROUP BY  hour(time), second(time)}}).

In my opinion a function like {{floor(<columnName>, <time range>)}} will be much better as it does not allow for this type of mistakes and is much more flexible (you can create 5 minutes buckets if you want to).
{{SELECT floor(time, m), count((*)) FROM Trades WHERE symbol = 'AAPL' AND date = '2016-01-11' GROUP BY floor(time, m);}}
 
    
  

> CQL support for time series aggregation
> ---------------------------------------
>
>                 Key: CASSANDRA-9778
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9778
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: CQL
>            Reporter: Jonathan Ellis
>            Assignee: Benjamin Lerer
>             Fix For: 3.x
>
>
> Along with MV (CASSANDRA-6477), time series aggregation or "rollups" are a common design pattern in cassandra applications.  I'd like to add CQL support for this along these lines:
> {code}
> CREATE MATERIALIZED VIEW stocks_by_hour AS
> SELECT exchange, day, day_time(1h) AS hour, symbol, avg(price), sum(volume)
> FROM stocks
> GROUP BY exchange, day, symbol, hour
> PRIMARY KEY  ((exchange, day), hour, symbol);
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)