Posted to commits@cassandra.apache.org by "Anton Slutsky (JIRA)" <ji...@apache.org> on 2015/02/17 06:28:12 UTC

[jira] [Commented] (CASSANDRA-4914) Aggregation functions in CQL

    [ https://issues.apache.org/jira/browse/CASSANDRA-4914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14323669#comment-14323669 ] 

Anton Slutsky commented on CASSANDRA-4914:
------------------------------------------

Hello all,

I noticed that some of the aggregate functions discussed on this thread made it into trunk.  I'm a little concerned about the implementation.  It looks like aggregates such as sum and avg are implemented by looping through the result set pages and computing the desired aggregates in code.  Since Cassandra is meant for large volumes of data, I'm worried that this is not a feasible implementation for real-world cases.  I tried using avg on a more or less sizable dataset and observed two things -- first, my select statement would time out even with a bumped-up read timeout setting, and second, the CPU running the average computation was quite busy.
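
To make the concern concrete, here is a minimal Python sketch of the pattern I am describing. It is not the actual trunk code; `pages` and `paged_avg` are hypothetical names, and an in-memory list stands in for a paged result set. The point is only that every row of every page has to be visited, so the work grows linearly with the size of the data.

```python
from typing import Iterable, Iterator, List


def pages(values: List[float], page_size: int) -> Iterator[List[float]]:
    """Yield fixed-size pages, standing in for paged query results (hypothetical)."""
    for start in range(0, len(values), page_size):
        yield values[start:start + page_size]


def paged_avg(result_pages: Iterable[List[float]]) -> float:
    """Compute an exact average by visiting every row of every page."""
    total = 0.0
    count = 0
    for page in result_pages:
        for value in page:
            total += value
            count += 1
    if count == 0:
        raise ValueError("avg of an empty result set")
    return total / count


if __name__ == "__main__":
    # Salaries from the example in the issue description.
    print(paged_avg(pages([10.1, 100.0, 1000.0], page_size=2)))
```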

Obviously, there's only so much that can be done in terms of computing these aggregates without resorting to some sort of distributed computation framework, but I'd like to suggest a slightly different approach.  I wonder if we can rethink how we approach aggregate functions in the context of large data.  Perhaps we could consider probabilistic aggregates instead of exactly computed ones?  That is, instead of striving to compute an aggregate over an entire result set, maybe we can compute an estimate together with a stated probability of that estimate being correct.

For example:

select probabilistic_avg(my_col) from my_table;

would return something like a map:

{"avg":101.1, "prob":0.78}

where "avg" is our probabilistic avg and "prob" is the probability of it being what we say it is.
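
One possible reading of this, sketched in Python under loud assumptions: the estimate comes from a uniform random sample of the column, and "prob" is taken to mean the probability that the true average lies within a caller-chosen `tolerance` of the estimate, using a normal approximation. The function signature, the `tolerance` and `sample_size` parameters, and this interpretation of "prob" are all my own illustration; nothing like this exists in CQL today.

```python
import math
import random
from typing import Dict, Sequence


def probabilistic_avg(values: Sequence[float], sample_size: int,
                      tolerance: float, seed: int = 0) -> Dict[str, float]:
    """Estimate the mean from a random sample (hypothetical helper).

    "prob" approximates P(|estimate - true mean| <= tolerance) with a
    normal approximation; it ignores the finite-population correction.
    """
    rng = random.Random(seed)
    n = min(sample_size, len(values))
    if n < 2:
        raise ValueError("need at least two values to estimate")
    sample = rng.sample(list(values), n)
    mean = sum(sample) / n
    # Unbiased sample variance, then the standard error of the mean.
    variance = sum((x - mean) ** 2 for x in sample) / (n - 1)
    std_err = math.sqrt(variance / n)
    if std_err == 0.0:
        prob = 1.0  # every sampled value was identical
    else:
        # For a standard normal Z, P(|Z| <= z) = erf(z / sqrt(2)).
        prob = math.erf(tolerance / (std_err * math.sqrt(2.0)))
    return {"avg": mean, "prob": prob}


if __name__ == "__main__":
    data = [float(i % 10) for i in range(1000)]
    print(probabilistic_avg(data, sample_size=100, tolerance=0.5))
```

A larger sample or a looser tolerance pushes "prob" toward 1, which is the knob that would let a user trade accuracy for speed.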

Of course, that won't be as good as the real thing, but I think it still has value in many cases.  And it can be implemented in a scalable way with some scratch system tables.

I'm happy to take a stab at it if this is of interest to anyone.

> Aggregation functions in CQL
> ----------------------------
>
>                 Key: CASSANDRA-4914
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4914
>             Project: Cassandra
>          Issue Type: New Feature
>            Reporter: Vijay
>            Assignee: Benjamin Lerer
>              Labels: cql, docs
>             Fix For: 3.0
>
>         Attachments: CASSANDRA-4914-V2.txt, CASSANDRA-4914-V3.txt, CASSANDRA-4914-V4.txt, CASSANDRA-4914-V5.txt, CASSANDRA-4914.txt
>
>
> The requirement is to do aggregation of data in Cassandra (wide rows of column values of int, double, float, etc.).
> With some basic aggregate functions like AVG, SUM, Mean, Min, Max, etc. (for the columns within a row).
> Example:
> SELECT * FROM emp WHERE empID IN (130) ORDER BY deptID DESC;                                    
>  empid | deptid | first_name | last_name | salary
> -------+--------+------------+-----------+--------
>    130 |      3 |     joe    |     doe   |   10.1
>    130 |      2 |     joe    |     doe   |    100
>    130 |      1 |     joe    |     doe   |  1e+03
>  
> SELECT sum(salary), empid FROM emp WHERE empID IN (130);                                    
>  sum(salary) | empid
> -------------+--------
>    1110.1    |  130


