Posted to commits@cassandra.apache.org by "Cristian O (JIRA)" <ji...@apache.org> on 2015/03/18 15:39:38 UTC

[jira] [Issue Comment Deleted] (CASSANDRA-8826) Distributed aggregates

     [ https://issues.apache.org/jira/browse/CASSANDRA-8826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cristian O updated CASSANDRA-8826:
----------------------------------
    Comment: was deleted

(was: Hi Benedict,

Very nicely put :) Let's see if reason can eventually prevail...

I'm not sure if you're aware of Vertica. It's the pioneering columnar
store, originally created by M. Stonebraker
(I highly recommend looking him up and finding his papers if you're not
aware of him).

Vertica is probably one of the best available analytics databases; however,
it's commercial and quite expensive.

There's a paper on Vertica describing its architecture here:

http://www.vldb.org/pvldb/vol7/p1259-gupta.pdf

You'll see that its distribution model and even parts of the storage
engine design are remarkably similar to Cassandra's. This is not accidental,
as they are both shared-nothing architectures.

Cassandra is quite well suited to implementing some of the main analytical
use cases, probably with minimal effort, and there would be a lot of
interest in this market if it succeeds.

As I mentioned yesterday, a very interesting use case is to do simple
aggregations over large amounts of data points (mainly time series) very
fast (under 5 secs) for a large number of users (many concurrent requests).

Spark/MR do not have the right architecture for this. In the OSS world, a
direct competitor would be Impala (almost shared-nothing) and perhaps HBase,
which I hear is trying to position itself towards this.


Cheers,
Cristian






)

> Distributed aggregates
> ----------------------
>
>                 Key: CASSANDRA-8826
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8826
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Robert Stupp
>            Priority: Minor
>
> Aggregations have been implemented in CASSANDRA-4914.
> All calculation is performed on the coordinator. This means that all data is pulled by the coordinator and processed there.
> This ticket is about distributing aggregates to make them more efficient. Some related tickets (esp. CASSANDRA-8099) are currently in progress - we should wait for them to land before talking about implementation.
> Another playground (not covered by this ticket) that might be related is _distributed filtering_.
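
To illustrate the difference the ticket description draws between coordinator-side and distributed aggregation, here is a minimal sketch for an average. It is not Cassandra code: the names PartialAvg, replica_partial, coordinator_only_avg and distributed_avg are hypothetical, and the sketch assumes each row lives on exactly one replica (it ignores replication factor and consistency levels).

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class PartialAvg:
    """Hypothetical partial state a replica returns instead of raw rows."""
    total: float = 0.0
    count: int = 0

    def merge(self, other: "PartialAvg") -> "PartialAvg":
        # Merging is associative, so partial states can be combined in any order.
        return PartialAvg(self.total + other.total, self.count + other.count)

    def result(self) -> float:
        if self.count == 0:
            raise ValueError("cannot average zero rows")
        return self.total / self.count


def replica_partial(values: Iterable[float]) -> PartialAvg:
    """Work done locally on a replica: reduce its rows to a (sum, count) pair."""
    total, count = 0.0, 0
    for v in values:
        total += v
        count += 1
    return PartialAvg(total, count)


def coordinator_only_avg(replica_rows: List[List[float]]) -> float:
    """Coordinator-side model: every row is shipped to the coordinator first."""
    all_rows = [v for chunk in replica_rows for v in chunk]
    return sum(all_rows) / len(all_rows)


def distributed_avg(replica_rows: List[List[float]]) -> float:
    """Distributed model: only one small partial state per replica is shipped."""
    state = PartialAvg()
    for chunk in replica_rows:
        state = state.merge(replica_partial(chunk))
    return state.result()


if __name__ == "__main__":
    rows_per_replica = [[1.0, 2.0, 3.0], [4.0], [5.0, 6.0]]
    print(coordinator_only_avg(rows_per_replica))
    print(distributed_avg(rows_per_replica))
```

Both functions return the same value; the difference is how much data crosses the network. An average cannot be merged from per-replica averages alone, which is why the partial state carries both a sum and a count.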



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)