Posted to user@cassandra.apache.org by DuyHai Doan <do...@gmail.com> on 2014/10/01 01:17:24 UTC

Re: Cassandra and frequent updates

Hello Matthias

 According to your description, an event-sourcing design would be a good
fit for your scenario.

 In Cassandra, instead of "updating" existing data, why not just store
new values (deltas only are fine, not a problem) with a monotonically
increasing timestamp?

 This way, in your analysis job, you can sequentially read through the
deltas and compute any intermediate or final state for each datum.
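
 For illustration (the table and column names below are invented for the
example), such a delta table could look like this in CQL:

    CREATE TABLE event_deltas (
        source_id  text,        -- which datum the delta belongs to
        event_time timestamp,   -- monotonically increasing write time
        delta      text,        -- the new value or delta payload
        PRIMARY KEY (source_id, event_time)
    ) WITH CLUSTERING ORDER BY (event_time DESC);

    -- Append a new delta instead of updating in place
    INSERT INTO event_deltas (source_id, event_time, delta)
    VALUES ('datum-42', '2014-09-30 12:00:00+0000', '{"value": 17}');

    -- The analysis job reads all deltas for one datum in time order
    SELECT event_time, delta FROM event_deltas
    WHERE source_id = 'datum-42';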

 Now, to mitigate the high insertion (churn) rate you were mentioning,
bucketing techniques (
http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra) can
be applied to distribute the load across nodes and keep any single
partition from growing too wide.
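
 For example (again with invented names), adding a day bucket to the
partition key bounds the partition size and spreads the write load:

    CREATE TABLE event_deltas_bucketed (
        source_id  text,
        day        text,        -- e.g. '2014-09-30', the bucket
        event_time timestamp,
        delta      text,
        PRIMARY KEY ((source_id, day), event_time)
    );

    -- Reads then target one bounded partition at a time
    SELECT event_time, delta FROM event_deltas_bucketed
    WHERE source_id = 'datum-42' AND day = '2014-09-30';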

To manage deletion of old data, there are 3 strategies (sketched in CQL
after the list):

1) Delete entire partitions to reclaim space. This is efficient because it
creates only a single partition tombstone, but you need to schedule the
partition deletes yourself. Configure gc_grace_seconds & compaction
parameters wisely.

2) Rely on the TTL feature of Cassandra; again, be careful to configure
gc_grace_seconds & compaction parameters wisely.

3) Insert each day's data into its own table and truncate old tables to
reclaim disk space.
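
 A rough sketch of the 3 strategies, still using the invented table names
from above (the TTL of 14 days is only an example value):

    -- 1) Delete a whole old partition: a single partition tombstone
    DELETE FROM event_deltas_bucketed
    WHERE source_id = 'datum-42' AND day = '2014-08-01';

    -- 1) & 2) Tune gc_grace_seconds (and compaction) on the table
    ALTER TABLE event_deltas_bucketed WITH gc_grace_seconds = 86400;

    -- 2) Or let a TTL expire old data automatically (TTL in seconds)
    INSERT INTO event_deltas_bucketed (source_id, day, event_time, delta)
    VALUES ('datum-42', '2014-09-30', '2014-09-30 12:00:00+0000', '{"v": 17}')
    USING TTL 1209600;  -- 14 days

    -- 3) One table per day: truncate an old table to reclaim disk space
    TRUNCATE event_deltas_20140901;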


 Regards

 Duy Hai DOAN

On Tue, Sep 30, 2014 at 12:55 PM, Matthias Hübner <ma...@gmail.com>
wrote:

> Hi all,
>
> I'm unsure if Cassandra is appropriate for my use case:
>
> Maintain a query model.
>
> Collect data from several sources (asynchronously) and merge it into
> aggregates (rows) in one cassandra table.
> The data is mostly updated, except for the initial load or when adding
> new data ranges.
> Some sources deliver completely new data each day, others only deltas.
> The aggregated data is a set of flat and list columns; lists will be
> updated all at once (like a single column).
> The updates also run in parallel / asynchronously on the same rows,
> spread over the day.
> There are no deletes on rows or columns.
> The table will carry around 100 million rows.
>
> Analysis job
>
> A job runs asynchronously one or more times a day to scan the "query
> model" table with a few criteria and
> reads ranges of complete rows to generate a kind of analysis output.
> The output, also several rows of aggregates, is inserted into a second
> table;
> old data will never be updated but will be deleted after some weeks.
>
>
> My hope was to use aggregates and in-place updates, lock-free and fast
> writes, easy upserts, and transparent scaling.
> But I have concerns because of the update scenario / high churn rate.
> Is this more of an anti-pattern for Cassandra?
> Or would it be better to have multiple query models because of the
> updates, but with the need to read from multiple tables (instead of one
> query model with all data in one row)?
>
> Thank you,
>
> Ciao,
> Matthias
>