You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@cassandra.apache.org by Jacob Rhoden <ja...@me.com> on 2013/11/09 00:02:16 UTC

Best data structure for tracking most recent updates.

I need to be able to show the most recent changes that have occurred in a system, I understand inserting every update into a tracking table and deleting old updates may not be great, as I may end up creating millions of tombstones. i.e. don't do this:

create table recent_updates(uuid timeuuid primary key, message text);
insert into recent_updates(now(), 'the message');
insert into recent_updates(now(), 'the message');
....
insert into recent_updates(now(), 'the message');
// delete all but the most recent ten messages.

So how do people solve it? The following option occurs to me, but I am not sure if its the best option:

create table recent_updates(record int primary key, message text, uuid timeuuid);
insert into recent_updates(1, 'the message', now());
insert into recent_updates(2, 'the message', now());
....
insert into recent_updates(10, 'the message', now());
// rotate back to 1
insert into recent_updates(1, 'the message', now());

Doing it this way would require a query to find out what number in the sequence we are up to.

Best regards,
Jacob

Re: Best data structure for tracking most recent updates.

Posted by "Laing, Michael" <mi...@nytimes.com>.
Here are a couple ideas:

1. You can rotate tables and truncate to avoid deleting.
2. You can shard your tables (partition key) to mitigate hotspots.
3. You can use a column key to store rows in timeuuid sequence.

create table recent_updates_00 (shard text, uuid timeuuid, message text,
primary key (shard, uuid));
create table recent_updates_01 (shard text, uuid timeuuid, message text,
primary key (shard, uuid)));
...

You can determine 'shard' randomly within a range, e.g. 1 of 24 shards,
when you write. Sharding spreads the load as each shard is a row.

You determine which table to write to by current datetime, e.g. hour of
day, day of week, etc. and use the modulus based upon, e.g. every 5 hours,
every 3 days, etc. So you are only writing to 1 table at a time. Usually I
derive the datetime from the timeuuid so all is consistent. Within your
modulus range, you can truncate currently unused tables so they are ready
for reuse - truncation is overall much cheaper than deletion.

You can retrieve 'the latest' updates by doing a query like this - the
table is determined by current time, but possibly you will want to append
results from the 'prior' table if you do not satisfy your limit:

select uuid, message from recent_updates_xx where shard in ('00', '01',
...) order by uuid desc limit 10; -- get the latest 10

This is a very efficient query. You can improve efficiency somewhat by
altering the storage order in the table creates.

ml








On Fri, Nov 8, 2013 at 6:02 PM, Jacob Rhoden <ja...@me.com> wrote:

> I need to be able to show the most recent changes that have occurred in a
> system, I understand inserting every update into a tracking table and
> deleting old updates may not be great, as I may end up creating millions of
> tombstones. i.e. don't do this:
>
> create table recent_updates(uuid timeuuid primary key, message text);
> insert into recent_updates(now(), 'the message');
> insert into recent_updates(now(), 'the message');
> ....
> insert into recent_updates(now(), 'the message');
> // delete all but the most recent ten messages.
>
> So how do people solve it? The following option occurs to me, but I am not
> sure if its the best option:
>
> create table recent_updates(record int primary key, message text, uuid
> timeuuid);
> insert into recent_updates(1, 'the message', now());
> insert into recent_updates(2, 'the message', now());
> ....
> insert into recent_updates(10, 'the message', now());
> // rotate back to 1
> insert into recent_updates(1, 'the message', now());
>
> Doing it this way would require a query to find out what number in the
> sequence we are up to.
>
> Best regards,
> Jacob
>