Posted to user@cassandra.apache.org by Steven Mac <ug...@hotmail.com> on 2011/01/11 19:07:05 UTC

Advice wanted on modeling

Hi,

I've been experimenting quite a bit with Cassandra and think I'm getting to understand it, but I would like some advice on modeling my data in Cassandra for an application I'm developing.

The application will have a large number of records, with the records consisting of a fixed part and a number (n) of periodic parts.
* The fixed part is updated occasionally.
* The periodic parts are never updated, but a new one is added every 5 to 10 minutes. Only the last n periodic parts need to be kept, so that the oldest one can be deleted after adding a new part.
* The records will always be read completely (meaning fixed part and all periodic parts). Reads are less frequent than writes.
The application will run continuously for at least a few weeks, so there will be many, many stale periodic parts; I'm a bit worried about disk consumption and compactions.

With respect to modeling the above in Cassandra, I'd welcome any insights into the following alternatives:

1) For every period, add a new column to each record and delete the oldest column with a batch_mutate. This obviously causes many tombstones.
2) For every period, overwrite the oldest column for each record with the new one (cyclic/modulo behaviour). AFAIK this does not cause any tombstones, but the overwritten values will still accumulate in the SSTables until they are compacted away.
3) (0.7 only) For every period, create a new CF and add columns to it with a batch_mutate and drop the oldest CF. The obsolete data can be cleaned up immediately, but I'm not sure if this is proper/recommended use of dynamic CFs.
4) Don't use Cassandra at all and investigate other storage solutions. Suggestions would be welcome if you favour this approach.
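To illustrate, option 2's cyclic overwrite would behave roughly as below (a plain in-memory sketch with made-up names; a real client would issue the same column writes via batch_mutate):

```python
# Sketch of option 2: keep exactly n periodic parts per record by
# overwriting column slots in cyclic (modulo) order. A plain dict
# stands in for a Cassandra row here.

N_SLOTS = 3  # number of periodic parts to retain (n)

def periodic_column(period):
    """Column name for a given period number: cycles through n slots."""
    return "slot_%d" % (period % N_SLOTS)

def write_periodic(row, period, value):
    """Overwrite the oldest slot with the new periodic part.

    No column is ever deleted, so no tombstone is created; the row
    always holds at most N_SLOTS periodic columns.
    """
    row[periodic_column(period)] = (period, value)
    return row

row = {}
for p in range(5):  # five periods against three slots
    write_periodic(row, p, "part-%d" % p)

# Only the three most recent parts (periods 2, 3, 4) survive.
print(sorted(v[0] for v in row.values()))  # -> [2, 3, 4]
```

Since column names repeat modulo n, each new part replaces the oldest one in place and no delete is ever issued.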

Also, I'm wondering whether I should put the fixed and periodic parts together in one Super CF, or separate the fixed part into one CF and the periodic parts into another. Since I'll always read all of a record's data at once, my preference would be a Super CF, but I'm open to anyone wanting to talk me out of this ;-)
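In code terms, the single-Super-CF layout I have in mind would look roughly like this (illustrative names only; each record key holds a "fixed" super column plus one super column per periodic part):

```python
# Sketch of the single-Super-CF layout: one row per record, so a single
# get(key) returns the fixed part and all periodic parts together.
# Column/record names below are made up for illustration.

record = {
    "fixed": {"name": "sensor-42", "location": "hall-A"},
    "period_0001": {"temp": "21.3", "ts": "2011-01-11T19:00"},
    "period_0002": {"temp": "21.6", "ts": "2011-01-11T19:05"},
}

def read_record(row):
    """One row read yields the fixed part and all periodic parts."""
    fixed = row["fixed"]
    periodic = {k: v for k, v in row.items() if k != "fixed"}
    return fixed, periodic

fixed, periodic = read_record(record)
print(len(periodic))  # -> 2
```

With two separate CFs, the same read would instead need one fetch per CF.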

Thanks, Steven.

RE: Advice wanted on modeling

Posted by Steven Mac <ug...@hotmail.com>.
> Date: Thu, 13 Jan 2011 01:29:33 +0100
> Subject: Re: Advice wanted on modeling
> From: peter.schuller@infidyne.com
> To: user@cassandra.apache.org
> 
> > The application will have a large number of records, with the records
> > consisting of a fixed part and a number (n) of periodic parts.
> > * The fixed part is updated occasionally.
> > * The periodic parts are never updated, but a new one is added every 5 to 10
> > minutes. Only the last n periodic parts need to be kept, so that the oldest
> > one can be deleted after adding a new part.
> > * The records will always be read completely (meaning fixed part and all
> > periodic parts). Reads are less frequent than writes.
> > The application will run continuously for at least a few weeks, so
> > there will be many, many stale periodic parts; I'm a bit worried about
> > disk consumption and compactions.
> 
> I was going to hit send on a partial recommendation but realized I
> don't really have enough information given that you seem to be making
> pretty specific optimizations.
> 
> You say writes are more frequent than reads. To what extent - are
> reads *very* infrequent to the point that the performance of the reads
> are almost completely irrelevant?

What exactly is a write? Is it a single record update, or a batch of
record updates executed in one operation? In my case I'm batching about
a thousand record updates (new periodic parts) into a single
batch_mutate. A read would fetch all parts of a single record. In the
text below I use the term update to mean a record update.

I typically expect a few reads per thousand updates (<1%), although
read pressure will vary considerably over time. I don't expect more
than a hundred reads per thousand updates (about 10%). Read performance
is not irrelevant, but it is definitely subordinate to write
performance, which is crucial (and one of the reasons I selected
Cassandra).

> You seem worried about tombstones and data size. Is the issue that
> you're expecting huge amounts of data and disk space/compaction
> frequency is an issue?

Yes, I am expecting huge amounts of data and without compaction I would
soon (few days to a week) run out of disk space.

> Are you expecting write load to be high such that performance of
> writes (and compaction) is a concern, or is it mostly about slowly
> building up huge amounts of data that you want to be compact on disk?

I'm not sure here. My write load is high, estimated at a thousand records
per second (batched, of course).

Re: Advice wanted on modeling

Posted by Peter Schuller <pe...@infidyne.com>.
> The application will have a large number of records, with the records
> consisting of a fixed part and a number (n) of periodic parts.
> * The fixed part is updated occasionally.
> * The periodic parts are never updated, but a new one is added every 5 to 10
> minutes. Only the last n periodic parts need to be kept, so that the oldest
> one can be deleted after adding a new part.
> * The records will always be read completely (meaning fixed part and all
> periodic parts). Reads are less frequent than writes.
> The application will run continuously for at least a few weeks, so
> there will be many, many stale periodic parts; I'm a bit worried about
> disk consumption and compactions.

I was going to hit send on a partial recommendation but realized I
don't really have enough information given that you seem to be making
pretty specific optimizations.

You say writes are more frequent than reads. To what extent - are
reads *very* infrequent to the point that the performance of the reads
are almost completely irrelevant?

You seem worried about tombstones and data size. Is the issue that
you're expecting huge amounts of data and disk space/compaction
frequency is an issue?

Are you expecting write load to be high such that performance of
writes (and compaction) is a concern, or is it mostly about slowly
building up huge amounts of data that you want to be compact on disk?

-- 
/ Peter Schuller

Question about fat rows

Posted by Héctor Izquierdo Seliva <iz...@strands.com>.
Hi everyone.

I have a question about data modeling in my application. I have to store
a customer's items, and I can do it either as one fat row per customer,
where the column name is the item id and the value is a JSON-serialized
object, or as one row per item with the same layout. This data is
updated almost every day, sometimes several times per day.

My question is: which scheme will give me better read performance? I was
hoping to save on row keys so that I could cache all the keys in this
CF, but I'm worried about read performance on frequently updated fat
rows.
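In code terms, the two layouts I'm considering look roughly like this (a plain dict sketch with made-up names, not real client calls):

```python
# Sketch of the two candidate layouts, modeled with dicts.
# Layout A ("fat row"): one row per customer, one column per item.
# Layout B: one row per item, keyed customer:item.
import json

items = {"item-1": {"price": 10}, "item-2": {"price": 12}}

fat_rows = {"customer-7": {iid: json.dumps(obj)
                           for iid, obj in items.items()}}

item_rows = {"customer-7:%s" % iid: json.dumps(obj)
             for iid, obj in items.items()}

def read_all_fat(rows, customer):
    """Layout A: all items come back in a single row read."""
    return {iid: json.loads(v) for iid, v in rows[customer].items()}

def read_all_per_item(rows, customer):
    """Layout B: a multiget (or key scan) over the customer:* keys."""
    prefix = customer + ":"
    return {k[len(prefix):]: json.loads(v)
            for k, v in rows.items() if k.startswith(prefix)}

print(read_all_fat(fat_rows, "customer-7") ==
      read_all_per_item(item_rows, "customer-7"))  # -> True
```

Both layouts return the same data; the difference is one row read versus many row reads, and how many row keys end up in the key cache.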

Any help or hints would be appreciated.

Thanks!