
Posted to user@cassandra.apache.org by Steven Mac <ug...@hotmail.com> on 2011/01/17 20:05:38 UTC

Super CF or two CFs?

How can I best map an object containing two maps, one of which is updated very frequently and the other only occasionally?

a) As one super CF, with each map in a separate supercolumn and the map entries being the subcolumns?
b) As two CFs, one for each map.

I'd like to discuss the why behind a choice, in order to learn about the impact of a design choice on performance, SStable size/disk usage, compactions, etc.

Regards, Steven.

PS: Objects will always be read as a whole.

RE: Super CF or two CFs?

Posted by Steven Mac <ug...@hotmail.com>.

Thanks for the clarification.

Hugo & Steven.

Subject: Re: Super CF or two CFs?
From: aaron@thelastpickle.com
Date: Tue, 18 Jan 2011 23:51:25 +1300
To: user@cassandra.apache.org

Sorry, I was not suggesting a super CF is better in the first para; I think it applies to any CF.
The role of compaction is to (among other things) reduce the number of SSTables for each CF. The logical endpoint of this process would be a single file for each CF, giving the lowest possible IO. The volatility of your data (overwrites and new columns for a row) fights against this process. In reality it will not get to that end state. Even in the best case I think it will only go down to 3 sstables. See http://wiki.apache.org/cassandra/MemtableSSTable

If you do have some data that is highly volatile and you have performance problems, then changing the compaction thresholds is a recommended approach, I think. See the comments in cassandra.yaml.
My argument is for you to keep data in one CF if you want to read it together. As always, store the data to serve the read requests. Do some tests and see where your bottlenecks may be for your HW and usage. I may be wrong.
IMHO in this discussion Super or Standard CF will make little performance difference, other than the super CF limitations mentioned.
Aaron

Re: Super CF or two CFs?

Posted by Aaron Morton <aa...@thelastpickle.com>.

Sorry, I was not suggesting a super CF is better in the first para; I think it applies to any CF.

The role of compaction is to (among other things) reduce the number of SSTables for each CF. The logical endpoint of this process would be a single file for each CF, giving the lowest possible IO. The volatility of your data (overwrites and new columns for a row) fights against this process. In reality it will not get to that end state. Even in the best case I think it will only go down to 3 sstables. See http://wiki.apache.org/cassandra/MemtableSSTable

If you do have some data that is highly volatile and you have performance problems, then changing the compaction thresholds is a recommended approach, I think. See the comments in cassandra.yaml.
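
As an illustration of why the sstable count rarely reaches one, here is a toy Python model of threshold-based compaction. It is only a sketch: it assumes a minimum compaction threshold of 4 and assumes that only sstables of similar size (the same "tier") are merged, which simplifies what Cassandra really does. The names MIN_THRESHOLD and sstables_after are made up for this example.

```python
from collections import Counter

MIN_THRESHOLD = 4  # assumed: compact once this many similar-sized sstables exist


def sstables_after(flushes, min_threshold=MIN_THRESHOLD):
    """Return a Counter {tier: sstable count} after a number of memtable flushes."""
    tiers = Counter()
    for _ in range(flushes):
        tiers[0] += 1  # each flush writes one new tier-0 sstable
        tier = 0
        # Merge whenever a tier fills up; the merged file joins the next tier.
        while tiers[tier] >= min_threshold:
            tiers[tier] -= min_threshold
            tiers[tier + 1] += 1
            tier += 1
    return +tiers  # unary plus drops tiers that are now empty


for n in (3, 4, 15, 16, 63):
    layout = sstables_after(n)
    print(n, "flushes ->", sum(layout.values()), "sstables", dict(layout))
```

In this model up to three files can sit in every tier without triggering a merge, so a CF that keeps receiving writes only occasionally collapses to a single file. Lowering the threshold trades more compaction work for fewer files per read.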

My argument is for you to keep data in one CF if you want to read it together. As always, store the data to serve the read requests. Do some tests and see where your bottlenecks may be for your HW and usage. I may be wrong.

IMHO in this discussion Super or Standard CF will make little performance difference, other than the super CF limitations mentioned.

Aaron

On 18/01/2011, at 11:14 PM, Steven Mac <ug...@hotmail.com> wrote:

> Thanks for the answer. It provides me the insight I'm looking for.
> 
> However, I'm also a bit confused as your first paragraph seems to indicate that using a SCF is better, whereas the last sentence states just the opposite. Do I interpret correctly that this is because of the compactions that put all non-volatile data together in one sstable, leading to a compact sstable if the non-volatile data is put into a separate CF? Can this then be generalised into a rule of thumb to separate non-volatile data from volatile data into separate CFs, or am I going too far then?
> 
> I will definitely be trying out both suggestions and post my findings.
> 
> Hugo.

RE: Super CF or two CFs?

Posted by Steven Mac <ug...@hotmail.com>.

Thanks for the answer. It provides me the insight I'm looking for.

However, I'm also a bit confused as your first paragraph seems to indicate that using a SCF is better, whereas the last sentence states just the opposite. Do I interpret correctly that this is because of the compactions that put all non-volatile data together in one sstable, leading to a compact sstable if the non-volatile data is put into a separate CF? Can this then be generalised into a rule of thumb to separate non-volatile data from volatile data into separate CFs, or am I going too far then?

I will definitely be trying out both suggestions and post my findings.

Hugo.

Subject: Re: Super CF or two CFs?
From: aaron@thelastpickle.com
Date: Tue, 18 Jan 2011 21:54:25 +1300
To: user@cassandra.apache.org

With regard to overwrites, and assuming you always want to get all the data for a stock ticker: any read on the volatile data will potentially touch many sstables. This IO is unavoidable to read this data, so we may as well read as many cols as possible at this time. Whereas if you split the data into two CFs you would incur all the IO for the volatile data plus IO for the non-volatile data, and have to make two calls. (Or use different keys and make a multiget_slice call; the IO argument still stands.)
Thanks to compaction, less volatile data, say cols that are written once a day, week or month, will tend to accrete into fewer sstables. To that end it may make sense to schedule compactions to run after weekly bulk operations. Also take a look at the per-CF compaction thresholds.
I'd recommend trying one standard CF (with the quotes packed as suggested) to start with; run some tests and let us know how you go. There are some small penalties to using super CFs, see the limitations page on the wiki.
Hope that helps.
Aaron

Re: Super CF or two CFs?

Posted by Aaron Morton <aa...@thelastpickle.com>.

With regard to overwrites, and assuming you always want to get all the data for a stock ticker: any read on the volatile data will potentially touch many sstables. This IO is unavoidable to read this data, so we may as well read as many cols as possible at this time. Whereas if you split the data into two CFs you would incur all the IO for the volatile data plus IO for the non-volatile data, and have to make two calls. (Or use different keys and make a multiget_slice call; the IO argument still stands.)
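
The IO argument can be made concrete with a small Python sketch. Everything in it is hypothetical: each set stands for the row keys present in one sstable, the layouts are invented, and the single-CF layout assumes the stable columns were flushed together with one of the volatile writes. It only counts files touched and ignores caches, bloom filters and the like.

```python
def sstables_touched(sstables, row_key):
    """Count the sstables that hold at least one fragment of the row."""
    return sum(1 for keys in sstables if row_key in keys)


# Option 1 (assumed layout): one CF, stable and volatile columns share files.
combined_cf = [{"AAPL", "IBM", "MSFT"}, {"AAPL"}, {"AAPL", "MSFT"}, {"AAPL"}]

# Option 2 (assumed layout): the volatile CF is just as fragmented,
# and the stable CF adds a file of its own.
volatile_cf = [{"AAPL", "IBM"}, {"AAPL"}, {"AAPL", "MSFT"}, {"AAPL"}]
stable_cf = [{"AAPL", "IBM", "MSFT"}]

one_cf_reads = sstables_touched(combined_cf, "AAPL")
two_cf_reads = (sstables_touched(volatile_cf, "AAPL")
                + sstables_touched(stable_cf, "AAPL"))

print("one CF :", one_cf_reads, "sstables, 1 call")
print("two CFs:", two_cf_reads, "sstables, 2 calls")
```

Under these assumptions the split layout never reads fewer files for a whole-object read, because the volatile part has to be read either way.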

Thanks to compaction, less volatile data, say cols that are written once a day, week or month, will tend to accrete into fewer sstables. To that end it may make sense to schedule compactions to run after weekly bulk operations. Also take a look at the per-CF compaction thresholds.

I'd recommend trying one standard CF (with the quotes packed as suggested) to start with; run some tests and let us know how you go. There are some small penalties to using super CFs, see the limitations page on the wiki.

Hope that helps.
Aaron

On 18/01/2011, at 9:29 PM, Steven Mac <ug...@hotmail.com> wrote:

> Some of the fields are indeed written in one shot, but others (such as label and categories) are added later, so I think the question still stands.
> 
> Hugo.

RE: Super CF or two CFs?

Posted by Steven Mac <ug...@hotmail.com>.

Some of the fields are indeed written in one shot, but others (such as label and categories) are added later, so I think the question still stands.

Hugo.

From: driftx@gmail.com
Date: Mon, 17 Jan 2011 18:47:28 -0600
Subject: Re: Super CF or two CFs?
To: user@cassandra.apache.org

On Mon, Jan 17, 2011 at 5:12 PM, Steven Mac <ug...@hotmail.com> wrote:

I guess I was maybe trying to simplify the question too much. In reality I do not have one volatile part, but multiple ones (say all trading data of day). Each would be a supercolumn identified by the time slot, with the individual fields as subcolumns.

If you're always going to write these attributes in one shot, then just serialize them and use a simple CF, there's no need for a SCF.
-Brandon

Re: Super CF or two CFs?

Posted by Brandon Williams <dr...@gmail.com>.

On Mon, Jan 17, 2011 at 5:12 PM, Steven Mac <ug...@hotmail.com> wrote:

>  I guess I was maybe trying to simplify the question too much. In reality I
> do not have one volatile part, but multiple ones (say all trading data of
> day). Each would be a supercolumn identified by the time slot, with the
> individual fields as subcolumns.
>

If you're always going to write these attributes in one shot, then just
serialize them and use a simple CF, there's no need for a SCF.
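
A minimal Python sketch of that idea, assuming JSON as the serialization format (the thread does not name one) and using a plain dict as a stand-in for one row of a standard CF. The helper names pack_quote and unpack_quote and the sample values are invented for illustration.

```python
import json


def pack_quote(fields):
    """Serialize all attributes of one time slot into a single column value."""
    return json.dumps(fields, sort_keys=True, separators=(",", ":")).encode("utf-8")


def unpack_quote(value):
    """Turn a stored column value back into a dict of attributes."""
    return json.loads(value.decode("utf-8"))


# Stand-in for the row of one stock symbol: column name -> column value.
row = {}
row["2011-01-17T17:10"] = pack_quote({"price": 101.25, "volume": 1200})
row["2011-01-17T17:15"] = pack_quote({"price": 101.5, "volume": 800})

print(unpack_quote(row["2011-01-17T17:10"]))
```

The trade-off is that a single attribute can no longer be updated on its own: the whole packed value has to be rewritten, which is why this fits attributes that are written in one shot.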

-Brandon

RE: Super CF or two CFs?

Posted by Steven Mac <ug...@hotmail.com>.

I guess I was maybe trying to simplify the question too much. In reality I do not have one volatile part, but multiple ones (say all trading data of a day). Each would be a supercolumn identified by the time slot, with the individual fields as subcolumns.

Of course, I could prefix the time slot identifier to the field names and make do with a normal CF, but couldn't this be done for any super column? In other words, why have it at all?
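
To show what such prefixing could look like, here is a small Python sketch. The "|" separator, the helper names flatten and slice_slot, and the sample data are all assumptions made for this example; nested dicts stand in for supercolumns and a flat dict for a standard CF row.

```python
SEP = "|"  # assumed separator; it must not occur in time slot identifiers


def flatten(super_columns):
    """Map {slot: {field: value}} onto flat column names 'slot|field'."""
    return {
        f"{slot}{SEP}{field}": value
        for slot, fields in super_columns.items()
        for field, value in fields.items()
    }


def slice_slot(columns, slot):
    """Return the fields of one time slot, like reading a single supercolumn."""
    prefix = slot + SEP
    return {
        name[len(prefix):]: value
        for name, value in sorted(columns.items())
        if name.startswith(prefix)
    }


trading_day = {
    "2011-01-17T17:10": {"price": 101.25, "volume": 1200},
    "2011-01-17T17:15": {"price": 101.5, "volume": 800},
}
flat_row = flatten(trading_day)
print(sorted(flat_row))
print(slice_slot(flat_row, "2011-01-17T17:15"))
```

Because column names are kept sorted, all fields of a slot end up next to each other, so reading one slot becomes a contiguous range of columns.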

Steven.

> Date: Mon, 17 Jan 2011 22:58:14 +0000
> Subject: Re: Super CF or two CFs?
> From: stephen.alan.connolly@gmail.com
> To: user@cassandra.apache.org
> 
> On 17 January 2011 22:36, Steven Mac <ug...@hotmail.com> wrote:
> > Sure, consider stock data, where the stock symbol is the row key. The stock
> > data consists of a rather stable part and a very volatile part, both of
> > which would be a super column. The stable super column would contain
> > subcolumns such as company name, address, and some annual or quarterly data.
> > The volatile super column would contain periodic stock data, such as current
> > price, last trade times, volumes, buyers, sellers, etc.
> >
> > The volatile super columns would be updated every few minutes, many rows at
> > once using a single batch_mutate. The data would be read using a get on a
> > single row key, returning both supercolumns and all subcolumns.
> >
> > The data could also be split over two column families, one for the stable
> > part and one for the volatile part. The updates would be the same, while a
> > read would require two get operations.
> 
> I'm not seeing why you need to use supercolumns for this at all.
> 
> Standard columns would seem just fine in this case (as long as you
> have good naming for your columns)
> 
> And you probably only need one column family... but people more expert
> than me could advise better...
> 
> I guess the question I have is why you feel the solution should
> involve supercolumns
> 
> -Stephen

Re: Super CF or two CFs?

Posted by Stephen Connolly <st...@gmail.com>.

On 17 January 2011 22:36, Steven Mac <ug...@hotmail.com> wrote:
> Sure, consider stock data, where the stock symbol is the row key. The stock
> data consists of a rather stable part and a very volatile part, both of
> which would be a super column. The stable super column would contain
> subcolumns such as company name, address, and some annual or quarterly data.
> The volatile super column would contain periodic stock data, such as current
> price, last trade times, volumes, buyers, sellers, etc.
>
> The volatile super columns would be updated every few minutes, many rows at
> once using a single batch_mutate. The data would be read using a get on a
> single row key, returning both supercolumns and all subcolumns.
>
> The data could also be split over two column families, one for the stable
> part and one for the volatile part. The updates would be the same, while a
> read would require two get operations.

I'm not seeing why you need to use supercolumns for this at all.

Standard columns would seem just fine in this case (as long as you
have good naming for your columns)

And you probably only need one column family... but people more expert
than me could advise better...

I guess the question I have is why you feel the solution should
involve supercolumns

-Stephen

RE: Super CF or two CFs?

Posted by Steven Mac <ug...@hotmail.com>.

Sure, consider stock data, where the stock symbol is the row key. The stock data consists of a rather stable part and a very volatile part, both of which would be a super column. The stable super column would contain subcolumns such as company name, address, and some annual or quarterly data. The volatile super column would contain periodic stock data, such as current price, last trade times, volumes, buyers, sellers, etc.

The volatile super columns would be updated every few minutes, many rows at once using a single batch_mutate. The data would be read using a get on a single row key, returning both supercolumns and all subcolumns.

The data could also be split over two column families, one for the stable part and one for the volatile part. The updates would be the same, while a read would require two get operations.
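
The two layouts can be sketched in Python with nested dicts standing in for column families. The symbol "XMPL", the field values and the function names read_super and read_split are hypothetical; the point is only that both layouts return the same object, with one lookup in the first case and two in the second.

```python
# Layout a) one super CF: row key -> supercolumn -> subcolumn -> value
super_cf = {
    "XMPL": {
        "stable": {"name": "Example Corp", "address": "1 Example Street"},
        "volatile": {"price": 101.25, "volume": 1200},
    }
}

# Layout b) two standard CFs sharing the same row key
stable_cf = {"XMPL": {"name": "Example Corp", "address": "1 Example Street"}}
volatile_cf = {"XMPL": {"price": 101.25, "volume": 1200}}


def read_super(symbol):
    """One get: both supercolumns come back together."""
    return super_cf[symbol]


def read_split(symbol):
    """Two gets, one per column family, merged by the client."""
    return {"stable": stable_cf[symbol], "volatile": volatile_cf[symbol]}


print(read_super("XMPL") == read_split("XMPL"))
```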

Regards, Steven.

Date: Mon, 17 Jan 2011 12:20:46 -0800
Subject: Re: Super CF or two CFs?
From: daveviner@gmail.com
To: user@cassandra.apache.org

can you give an example of the data and how you'd access it?
what would your expected columns (and/or supercolumns) be?

Dave Viner

Re: Super CF or two CFs?

Posted by Dave Viner <da...@gmail.com>.

can you give an example of the data and how you'd access it?
what would your expected columns (and/or supercolumns) be?

Dave Viner

On Mon, Jan 17, 2011 at 11:05 AM, Steven Mac <ug...@hotmail.com> wrote:

>  How can I best map an object containing two maps, one of which is updated
> very frequently and the other only occasionally?
>
> a) As one super CF, with each map in a separate supercolumn and the map
> entries being the subcolumns?
> b) As two CFs, one for each map.
>
> I'd like to discuss the why behind a choice, in order to learn about the
> impact of a design choice on performance, SStable size/disk usage,
> compactions, etc.
>
> Regards, Steven.
>
> PS: Objects will always be read as a whole.
>