Posted to user@cassandra.apache.org by Yiming Sun <yi...@gmail.com> on 2012/03/28 22:40:47 UTC

data size difference between supercolumn and regular column

Hi,

We are trying to estimate the amount of storage we need for a production
cassandra cluster.  While doing the calculation, I noticed a very dramatic
difference in the amount of disk space used by the cassandra data files.

Our previous setup consisted of a single-node cassandra 0.8.x instance with
no replication; the data was stored using supercolumns, and the data files
totaled about 534GB on disk.

A few weeks ago, I put together a cluster of 3 nodes running cassandra 1.0
with a replication factor of 2, with the data flattened out and stored
using regular columns.  The aggregate data file size is only 488GB (which
would be 244GB with no replication).

This is a very dramatic reduction in storage needs, and certainly good news
for how much storage we need to provision.  However, because the reduction
is so large, I would like to make sure the number is correct before
submitting it, and to understand why there is such a difference.  I know
cassandra 1.0 supports data compression, but does the schema change from
supercolumns to regular columns also help reduce storage usage?  Thanks.

-- Y.

Re: data size difference between supercolumn and regular column

Posted by Yiming Sun <yi...@gmail.com>.
Thanks for the advice, Maki, especially on the ulimit!  Yes, we will play
with the configuration and figure out some optimal sstable size.

-- Y.


Re: data size difference between supercolumn and regular column

Posted by Watanabe Maki <wa...@gmail.com>.
LeveledCompaction will use less disk space (load), but needs more IO.
If your write traffic is too high for your disks, you will have many pending compaction tasks and a large number of sstables waiting to be compacted.
Also, the default sstable_size_in_mb (5MB) will be too small for a large data set. You had better run test iterations with different size configurations.
Don't forget to raise the limit on the number of file descriptors, and monitor tpstats and iostat.
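For example, switching a column family to leveled compaction with a larger sstable size might look like the sketch below. This is only an illustration: the keyspace "myks", the column family "mycf", the 10MB size and the 32768 descriptor limit are made-up values, and the exact cassandra-cli syntax should be checked against your 1.0.x release.

    # assumes cassandra-cli accepts statements on stdin; otherwise put them in a file and use -f
    cassandra-cli -h localhost <<'EOF'
    use myks;
    update column family mycf
      with compaction_strategy = 'LeveledCompactionStrategy'
      and compaction_strategy_options = {sstable_size_in_mb : 10};
    EOF

    # raise the open-file limit for the user running cassandra, before starting it
    ulimit -n 32768

    # watch pending compactions and disk utilisation while loading data
    nodetool -h localhost tpstats
    nodetool -h localhost compactionstats
    iostat -x 5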

maki

From iPhone



Re: data size difference between supercolumn and regular column

Posted by Yiming Sun <yi...@gmail.com>.
Cool, I will look into this new leveled compaction strategy and give it a
try.

BTW, Aaron, I think the last word of your message was meant to be
"compression", correct?

-- Y.


Re: data size difference between supercolumn and regular column

Posted by Tamar Fraenkel <ta...@tok-media.com>.
Do you have a good reference for maintenance scripts for a Cassandra ring?
Thanks,
Tamar Fraenkel
Senior Software Engineer, TOK Media


tamar@tok-media.com
Tel:   +972 2 6409736
Mob:  +972 54 8356490
Fax:   +972 2 5612956
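
Not a canonical reference, but as a rough sketch of what such maintenance often boils down to: crontab entries that run nodetool repair within gc_grace_seconds and clean up after topology changes. The keyspace name "myks" and the schedule are invented for the example, and runs should be staggered so only one node repairs at a time.

    # weekly anti-entropy repair (keep the interval within gc_grace_seconds)
    0 3 * * 0  nodetool -h localhost repair myks
    # remove data the node no longer owns after ring changes
    0 4 * * 3  nodetool -h localhost cleanup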






Re: data size difference between supercolumn and regular column

Posted by aaron morton <aa...@thelastpickle.com>.
If you have a workload with overwrites you will end up with some data needing compaction. Running a nightly manual compaction would remove this, but it will also soak up some IO so it may not be the best solution. 

I do not know if Leveled compaction would result in a smaller disk load for the same workload. 

I agree with other people, turn on compaction. 

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com
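
As a concrete sketch of the nightly manual compaction option (illustrative only: the keyspace "myks", the column family "mycf" and the 02:00 off-peak slot are placeholders):

    # crontab entry: force a major compaction of one column family every night at 02:00
    0 2 * * *  nodetool -h localhost compact myks mycf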


Re: data size difference between supercolumn and regular column

Posted by Yiming Sun <yi...@gmail.com>.
Yup Jeremiah, I learned a hard lesson on how cassandra behaves when it runs
out of disk space :-S.    I didn't try compression, but when it ran out of
disk space, or came close to running out, compaction would fail because it
needs free space to create temporary data files.

I shall get a tattoo that says keep it around 50% -- this is a valuable tip.

-- Y.
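
A trivial guard along those lines, as a sketch only (the 50% threshold and the data directory /var/lib/cassandra/data are assumptions to adjust for your setup):

    #!/bin/sh
    # warn when the cassandra data volume goes past 50% used
    USED=$(df -P /var/lib/cassandra/data | awk 'NR==2 {sub(/%/, "", $5); print $5}')
    if [ "$USED" -gt 50 ]; then
        echo "cassandra data volume at ${USED}% used -- plan compaction/compression or more capacity" >&2
    fi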


Re: data size difference between supercolumn and regular column

Posted by Jeremiah Jordan <JE...@morningstar.com>.
Is that 80% with compression?  If not, the first thing to do is turn on compression.  Cassandra doesn't behave well when it runs out of disk space.  You really want to try to stay around 50%; 60-70% works, but only if the data is spread across multiple column families, and even then you can run into issues when doing repairs.

-Jeremiah
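
For reference, enabling compression on an existing 1.0 column family is a schema change along the lines of the sketch below ("myks" and "mycf" are placeholders, and the chunk length is just an example). Existing sstables only become compressed as they are rewritten, e.g. by normal compaction or, on recent 1.0.x releases, nodetool upgradesstables.

    # assumes cassandra-cli accepts statements on stdin; otherwise put them in a file and use -f
    cassandra-cli -h localhost <<'EOF'
    use myks;
    update column family mycf
      with compression_options = {sstable_compression : SnappyCompressor, chunk_length_kb : 64};
    EOF

    # optionally rewrite existing sstables now so they pick up compression
    nodetool -h localhost upgradesstables myks mycf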



Re: data size difference between supercolumn and regular column

Posted by Yiming Sun <yi...@gmail.com>.
Thanks Aaron.  Well, I guess it is possible the data files from supercolumns
could have been reduced in size after compaction.

This brings up yet another question.  Say I am on a shoestring budget and
can only put together a cluster with very limited storage space.  The first
iteration of pushing data into cassandra would drive the disk usage up into
the 80% range.  As time goes by, there will be updates to the data, and
many columns will be overwritten.  If I just push the updates in, the disks
will run out of space on all of the cluster nodes.  What would be the best
way to handle such a situation if I cannot buy larger disks?  Do I need
to delete the rows/columns that are going to be updated, do a compaction,
and then insert the updates?  Or is there a better way?  Thanks

-- Y.
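
For what it's worth, and anticipating Aaron's reply elsewhere in this thread: an explicit delete should not be needed, since an overwrite supersedes the old column and the space is reclaimed when compaction rewrites the sstables. A sketch with made-up names:

    # overwrite in place from cassandra-cli (no prior delete), then optionally
    # force an off-peak compaction to reclaim the space of the old values
    cassandra-cli -h localhost <<'EOF'
    use myks;
    set mycf[utf8('rowkey1')][utf8('col1')] = utf8('updated value');
    EOF
    nodetool -h localhost compact myks mycf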


Re: data size difference between supercolumn and regular column

Posted by aaron morton <aa...@thelastpickle.com>.
> does cassandra 1.0 perform some default compression? 
No. 

The on disk size depends to some degree on the workload. 

If there are a lot of overwrites or deletes you may have rows/columns that need to be compacted. You may have some big old SSTables that have not been compacted for a while. 

There is some overhead involved in the super columns: the super column name, the length of the name, and the number of columns.

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com


Re: data size difference between supercolumn and regular column

Posted by Yiming Sun <yi...@gmail.com>.
Actually, after I read an article on cassandra 1.0 compression just now (
http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-compression), I
am more puzzled.  In our schema, we didn't specify any compression options
-- so does cassandra 1.0 perform some default compression, or is the data
reduction purely due to the schema change?  Thanks.

-- Y.
