Posted to user@cassandra.apache.org by John Sanda <jo...@gmail.com> on 2013/12/06 21:08:32 UTC

calculating sizes on disk

I am trying to do some disk capacity planning. I have been referring to the
DataStax docs[1] and this older blog post[2]. I have a column family with
the following:

row key - 4 bytes
column name - 8 bytes
column value - 8 bytes
max number of non-deleted columns per row - 20160

Is there an effective way to calculate the sizes (or at least a decent
approximation) of the bloom filters and partition indexes on disk?

[1] Calculating user data size <http://www.datastax.com/documentation/cassandra/1.2/webhelp/index.html?pagename=docs&version=1.2&file=index#cassandra/architecture/../../cassandra/architecture/architecturePlanningUserData_t.html>
[2] Cassandra Storage Sizing <http://btoddb-cass-storage.blogspot.com/>
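As a back-of-envelope answer for the bloom filter part, the textbook sizing formula m = -n * ln(p) / (ln 2)^2 bits for n keys at false-positive rate p gives a decent first approximation. This is only a sketch, not Cassandra's exact implementation: the BloomCalculations class rounds up to whole buckets per key, and the 0.01 false-positive rate assumed below may differ from your bloom_filter_fp_chance setting.

```python
import math

def bloom_filter_bytes(num_keys, fp_chance=0.01):
    # Textbook sizing: m = -n * ln(p) / (ln 2)^2 bits.
    # fp_chance=0.01 is an assumed target false-positive rate;
    # Cassandra rounds up to whole buckets per key, so treat this
    # as a lower-bound approximation of the on-disk filter size.
    bits = -num_keys * math.log(fp_chance) / (math.log(2) ** 2)
    return int(math.ceil(bits / 8))

# e.g. one million row keys at a 1% false-positive rate
print(bloom_filter_bytes(1_000_000))  # roughly 1.2 MB
```

Note the size depends only on the number of keys and the target false-positive rate, not on key length, which is why per-SSTable row counts matter so much for any estimate.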

-- 

- John

Re: calculating sizes on disk

Posted by John Sanda <jo...@gmail.com>.
I should have also mentioned that I have tried using the calculations from
the storage sizing post. My lack of success may be due to the post being
based on Cassandra 0.8, as well as my own incomplete understanding of how to
do some of the calculations.


On Fri, Dec 6, 2013 at 3:08 PM, John Sanda <jo...@gmail.com> wrote:

> I am trying to do some disk capacity planning. I have been referring to the
> DataStax docs[1] and this older blog post[2]. I have a column family with
> the following:
>
> row key - 4 bytes
> column name - 8 bytes
> column value - 8 bytes
> max number of non-deleted columns per row - 20160
>
> Is there an effective way to calculate the sizes (or at least a decent
> approximation) of the bloom filters and partition indexes on disk?
>
> [1] Calculating user data size<http://www.datastax.com/documentation/cassandra/1.2/webhelp/index.html?pagename=docs&version=1.2&file=index#cassandra/architecture/../../cassandra/architecture/architecturePlanningUserData_t.html>
> [2] Cassandra Storage Sizing <http://btoddb-cass-storage.blogspot.com/>
>
> --
>
> - John
>



-- 

- John

Re: calculating sizes on disk

Posted by Steven Siebert <sm...@gmail.com>.
Nice work, John. If you learn any more, please share.

S


On Sat, Dec 7, 2013 at 11:50 AM, John Sanda <jo...@gmail.com> wrote:

> I finally got the math right for the partition index after tracing through
> SSTableWriter.IndexWriter.append(DecoratedKey key, RowIndexEntry
> indexEntry). I should also note that I am working off of the source for
> 1.2.9. Here is the breakdown of what gets written to disk in the append()
> call (my keys are 4 bytes, while column names and values are both 8 bytes).
>
> // key
> key length - 2 bytes
> key - 4 bytes
>
> // index entry
> index entry position - 8 bytes
> index entry size - 4 bytes
>
> // when the index entry contains a columns index, the following entries will
> be written ceil(total_row_size / column_index_size_in_kb) times
> local deletion time - 4 bytes
> marked for delete at - 8 bytes
> columns index entry first name length - 2 bytes
> columns index entry first name - 8 bytes
> columns index entry last name length - 2 bytes
> columns index entry last name - 8 bytes
> columns index entry offset - 8 bytes
> columns index entry width - 8 bytes
>
> I also went through the serialization code for bloom filters, but I do not
> understand the math. Even with my slightly improved understanding, I am
> still uncertain about how effective any sizing analysis will be, since the
> number of rows and columns will vary per SSTable.
>
>
> On Fri, Dec 6, 2013 at 3:53 PM, John Sanda <jo...@gmail.com> wrote:
>
>> I have done that, but it only gets me so far because the cluster and the
>> app that manage it are run by third parties. Ideally, I would like to
>> provide my end users with a formula or heuristic for establishing some
>> sort of baseline that at least gives them a general idea for planning.
>> Generating data as you have suggested, and as I have done, is helpful,
>> but it is hard for users to extrapolate from that.
>>
>>
>> On Fri, Dec 6, 2013 at 3:47 PM, Jacob Rhoden <ja...@me.com> wrote:
>>
>>> Not sure what your end setup will be, but I would probably just spin up
>>> a cluster, fill it with typical data, and measure the size on disk.
>>>
>>> ______________________________
>>> Sent from iPhone
>>>
>>> On 7 Dec 2013, at 6:08 am, John Sanda <jo...@gmail.com> wrote:
>>>
>>> I am trying to do some disk capacity planning. I have been referring to
>>> the DataStax docs[1] and this older blog post[2]. I have a column family
>>> with the following:
>>>
>>> row key - 4 bytes
>>> column name - 8 bytes
>>> column value - 8 bytes
>>> max number of non-deleted columns per row - 20160
>>>
>>> Is there an effective way to calculate the sizes (or at least a decent
>>> approximation) of the bloom filters and partition indexes on disk?
>>>
>>> [1] Calculating user data size<http://www.datastax.com/documentation/cassandra/1.2/webhelp/index.html?pagename=docs&version=1.2&file=index#cassandra/architecture/../../cassandra/architecture/architecturePlanningUserData_t.html>
>>> [2] Cassandra Storage Sizing <http://btoddb-cass-storage.blogspot.com/>
>>>
>>> --
>>>
>>> - John
>>>
>>>
>>
>>
>> --
>>
>> - John
>>
>
>
>
> --
>
> - John
>

Re: calculating sizes on disk

Posted by John Sanda <jo...@gmail.com>.
I finally got the math right for the partition index after tracing through
SSTableWriter.IndexWriter.append(DecoratedKey key, RowIndexEntry
indexEntry). I should also note that I am working off of the source for
1.2.9. Here is the breakdown of what gets written to disk in the append()
call (my keys are 4 bytes, while column names and values are both 8 bytes).

// key
key length - 2 bytes
key - 4 bytes

// index entry
index entry position - 8 bytes
index entry size - 4 bytes

// when the index entry contains a columns index, the following entries will
be written ceil(total_row_size / column_index_size_in_kb) times
local deletion time - 4 bytes
marked for delete at - 8 bytes
columns index entry first name length - 2 bytes
columns index entry first name - 8 bytes
columns index entry last name length - 2 bytes
columns index entry last name - 8 bytes
columns index entry offset - 8 bytes
columns index entry width - 8 bytes
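The field breakdown above can be turned into a rough per-row estimate of the partition index size. The sketch below follows that field list literally; the parameter names and the 64 KB default for column_index_size_in_kb are my assumptions, and the real serializer may place the deletion fields slightly differently than the repeated-block reading used here.

```python
import math

def partition_index_bytes_per_row(key_len=4, name_len=8,
                                  row_size=0,
                                  column_index_size_kb=64):
    # Key: length (2) + key bytes, then index entry position (8) + size (4).
    total = 2 + key_len + 8 + 4
    threshold = column_index_size_kb * 1024
    if row_size > threshold:
        # Per the breakdown above: deletion info (4 + 8) plus
        # first/last name (2 + name_len each), offset (8) and
        # width (8), written once per index block.
        blocks = math.ceil(row_size / threshold)
        per_block = 4 + 8 + (2 + name_len) * 2 + 8 + 8
        total += blocks * per_block
    return total

# e.g. a row serialized to ~600 KB with the default 64 KB index blocks
print(partition_index_bytes_per_row(row_size=600 * 1024))  # 498
```

Multiplying this by the expected number of rows per SSTable gives a ballpark for the index component; as noted below, the row counts per SSTable are the hard part to pin down.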

I also went through the serialization code for bloom filters, but I do not
understand the math. Even with my slightly improved understanding, I am
still uncertain about how effective any sizing analysis will be, since the
number of rows and columns will vary per SSTable.


On Fri, Dec 6, 2013 at 3:53 PM, John Sanda <jo...@gmail.com> wrote:

> I have done that, but it only gets me so far because the cluster and the
> app that manage it are run by third parties. Ideally, I would like to
> provide my end users with a formula or heuristic for establishing some
> sort of baseline that at least gives them a general idea for planning.
> Generating data as you have suggested, and as I have done, is helpful,
> but it is hard for users to extrapolate from that.
>
>
> On Fri, Dec 6, 2013 at 3:47 PM, Jacob Rhoden <ja...@me.com> wrote:
>
>> Not sure what your end setup will be, but I would probably just spin up a
>> cluster, fill it with typical data, and measure the size on disk.
>>
>> ______________________________
>> Sent from iPhone
>>
>> On 7 Dec 2013, at 6:08 am, John Sanda <jo...@gmail.com> wrote:
>>
>> I am trying to do some disk capacity planning. I have been referring to
>> the DataStax docs[1] and this older blog post[2]. I have a column family
>> with the following:
>>
>> row key - 4 bytes
>> column name - 8 bytes
>> column value - 8 bytes
>> max number of non-deleted columns per row - 20160
>>
>> Is there an effective way to calculate the sizes (or at least a decent
>> approximation) of the bloom filters and partition indexes on disk?
>>
>> [1] Calculating user data size<http://www.datastax.com/documentation/cassandra/1.2/webhelp/index.html?pagename=docs&version=1.2&file=index#cassandra/architecture/../../cassandra/architecture/architecturePlanningUserData_t.html>
>> [2] Cassandra Storage Sizing <http://btoddb-cass-storage.blogspot.com/>
>>
>> --
>>
>> - John
>>
>>
>
>
> --
>
> - John
>



-- 

- John

Re: calculating sizes on disk

Posted by Tim Wintle <ti...@gmail.com>.
I have found in (limited) practice that it's fairly hard to estimate
due to compression and compaction behaviour. I think measuring and
extrapolating (with an understanding of the data structures) is the most
effective approach.

Tim

Sent from my phone
On 6 Dec 2013 20:54, "John Sanda" <jo...@gmail.com> wrote:

> I have done that, but it only gets me so far because the cluster and the
> app that manage it are run by third parties. Ideally, I would like to
> provide my end users with a formula or heuristic for establishing some
> sort of baseline that at least gives them a general idea for planning.
> Generating data as you have suggested, and as I have done, is helpful,
> but it is hard for users to extrapolate from that.
>
>
> On Fri, Dec 6, 2013 at 3:47 PM, Jacob Rhoden <ja...@me.com> wrote:
>
>> Not sure what your end setup will be, but I would probably just spin up a
>> cluster, fill it with typical data, and measure the size on disk.
>>
>> ______________________________
>> Sent from iPhone
>>
>> On 7 Dec 2013, at 6:08 am, John Sanda <jo...@gmail.com> wrote:
>>
>> I am trying to do some disk capacity planning. I have been referring to
>> the DataStax docs[1] and this older blog post[2]. I have a column family
>> with the following:
>>
>> row key - 4 bytes
>> column name - 8 bytes
>> column value - 8 bytes
>> max number of non-deleted columns per row - 20160
>>
>> Is there an effective way to calculate the sizes (or at least a decent
>> approximation) of the bloom filters and partition indexes on disk?
>>
>> [1] Calculating user data size<http://www.datastax.com/documentation/cassandra/1.2/webhelp/index.html?pagename=docs&version=1.2&file=index#cassandra/architecture/../../cassandra/architecture/architecturePlanningUserData_t.html>
>> [2] Cassandra Storage Sizing <http://btoddb-cass-storage.blogspot.com/>
>>
>> --
>>
>> - John
>>
>>
>
>
> --
>
> - John
>

Re: calculating sizes on disk

Posted by John Sanda <jo...@gmail.com>.
I have done that, but it only gets me so far because the cluster and the
app that manage it are run by third parties. Ideally, I would like to
provide my end users with a formula or heuristic for establishing some
sort of baseline that at least gives them a general idea for planning.
Generating data as you have suggested, and as I have done, is helpful,
but it is hard for users to extrapolate from that.


On Fri, Dec 6, 2013 at 3:47 PM, Jacob Rhoden <ja...@me.com> wrote:

> Not sure what your end setup will be, but I would probably just spin up a
> cluster, fill it with typical data, and measure the size on disk.
>
> ______________________________
> Sent from iPhone
>
> On 7 Dec 2013, at 6:08 am, John Sanda <jo...@gmail.com> wrote:
>
> I am trying to do some disk capacity planning. I have been referring to
> the DataStax docs[1] and this older blog post[2]. I have a column family
> with the following:
>
> row key - 4 bytes
> column name - 8 bytes
> column value - 8 bytes
> max number of non-deleted columns per row - 20160
>
> Is there an effective way to calculate the sizes (or at least a decent
> approximation) of the bloom filters and partition indexes on disk?
>
> [1] Calculating user data size<http://www.datastax.com/documentation/cassandra/1.2/webhelp/index.html?pagename=docs&version=1.2&file=index#cassandra/architecture/../../cassandra/architecture/architecturePlanningUserData_t.html>
> [2] Cassandra Storage Sizing <http://btoddb-cass-storage.blogspot.com/>
>
> --
>
> - John
>
>


-- 

- John

Re: calculating sizes on disk

Posted by Jacob Rhoden <ja...@me.com>.
Not sure what your end setup will be, but I would probably just spin up a cluster, fill it with typical data, and measure the size on disk.

______________________________
Sent from iPhone

> On 7 Dec 2013, at 6:08 am, John Sanda <jo...@gmail.com> wrote:
> 
> I am trying to do some disk capacity planning. I have been referring to the DataStax docs[1] and this older blog post[2]. I have a column family with the following:
> 
> row key - 4 bytes
> column name - 8 bytes
> column value - 8 bytes
> max number of non-deleted columns per row - 20160
> 
> Is there an effective way to calculate the sizes (or at least a decent approximation) of the bloom filters and partition indexes on disk?
> 
> [1] Calculating user data size
> [2] Cassandra Storage Sizing
> 
> -- 
> 
> - John