Posted to user@cassandra.apache.org by Yulian Oifa <oi...@gmail.com> on 2014/04/13 17:26:22 UTC

Cassandra disk usage

I have a column family with 2 rows.
The 2 rows have 100 million columns overall.
Each column has a name of 15 chars (digits) and the same 15 chars in the
value (also digits).
Each column should take 30 bytes.
Therefore all the data should come to approximately 3GB.
The Cassandra cluster has 3 servers, and data is stored in quorum (2
servers).
Therefore each server should have 3GB*2/3=2GB of data for this column
family.
The table is almost never changed; data is only removed from it, which
possibly created tombstones, but that should not increase the usage.
However, when I check the data I see that each server has more than 4GB of
data (more than twice what it should be).

server 1:
-rw-r--r-- 1 root root 3506446057 Dec 26 12:02 freeNumbers-g-264-Data.db
-rw-r--r-- 1 root root  814699666 Dec 26 12:24 freeNumbers-g-281-Data.db
-rw-r--r-- 1 root root  198432466 Dec 26 12:27 freeNumbers-g-284-Data.db
-rw-r--r-- 1 root root   35883918 Apr 12 20:07 freeNumbers-g-336-Data.db

server 2:
-rw-r--r-- 1 root root 3448432307 Dec 26 11:57 freeNumbers-g-285-Data.db
-rw-r--r-- 1 root root  762399716 Dec 26 12:22 freeNumbers-g-301-Data.db
-rw-r--r-- 1 root root  220887062 Dec 26 12:23 freeNumbers-g-304-Data.db
-rw-r--r-- 1 root root   54914466 Dec 26 12:26 freeNumbers-g-306-Data.db
-rw-r--r-- 1 root root   53639516 Dec 26 12:26 freeNumbers-g-305-Data.db
-rw-r--r-- 1 root root   53007967 Jan  8 15:45 freeNumbers-g-314-Data.db
-rw-r--r-- 1 root root     413717 Apr 12 18:33 freeNumbers-g-359-Data.db


server 3:
-rw-r--r-- 1 root root 4490657264 Apr 11 18:20 freeNumbers-g-358-Data.db
-rw-r--r-- 1 root root     389171 Apr 12 20:58 freeNumbers-g-360-Data.db
-rw-r--r-- 1 root root       4276 Apr 11 18:20 freeNumbers-g-358-Statistics.db
-rw-r--r-- 1 root root       4276 Apr 11 18:24 freeNumbers-g-359-Statistics.db
-rw-r--r-- 1 root root       4276 Apr 12 20:58 freeNumbers-g-360-Statistics.db
-rw-r--r-- 1 root root        976 Apr 11 18:20 freeNumbers-g-358-Filter.db
-rw-r--r-- 1 root root        208 Apr 11 18:24 freeNumbers-g-359-Data.db
-rw-r--r-- 1 root root         78 Apr 11 18:20 freeNumbers-g-358-Index.db
-rw-r--r-- 1 root root         52 Apr 11 18:24 freeNumbers-g-359-Index.db
-rw-r--r-- 1 root root         52 Apr 12 20:58 freeNumbers-g-360-Index.db
-rw-r--r-- 1 root root         16 Apr 11 18:24 freeNumbers-g-359-Filter.db
-rw-r--r-- 1 root root         16 Apr 12 20:58 freeNumbers-g-360-Filter.db
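
For reference, the Data.db sizes in the listings above can be totalled per
server. This is a minimal Python sketch: the byte counts are copied from the
listings, while the dictionary layout and variable names are only
illustrative.

```python
# Data.db sizes in bytes, copied from the ls output above.
data_db_sizes = {
    "server 1": [3506446057, 814699666, 198432466, 35883918],
    "server 2": [3448432307, 762399716, 220887062, 54914466,
                 53639516, 53007967, 413717],
    "server 3": [4490657264, 389171, 208],
}

# Total size of the column family's data files on each server.
totals = {server: sum(sizes) for server, sizes in data_db_sizes.items()}

for server, total in totals.items():
    print(f"{server}: {total} bytes ({total / 1e9:.2f} GB)")
```

All three servers come out at roughly 4.5 to 4.6 GB of Data.db files.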

When I try to compact, I get the following notification from the first server:
INFO [CompactionExecutor:1604] 2014-04-13 18:23:07,260
CompactionController.java (line 146) Compacting large row
USER_DATA/freeNumbers:8bdf9678-6d70-11e3-85ab-80e385abf85d (4555076689
bytes) incrementally

This confirms that there is around 4.5GB of data on that server alone.
Why does Cassandra take up so much space?

Best regards
Yulian Oifa

Re: Cassandra disk usage

Posted by Yulian Oifa <oi...@gmail.com>.
Hello
The data load on the 3 nodes is:

Address         DC          Rack        Status State   Load            Owns    Token
                                                                               113427455640312821154458202477256070485
172.19.10.1     19          10          Up     Normal  22.16 GB        33.33%  0
172.19.10.2     19          10          Up     Normal  19.89 GB        33.33%  56713727820156410577229101238628035242
172.19.10.3     19          10          Up     Normal  30.74 GB        33.33%  113427455640312821154458202477256070485

Best regards
Yulian Oifa




Re: Cassandra disk usage

Posted by Mark Reddy <ma...@boxever.com>.
> I will change the data I am storing to decrease the usage; in the value I
> will find some small value to store. Previously I used the same value since
> this table is an index only, for search purposes, and does not really have
> a value.


If you don't need a value, you don't have to store anything. You can store
the column name and leave the value empty; this is common practice.

> 1) What are the recommended read and write consistency levels and
> replication factor for 3 nodes, with the option of adding servers later?


Both consistency level and replication factor are tuneable depending on
your application constraints. I'd say a CL of QUORUM and an RF of 3 is the
general practice.

> 2) It still has 1.5X of the overall data; how can this be resolved and what
> is the reason for that?


As Michał pointed out there is a 15 byte column overhead to consider here,
where:

total_column_size = column_name_size + column_value_size + 15


This link might shed some light on this:
http://www.datastax.com/documentation/cassandra/1.2/cassandra/architecture/architecturePlanningUserData_t.html
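
As a rough illustration of that formula, the Python sketch below compares
storing the 15-digit value against leaving the value empty. The 15-byte
overhead and the 100 million column count are the figures quoted in this
thread; the function and constant names are hypothetical, and the real
on-disk size also depends on the Cassandra version and on compaction state.

```python
COLUMN_OVERHEAD_BYTES = 15  # per-column overhead quoted in this thread


def total_column_size(name_size: int, value_size: int) -> int:
    """total_column_size = column_name_size + column_value_size + 15"""
    return name_size + value_size + COLUMN_OVERHEAD_BYTES


columns = 100_000_000  # column count given in the original post

with_value = columns * total_column_size(15, 15)  # value repeats the name
empty_value = columns * total_column_size(15, 0)  # value left empty

print(with_value / 1e9)   # 4.5 (GB)
print(empty_value / 1e9)  # 3.0 (GB)
```

Under these assumptions, dropping the value saves about a third of the space
rather than half, because the name and the per-column overhead remain.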

> 3) Also, I see that the data size differs across nodes; does that mean
> that the servers are out of sync?


How much is it out by? Data size may differ due to deletes, as you
mentioned you do deletes. What is the output of 'nodetool ring'?



Re: Cassandra disk usage

Posted by Michal Michalski <mi...@boxever.com>.
> Each column has a name of 15 chars (digits) and the same 15 chars in the
> value (also digits).
> Each column should take 30 bytes.

Remember Cassandra's standard per-column overhead, which is, as far as I
remember, 15 bytes, so it's 45 bytes in total: 50% more than you
estimated, which roughly matches your 3 GB vs 4.5 GB case.

There's also a per-row overhead, but I'm not sure about its size in current
C* versions - I remember it was about 25 bytes or so some time ago, but
it's not important in your case.
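
A quick way to check this estimate against the row size reported in the
compaction log is sketched below in Python. The 15-byte column overhead and
the roughly 25-byte row overhead are the recalled figures from this message,
not verified constants, and the variable names are illustrative.

```python
columns = 100_000_000
rows = 2

column_bytes = 15 + 15 + 15  # name + value + assumed 15-byte column overhead
row_overhead = 25            # rough per-row figure recalled above; may vary by version

estimated = columns * column_bytes + rows * row_overhead
observed = 4_555_076_689     # row size reported in the compaction log

print(estimated)                      # 4500000050
print(f"{observed / estimated:.3f}")  # 1.012
```

With only 2 rows, the per-row overhead adds just 50 bytes, and the estimate
lands within about 1.2% of the logged size.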

Kind regards,
Michał Michalski,
michal.michalski@boxever.com



Re: Cassandra disk usage

Posted by Yulian Oifa <oi...@gmail.com>.
Hello Mark, and thanks for your reply.
1) I store it as UTF8String. All digits are from 0x30 to 0x39 and should
take 1 byte each. Since all characters are digits, each name should take 15
bytes.
2) I will change the data I am storing to decrease the usage; in the value I
will find some small value to store. Previously I used the same value since
this table is an index only, for search purposes, and does not really have
a value.
3) You are right, I read and write in quorum and it was my mistake (I
thought that if I write in quorum then data will be written to 2 nodes only).
If I check the keyspace:
create keyspace USER_DATA
  with placement_strategy = 'NetworkTopologyStrategy'
  and strategy_options = [{19 : 3}]
  and durable_writes = true;

it has a replication factor of 3.
Therefore I have several questions:
1) What are the recommended read and write consistency levels and
replication factor for 3 nodes, with the option of adding servers later?
2) It still has 1.5X of the overall data; how can this be resolved and what
is the reason for that?
3) Also, I see that the data size differs across nodes; does that mean
that the servers are out of sync?

Thanks and best regards
Yulian Oifa



Re: Cassandra disk usage

Posted by Mark Reddy <ma...@boxever.com>.
What are you storing these 15 chars as: string, int, double, etc.? 15 chars
do not necessarily translate to 15 bytes.

You may be mixing up replication factor and quorum when you say "Cassandra
cluster has 3 servers, and data is stored in quorum ( 2 servers )." You
read and write at quorum, which is (RF/2)+1 replicas (rounded down), where
RF is the replication factor, and your data is replicated to the number of
nodes you specify in your replication factor. Could you clarify?

Also, if you are concerned about disk usage, why are you storing the same 15
char value in both the column name and value? You could just store it as
the name and halve your data usage :)
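
To make the distinction concrete, here is a small Python sketch. It assumes
a single data center, where Cassandra derives QUORUM from the keyspace's
replication factor (RF) rather than from the number of replicas that store
the data; the function names are hypothetical and the per-node share assumes
evenly balanced tokens.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must acknowledge a QUORUM request: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def per_node_share(replication_factor: int, nodes: int) -> float:
    """Fraction of the data set each node stores, assuming balanced tokens."""
    return min(replication_factor, nodes) / nodes


print(quorum(3))             # 2
print(per_node_share(3, 3))  # 1.0
```

With RF 3 on 3 nodes, a QUORUM write waits for 2 acknowledgements, but every
node still ends up storing a full copy of the data.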



