Posted to user@cassandra.apache.org by Tamar Fraenkel <ta...@tok-media.com> on 2012/09/23 17:29:47 UTC

compression

Hi!
In the DataStax documentation <http://www.datastax.com/docs/1.0/ddl/column_family> there
is an explanation of which CFs are a good fit for compression:

When to Use Compression

Compression is best suited for column families where there are many rows,
with each row having the same columns, or at least many columns in common.
For example, a column family containing user data such as username, email,
etc., would be a good candidate for compression. The more similar the data
across rows, the greater the compression ratio will be, and the larger the
gain in read performance.

Compression is not as good a fit for column families where each row has a
different set of columns, or where there are just a few very wide rows.
Dynamic column families such as this will not yield good compression ratios.

I have many column families where rows share some of the columns and have
varied number of unique columns per row.
For example, I have a CF where each row has ~13 shared columns, but between
0 to many unique columns. Will such CF be a good fit for compression?

More generally, is there a rule of thumb for how many shared columns (or
percentage of columns which are shared) is considered a good fit for
compression?
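One way to answer this empirically is to compress a synthetic sample shaped like such a CF and measure the ratio. A minimal sketch, with made-up column names and value shapes matching the description above (zlib stands in for Cassandra's SnappyCompressor, since Snappy is not in the Python standard library, so the absolute ratio will differ, but the trend holds):

```python
# Estimate how well rows with ~13 shared columns plus 0..many unique
# columns compress, using zlib as a stand-in compressor.
import json
import random
import zlib

random.seed(7)

SHARED = [f"col_{i}" for i in range(13)]  # ~13 columns shared by every row

def make_row(row_id: int) -> dict:
    row = {name: f"value-{name}" for name in SHARED}
    # 0..many unique columns per row, as described above
    for j in range(random.randint(0, 20)):
        row[f"uniq_{row_id}_{j}"] = f"{random.random():.12f}"
    return row

blob = json.dumps([make_row(i) for i in range(1000)]).encode()
ratio = len(zlib.compress(blob)) / len(blob)
print(f"compressed/uncompressed ratio: {ratio:.2f}")
```

The more the unique columns dominate and the more random their values, the closer the printed ratio gets to 1.0; heavily shared column names and values push it well below 0.5.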

Thanks,

*Tamar Fraenkel *
Senior Software Engineer, TOK Media


tamar@tok-media.com
Tel:   +972 2 6409736
Mob:  +972 54 8356490
Fax:   +972 2 5612956

Re: compression

Posted by Tamar Fraenkel <ta...@tok-media.com>.
Hi!
The situation hasn't resolved; does anyone have a clue?
Thanks






On Thu, Sep 27, 2012 at 10:42 AM, Tamar Fraenkel <ta...@tok-media.com> wrote:

> Hi!
> First, the problem is still there, although I checked and all nodes agree on
> the schema.
> This is from ls -l
> Good Node
> -rw-r--r-- 1 cassandra cassandra        606 2012-09-27 08:01
> tk_usus_user-hc-269-CompressionInfo.db
> -rw-r--r-- 1 cassandra cassandra    2246431 2012-09-27 08:01
> tk_usus_user-hc-269-Data.db
> -rw-r--r-- 1 cassandra cassandra      11056 2012-09-27 08:01
> tk_usus_user-hc-269-Filter.db
> -rw-r--r-- 1 cassandra cassandra     129792 2012-09-27 08:01
> tk_usus_user-hc-269-Index.db
> -rw-r--r-- 1 cassandra cassandra       4336 2012-09-27 08:01
> tk_usus_user-hc-269-Statistics.db
>
> Node 2
> -rw-r--r-- 1 cassandra cassandra    4592393 2012-09-27 08:01
> tk_usus_user-hc-268-Data.db
> -rw-r--r-- 1 cassandra cassandra         69 2012-09-27 08:01
> tk_usus_user-hc-268-Digest.sha1
> -rw-r--r-- 1 cassandra cassandra      11056 2012-09-27 08:01
> tk_usus_user-hc-268-Filter.db
> -rw-r--r-- 1 cassandra cassandra     129792 2012-09-27 08:01
> tk_usus_user-hc-268-Index.db
> -rw-r--r-- 1 cassandra cassandra       4336 2012-09-27 08:01
> tk_usus_user-hc-268-Statistics.db
>
> Node 3
> -rw-r--r-- 1 cassandra cassandra   4592393 2012-09-27 08:01
> tk_usus_user-hc-278-Data.db
> -rw-r--r-- 1 cassandra cassandra        69 2012-09-27 08:01
> tk_usus_user-hc-278-Digest.sha1
> -rw-r--r-- 1 cassandra cassandra     11056 2012-09-27 08:01
> tk_usus_user-hc-278-Filter.db
> -rw-r--r-- 1 cassandra cassandra    129792 2012-09-27 08:01
> tk_usus_user-hc-278-Index.db
> -rw-r--r-- 1 cassandra cassandra      4336 2012-09-27 08:01
> tk_usus_user-hc-278-Statistics.db
>
> Looking at the logs, on the "good node" I can see
>
>  INFO [MigrationStage:1] 2012-09-24 10:08:16,511 Migration.java (line 119)
> Applying migration c22413b0-062f-11e2-0000-1bcb936807db Update column
> family to org.apache.cassandra.config.CFMetaData@1dbdcde9
> [cfId=1016,ksName=tok,cfName=tk_usus_user,cfType=Standard,comparator=org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.UTF8Type,org.apache.cassandra.db.marshal.UTF8Type),subcolumncomparator=<null>,comment=,rowCacheSize=0.0,keyCacheSize=200000.0,readRepairChance=1.0,replicateOnWrite=true,gcGraceSeconds=864000,defaultValidator=org.apache.cassandra.db.marshal.UTF8Type,keyValidator=org.apache.cassandra.db.marshal.UUIDType,minCompactionThreshold=4,maxCompactionThreshold=32,rowCacheSavePeriodInSeconds=0,keyCacheSavePeriodInSeconds=14400,rowCacheKeysToSave=2147483647,rowCacheProvider=org.apache.cassandra.cache.SerializingCacheProvider@3505231c,mergeShardsChance=0.1,keyAlias=java.nio.HeapByteBuffer[pos=485
> lim=488 cap=653],column_metadata={},compactionStrategyClass=class
> org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy,compactionStrategyOptions={},compressionOptions={sstable_compression=org.apache.cassandra.io.compress.SnappyCompressor,
> chunk_length_kb=64},bloomFilterFpChance=<null>]
>
> But same can be seen in the logs of the two other nodes:
>  INFO [MigrationStage:1] 2012-09-24 10:08:16,767 Migration.java (line 119)
> Applying migration c22413b0-062f-11e2-0000-1bcb936807db Update column
> family to org.apache.cassandra.config.CFMetaData@24fbb95d
> [cfId=1016,ksName=tok,cfName=tk_usus_user,cfType=Standard,comparator=org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.UTF8Type,org.apache.cassandra.db.marshal.UTF8Type),subcolumncomparator=<null>,comment=,rowCacheSize=0.0,keyCacheSize=200000.0,readRepairChance=1.0,replicateOnWrite=true,gcGraceSeconds=864000,defaultValidator=org.apache.cassandra.db.marshal.UTF8Type,keyValidator=org.apache.cassandra.db.marshal.UUIDType,minCompactionThreshold=4,maxCompactionThreshold=32,rowCacheSavePeriodInSeconds=0,keyCacheSavePeriodInSeconds=14400,rowCacheKeysToSave=2147483647,rowCacheProvider=org.apache.cassandra.cache.SerializingCacheProvider@a469ba3,mergeShardsChance=0.1,keyAlias=java.nio.HeapByteBuffer[pos=0
> lim=3 cap=3],column_metadata={},compactionStrategyClass=class
> org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy,compactionStrategyOptions={},compressionOptions={sstable_compression=org.apache.cassandra.io.compress.SnappyCompressor,
> chunk_length_kb=64},bloomFilterFpChance=<null>]
>
>  INFO [MigrationStage:1] 2012-09-24 10:08:16,705 Migration.java (line 119)
> Applying migration c22413b0-062f-11e2-0000-1bcb936807db Update column
> family to org.apache.cassandra.config.CFMetaData@216b6a58
> [cfId=1016,ksName=tok,cfName=tk_usus_user,cfType=Standard,comparator=org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.UTF8Type,org.apache.cassandra.db.marshal.UTF8Type),subcolumncomparator=<null>,comment=,rowCacheSize=0.0,keyCacheSize=200000.0,readRepairChance=1.0,replicateOnWrite=true,gcGraceSeconds=864000,defaultValidator=org.apache.cassandra.db.marshal.UTF8Type,keyValidator=org.apache.cassandra.db.marshal.UUIDType,minCompactionThreshold=4,maxCompactionThreshold=32,rowCacheSavePeriodInSeconds=0,keyCacheSavePeriodInSeconds=14400,rowCacheKeysToSave=2147483647,rowCacheProvider=org.apache.cassandra.cache.SerializingCacheProvider@1312c88c,mergeShardsChance=0.1,keyAlias=java.nio.HeapByteBuffer[pos=0
> lim=3 cap=3],column_metadata={},compactionStrategyClass=class
> org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy,compactionStrategyOptions={},compressionOptions={sstable_compression=org.apache.cassandra.io.compress.SnappyCompressor,
> chunk_length_kb=64},bloomFilterFpChance=<null>]
>
>
> I can also see scrub messages in logs
> Good node:
>  INFO [CompactionExecutor:1774] 2012-09-24 10:09:05,402
> CompactionManager.java (line 476) Scrubbing
> SSTableReader(path='/raid0/cassandra/data/tok/tk_usus_user-hc-264-Data.db')
>  INFO [CompactionExecutor:1774] 2012-09-24 10:09:05,934
> CompactionManager.java (line 658) Scrub of
> SSTableReader(path='/raid0/cassandra/data/tok/tk_usus_user-hc-264-Data.db')
> complete: 4868 rows in new sstable and 0 empty (tombstoned) rows dropped
>
> Other nodes
>
>  INFO [CompactionExecutor:1800] 2012-09-24 10:09:11,789
> CompactionManager.java (line 476) Scrubbing
> SSTableReader(path='/raid0/cassandra/data/tok/tk_usus_user-hc-260-Data.db')
>  INFO [CompactionExecutor:1800] 2012-09-24 10:09:12,464
> CompactionManager.java (line 658) Scrub of
> SSTableReader(path='/raid0/cassandra/data/tok/tk_usus_user-hc-260-Data.db')
> complete: 4868 rows in new sstable and 0 empty (tombstoned) rows dropped
>
>  INFO [CompactionExecutor:1687] 2012-09-24 10:09:16,235
> CompactionManager.java (line 476) Scrubbing
> SSTableReader(path='/raid0/cassandra/data/tok/tk_usus_user-hc-271-Data.db')
>  INFO [CompactionExecutor:1687] 2012-09-24 10:09:16,898
> CompactionManager.java (line 658) Scrub of
> SSTableReader(path='/raid0/cassandra/data/tok/tk_usus_user-hc-271-Data.db')
> complete: 4868 rows in new sstable and 0 empty (tombstoned) rows dropped
>
> Any idea?
> Thanks!!
>
> On Wed, Sep 26, 2012 at 3:40 AM, aaron morton <aa...@thelastpickle.com> wrote:
>
>> Check the logs on  nodes 2 and 3 to see if the scrub started. The logs on
>> 1 will be a good help with that.
>>
>> Cheers
>>
>>   -----------------
>> Aaron Morton
>> Freelance Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 24/09/2012, at 10:31 PM, Tamar Fraenkel <ta...@tok-media.com> wrote:
>>
>> Hi!
>> I ran
>> UPDATE COLUMN FAMILY cf_name WITH
>> compression_options={sstable_compression:SnappyCompressor,
>> chunk_length_kb:64};
>>
>> I then ran on all my nodes (3)
>> sudo nodetool -h localhost scrub tok cf_name
>>
>> I have replication factor 3. The size of the data on disk was cut in half
>> on the first node, and in JMX I can see that the compression ratio is
>> indeed 0.46. But on nodes 2 and 3 nothing happened: in JMX the compression
>> ratio is 0 and the size of the files on disk stayed the same.
>>
>> In cli
>>
>> ColumnFamily: cf_name
>>       Key Validation Class: org.apache.cassandra.db.marshal.UUIDType
>>       Default column value validator:
>> org.apache.cassandra.db.marshal.UTF8Type
>>       Columns sorted by:
>> org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.UTF8Type,org.apache.cassandra.db.marshal.UTF8Type)
>>       Row cache size / save period in seconds / keys to save : 0.0/0/all
>>       Row Cache Provider:
>> org.apache.cassandra.cache.SerializingCacheProvider
>>       Key cache size / save period in seconds: 200000.0/14400
>>       GC grace seconds: 864000
>>       Compaction min/max thresholds: 4/32
>>       Read repair chance: 1.0
>>       Replicate on write: true
>>       Bloom Filter FP chance: default
>>       Built indexes: []
>>       Compaction Strategy:
>> org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy
>>       Compression Options:
>>         chunk_length_kb: 64
>>         sstable_compression:
>> org.apache.cassandra.io.compress.SnappyCompressor
>>
>> Can anyone help?
>> Thanks
>>
>>
>>
>>
>>
>>
>> On Mon, Sep 24, 2012 at 8:37 AM, Tamar Fraenkel <ta...@tok-media.com> wrote:
>>
>>> Thanks all, that helps. I'll start with one or two CFs and let you know
>>> the effect.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Sun, Sep 23, 2012 at 8:21 PM, Hiller, Dean <De...@nrel.gov> wrote:
>>>
>>>> As well, your unlimited column names may all have the same prefix,
>>>> right? Like "accounts".rowkey56, "accounts".rowkey78, etc., so the
>>>> "accounts" prefix gets a ton of compression then.
>>>>
>>>> Later,
>>>> Dean
>>>>
>>>> From: Tyler Hobbs <ty...@datastax.com>
>>>> Reply-To: user@cassandra.apache.org
>>>> Date: Sunday, September 23, 2012 11:46 AM
>>>> To: user@cassandra.apache.org
>>>> Subject: Re: compression
>>>>
>>>>  column metadata, you're still likely to get a reasonable amount of
>>>> compression.  This is especially true if there is some amount of repetition
>>>> in the column names, values, or TTLs in wide rows.  Compression will almost
>>>> always be beneficial unless you're already somehow CPU bound or are using
>>>> large column values that are high in entropy, such as pre-compressed or
>>>> encrypted data.
>>>>
>>>
>>>
>>
>>
>
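The file listings above amount to checking, per node, whether each sstable was rewritten with a -CompressionInfo.db component next to its -Data.db file. That check can be scripted; a sketch, with the data directory and CF name taken from the paths quoted in this thread (adjust for your cluster):

```python
# For each -Data.db sstable of a column family, report whether a matching
# -CompressionInfo.db component exists (i.e. the sstable is compressed).
from pathlib import Path

def sstable_compression_report(data_dir: str, cf: str) -> dict:
    """Map each -Data.db file of the CF to whether CompressionInfo.db exists."""
    report = {}
    for data in sorted(Path(data_dir).glob(f"{cf}-*-Data.db")):
        info = data.with_name(data.name[: -len("-Data.db")] + "-CompressionInfo.db")
        report[str(data)] = info.exists()
    return report

# Paths as quoted in the thread; adjust for your cluster.
for path, compressed in sstable_compression_report(
        "/raid0/cassandra/data/tok", "tk_usus_user").items():
    print(("compressed:  " if compressed else "uncompressed:"), path)
```

After a scrub under compression options, every current -Data.db should report compressed; a node whose sstables all report uncompressed is the one whose rewrite did not pick up the new options.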

Re: compression

Posted by aaron morton <aa...@thelastpickle.com>.
Can you try restarting the node? That would reload the CF metadata and reset the compaction settings.

Sorry that's not very helpful but it's all I can think of for now. 

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 24/10/2012, at 11:41 PM, Tamar Fraenkel <ta...@tok-media.com> wrote:

> Hi!
> I tried again, I see the scrub action in cassandra logs
>  INFO [CompactionExecutor:4029] 2012-10-24 10:36:54,108 CompactionManager.java (line 476) Scrubbing SSTableReader(path='/raid0/cassandra/data/tok/tk_usus_user-hc-339-Data.db')
>  INFO [CompactionExecutor:4029] 2012-10-24 10:36:54,184 CompactionManager.java (line 658) Scrub of SSTableReader(path='/raid0/cassandra/data/tok/tk_usus_user-hc-339-Data.db') complete: 54 rows in new sstable and 0 empty (tombstoned) rows dropped
>  INFO [CompactionExecutor:4029] 2012-10-24 10:36:54,185 CompactionManager.java (line 476) Scrubbing SSTableReader(path='/raid0/cassandra/data/tok/tk_usus_user-hc-340-Data.db')
>  INFO [CompactionExecutor:4029] 2012-10-24 10:36:54,914 CompactionManager.java (line 658) Scrub of SSTableReader(path='/raid0/cassandra/data/tok/tk_usus_user-hc-340-Data.db') complete: 7037 rows in new sstable and 0 empty (tombstoned) rows dropped
> 
> I don't see any CompressionInfo.db files, and the compression ratio is still 0.0 on this node only; on the other nodes it is almost 0.5...
> 
> Any idea?
> 
> Thanks,
> 


Re: compression

Posted by Tamar Fraenkel <ta...@tok-media.com>.
Hi!
I tried again, I see the scrub action in cassandra logs
 INFO [CompactionExecutor:4029] 2012-10-24 10:36:54,108
CompactionManager.java (line 476) Scrubbing
SSTableReader(path='/raid0/cassandra/data/tok/tk_usus_user-hc-339-Data.db')
 INFO [CompactionExecutor:4029] 2012-10-24 10:36:54,184
CompactionManager.java (line 658) Scrub of
SSTableReader(path='/raid0/cassandra/data/tok/tk_usus_user-hc-339-Data.db')
complete: 54 rows in new sstable and 0 empty (tombstoned) rows dropped
 INFO [CompactionExecutor:4029] 2012-10-24 10:36:54,185
CompactionManager.java (line 476) Scrubbing
SSTableReader(path='/raid0/cassandra/data/tok/tk_usus_user-hc-340-Data.db')
 INFO [CompactionExecutor:4029] 2012-10-24 10:36:54,914
CompactionManager.java (line 658) Scrub of
SSTableReader(path='/raid0/cassandra/data/tok/tk_usus_user-hc-340-Data.db')
complete: 7037 rows in new sstable and 0 empty (tombstoned) rows dropped

I don't see any CompressionInfo.db files and compression ratio is still 0.0
on this node only, on other nodes it is almost 0.5...

Any idea?

Thanks,

*Tamar Fraenkel *
Senior Software Engineer, TOK Media

[image: Inline image 1]

tamar@tok-media.com
Tel:   +972 2 6409736
Mob:  +972 54 8356490
Fax:   +972 2 5612956





On Wed, Sep 26, 2012 at 3:40 AM, aaron morton <aa...@thelastpickle.com>wrote:

> Check the logs on  nodes 2 and 3 to see if the scrub started. The logs on
> 1 will be a good help with that.
>
> Cheers
>
>   -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 24/09/2012, at 10:31 PM, Tamar Fraenkel <ta...@tok-media.com> wrote:
>
> Hi!
> I ran
> UPDATE COLUMN FAMILY cf_name WITH
> compression_options={sstable_compression:SnappyCompressor,
> chunk_length_kb:64};
>
> I then ran on all my nodes (3)
> sudo nodetool -h localhost scrub tok cf_name
>
> I have replication factor 3. The size of the data on disk was cut in half
> in the first node and in the jmx I can see that indeed the compression
> ration is 0.46. But on nodes 2 and 3 nothing happened. In the jmx I can see
> that compression ratio is 0 and the size of the files of disk stayed the
> same.
>
> In cli
>
> ColumnFamily: cf_name
>       Key Validation Class: org.apache.cassandra.db.marshal.UUIDType
>       Default column value validator:
> org.apache.cassandra.db.marshal.UTF8Type
>       Columns sorted by:
> org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.UTF8Type,org.apache.cassandra.db.marshal.UTF8Type)
>       Row cache size / save period in seconds / keys to save : 0.0/0/all
>       Row Cache Provider:
> org.apache.cassandra.cache.SerializingCacheProvider
>       Key cache size / save period in seconds: 200000.0/14400
>       GC grace seconds: 864000
>       Compaction min/max thresholds: 4/32
>       Read repair chance: 1.0
>       Replicate on write: true
>       Bloom Filter FP chance: default
>       Built indexes: []
>       Compaction Strategy:
> org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy
>       Compression Options:
>         chunk_length_kb: 64
>         sstable_compression:
> org.apache.cassandra.io.compress.SnappyCompressor
>
> Can anyone help?
> Thanks
>
>  *Tamar Fraenkel *
> Senior Software Engineer, TOK Media
>
> <tokLogo.png>
>
>
> tamar@tok-media.com
> Tel:   +972 2 6409736
> Mob:  +972 54 8356490
> Fax:   +972 2 5612956
>
>
>
>
>
> On Mon, Sep 24, 2012 at 8:37 AM, Tamar Fraenkel <ta...@tok-media.com>wrote:
>
>> Thanks all, that helps. Will start with one or two CFs and let you know
>> the effect
>>
>>
>> *Tamar Fraenkel *
>> Senior Software Engineer, TOK Media
>>
>>
>>
>> tamar@tok-media.com
>> Tel:   +972 2 6409736
>> Mob:  +972 54 8356490
>> Fax:   +972 2 5612956
>>
>>
>>
>>
>>
>> On Sun, Sep 23, 2012 at 8:21 PM, Hiller, Dean <De...@nrel.gov>wrote:
>>
>>> As well, your unlimited column names may all have the same prefix,
>>> right? Like "accounts".rowkey56, "accounts".rowkey78, etc., so the
>>> "accounts" prefix gets a ton of compression.
>>>
>>> Later,
>>> Dean
>>>
>>> From: Tyler Hobbs <ty...@datastax.com>>
>>> Reply-To: "user@cassandra.apache.org<ma...@cassandra.apache.org>"
>>> <us...@cassandra.apache.org>>
>>> Date: Sunday, September 23, 2012 11:46 AM
>>> To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <
>>> user@cassandra.apache.org<ma...@cassandra.apache.org>>
>>> Subject: Re: compression
>>>
>>>  column metadata, you're still likely to get a reasonable amount of
>>> compression.  This is especially true if there is some amount of repetition
>>> in the column names, values, or TTLs in wide rows.  Compression will almost
>>> always be beneficial unless you're already somehow CPU bound or are using
>>> large column values that are high in entropy, such as pre-compressed or
>>> encrypted data.
>>>
>>
>>
>
>

Re: compression

Posted by Alain RODRIGUEZ <ar...@gmail.com>.
I have no clue. I have never done it myself, though I am planning to.

1 - Did you just spend 1 month with a cluster in an "unstable" state? Did
you have any issues during this time related to the transitional state of
your cluster?

I am currently storing counters with:
row => objectId, column name => date#event, data => counter (date format
20121029).

2 - Is it a good idea to compress this kind of data?

I am looking into using composite columns.

3 - What are the benefits of using a composite column name like
CompositeType(UTF8Type, UTF8Type) compared to a simple UTF8 column with
event and date separated by a '#', as I am doing right now?

4 - Would compression be a good idea in this case?

Thanks for your help on any of these 4 points :).

Alain
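On question 3, here is one concrete difference, sketched in Python (a hedged illustration of the two encodings, not Cassandra code): a composite compares its components separately, while a '#'-joined string compares as raw bytes, so the string form only sorts correctly while the date stays fixed-width and the delimiter never appears inside a component:

```python
# Delimiter-joined names sort as plain strings; composites sort component
# by component. With zero-padded dates the two orderings happen to agree:
joined = ["20121029#click", "20121029#view", "20121030#click"]
composite = [("20121029", "click"), ("20121029", "view"), ("20121030", "click")]

assert sorted(joined) == ["%s#%s" % t for t in sorted(composite)]

# But an event name that happens to contain the delimiter is ambiguous:
name = "20121029#promo#v2"
date, event = name.split("#", 1)  # without maxsplit=1, unpacking would fail
assert (date, event) == ("20121029", "promo#v2")
```

A CompositeType column avoids both pitfalls because each component is length-prefixed and compared on its own.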


2012/10/29 Tamar Fraenkel <ta...@tok-media.com>

> Hi!
> Thanks Aaron!
> Today I restarted Cassandra on that node and ran scrub again, now it is
> fine.
>
> I am worried though that if I decide to change another CF to use
> compression I will have that issue again. Any clue how to avoid it?
>
> Thanks.
>
> *Tamar Fraenkel *
> Senior Software Engineer, TOK Media
>
>
>
> tamar@tok-media.com
> Tel:   +972 2 6409736
> Mob:  +972 54 8356490
> Fax:   +972 2 5612956
>
>
>
>
>
> On Wed, Sep 26, 2012 at 3:40 AM, aaron morton <aa...@thelastpickle.com>wrote:
>
>> Check the logs on nodes 2 and 3 to see if the scrub started. The logs on
>> node 1 will be a good help with that.
>>
>> Cheers
>>
>>   -----------------
>> Aaron Morton
>> Freelance Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 24/09/2012, at 10:31 PM, Tamar Fraenkel <ta...@tok-media.com> wrote:
>>
>> Hi!
>> I ran
>> UPDATE COLUMN FAMILY cf_name WITH
>> compression_options={sstable_compression:SnappyCompressor,
>> chunk_length_kb:64};
>>
>> I then ran on all my nodes (3)
>> sudo nodetool -h localhost scrub tok cf_name
>>
>> I have replication factor 3. The size of the data on disk was cut in half
>> on the first node, and in JMX I can see that the compression ratio is
>> indeed 0.46. But on nodes 2 and 3 nothing happened: in JMX I can see that
>> the compression ratio is 0, and the size of the files on disk stayed the
>> same.
>>
>> In cli
>>
>> ColumnFamily: cf_name
>>       Key Validation Class: org.apache.cassandra.db.marshal.UUIDType
>>       Default column value validator:
>> org.apache.cassandra.db.marshal.UTF8Type
>>       Columns sorted by:
>> org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.UTF8Type,org.apache.cassandra.db.marshal.UTF8Type)
>>       Row cache size / save period in seconds / keys to save : 0.0/0/all
>>       Row Cache Provider:
>> org.apache.cassandra.cache.SerializingCacheProvider
>>       Key cache size / save period in seconds: 200000.0/14400
>>       GC grace seconds: 864000
>>       Compaction min/max thresholds: 4/32
>>       Read repair chance: 1.0
>>       Replicate on write: true
>>       Bloom Filter FP chance: default
>>       Built indexes: []
>>       Compaction Strategy:
>> org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy
>>       Compression Options:
>>         chunk_length_kb: 64
>>         sstable_compression:
>> org.apache.cassandra.io.compress.SnappyCompressor
>>
>> Can anyone help?
>> Thanks
>>
>>  *Tamar Fraenkel *
>> Senior Software Engineer, TOK Media
>>
>>
>>
>> tamar@tok-media.com
>> Tel:   +972 2 6409736
>> Mob:  +972 54 8356490
>> Fax:   +972 2 5612956
>>
>>
>>
>>
>>
>> On Mon, Sep 24, 2012 at 8:37 AM, Tamar Fraenkel <ta...@tok-media.com>wrote:
>>
>>> Thanks all, that helps. Will start with one or two CFs and let you know
>>> the effect
>>>
>>>
>>> *Tamar Fraenkel *
>>> Senior Software Engineer, TOK Media
>>>
>>>
>>>
>>> tamar@tok-media.com
>>> Tel:   +972 2 6409736
>>> Mob:  +972 54 8356490
>>> Fax:   +972 2 5612956
>>>
>>>
>>>
>>>
>>>
>>> On Sun, Sep 23, 2012 at 8:21 PM, Hiller, Dean <De...@nrel.gov>wrote:
>>>
>>>> As well, your unlimited column names may all have the same prefix,
>>>> right? Like "accounts".rowkey56, "accounts".rowkey78, etc., so the
>>>> "accounts" prefix gets a ton of compression.
>>>>
>>>> Later,
>>>> Dean
>>>>
>>>> From: Tyler Hobbs <ty...@datastax.com>>
>>>> Reply-To: "user@cassandra.apache.org<ma...@cassandra.apache.org>"
>>>> <us...@cassandra.apache.org>>
>>>> Date: Sunday, September 23, 2012 11:46 AM
>>>> To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <
>>>> user@cassandra.apache.org<ma...@cassandra.apache.org>>
>>>> Subject: Re: compression
>>>>
>>>>  column metadata, you're still likely to get a reasonable amount of
>>>> compression.  This is especially true if there is some amount of repetition
>>>> in the column names, values, or TTLs in wide rows.  Compression will almost
>>>> always be beneficial unless you're already somehow CPU bound or are using
>>>> large column values that are high in entropy, such as pre-compressed or
>>>> encrypted data.
>>>>
>>>
>>>
>>
>>
>

Re: compression

Posted by aaron morton <aa...@thelastpickle.com>.
>  Any clue how to avoid it?
Not really sure what went wrong. Diagnosing that sort of problem usually takes access to the running node and time to poke around and see what it does in response to various things. 

Rebooting works for Windows 95 and Cassandra is not that different. 

Cheers
 
-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 29/10/2012, at 9:12 PM, Tamar Fraenkel <ta...@tok-media.com> wrote:

> Hi!
> Thanks Aaron!
> Today I restarted Cassandra on that node and ran scrub again, now it is fine.
> 
> I am worried though that if I decide to change another CF to use compression I will have that issue again. Any clue how to avoid it?
> 
> Thanks.
> 
> Tamar Fraenkel 
> Senior Software Engineer, TOK Media 
> 
> 
> tamar@tok-media.com
> Tel:   +972 2 6409736 
> Mob:  +972 54 8356490 
> Fax:   +972 2 5612956 
> 
> 
> 
> 
> 
> On Wed, Sep 26, 2012 at 3:40 AM, aaron morton <aa...@thelastpickle.com> wrote:
> Check the logs on nodes 2 and 3 to see if the scrub started. The logs on node 1 will be a good help with that. 
> 
> Cheers
> 
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
> 
> On 24/09/2012, at 10:31 PM, Tamar Fraenkel <ta...@tok-media.com> wrote:
> 
>> Hi!
>> I ran 
>> UPDATE COLUMN FAMILY cf_name WITH compression_options={sstable_compression:SnappyCompressor, chunk_length_kb:64};
>> 
>> I then ran on all my nodes (3)
>> sudo nodetool -h localhost scrub tok cf_name
>> 
>> I have replication factor 3. The size of the data on disk was cut in half on the first node, and in JMX I can see that the compression ratio is indeed 0.46. But on nodes 2 and 3 nothing happened: in JMX I can see that the compression ratio is 0, and the size of the files on disk stayed the same.
>> 
>> In cli 
>> 
>> ColumnFamily: cf_name
>>       Key Validation Class: org.apache.cassandra.db.marshal.UUIDType
>>       Default column value validator: org.apache.cassandra.db.marshal.UTF8Type
>>       Columns sorted by: org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.UTF8Type,org.apache.cassandra.db.marshal.UTF8Type)
>>       Row cache size / save period in seconds / keys to save : 0.0/0/all
>>       Row Cache Provider: org.apache.cassandra.cache.SerializingCacheProvider
>>       Key cache size / save period in seconds: 200000.0/14400
>>       GC grace seconds: 864000
>>       Compaction min/max thresholds: 4/32
>>       Read repair chance: 1.0
>>       Replicate on write: true
>>       Bloom Filter FP chance: default
>>       Built indexes: []
>>       Compaction Strategy: org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy
>>       Compression Options:
>>         chunk_length_kb: 64
>>         sstable_compression: org.apache.cassandra.io.compress.SnappyCompressor
>> 
>> Can anyone help?
>> Thanks
>> 
>> Tamar Fraenkel 
>> Senior Software Engineer, TOK Media 
>> 
>> 
>> 
>> tamar@tok-media.com
>> Tel:   +972 2 6409736 
>> Mob:  +972 54 8356490 
>> Fax:   +972 2 5612956 
>> 
>> 
>> 
>> 
>> 
>> On Mon, Sep 24, 2012 at 8:37 AM, Tamar Fraenkel <ta...@tok-media.com> wrote:
>> Thanks all, that helps. Will start with one or two CFs and let you know the effect
>> 
>> 
>> Tamar Fraenkel 
>> Senior Software Engineer, TOK Media 
>> 
>> 
>> 
>> tamar@tok-media.com
>> Tel:   +972 2 6409736 
>> Mob:  +972 54 8356490 
>> Fax:   +972 2 5612956 
>> 
>> 
>> 
>> 
>> 
>> On Sun, Sep 23, 2012 at 8:21 PM, Hiller, Dean <De...@nrel.gov> wrote:
>> As well, your unlimited column names may all have the same prefix, right? Like "accounts".rowkey56, "accounts".rowkey78, etc., so the "accounts" prefix gets a ton of compression.
>> 
>> Later,
>> Dean
>> 
>> From: Tyler Hobbs <ty...@datastax.com>>
>> Reply-To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
>> Date: Sunday, September 23, 2012 11:46 AM
>> To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
>> Subject: Re: compression
>> 
>>  column metadata, you're still likely to get a reasonable amount of compression.  This is especially true if there is some amount of repetition in the column names, values, or TTLs in wide rows.  Compression will almost always be beneficial unless you're already somehow CPU bound or are using large column values that are high in entropy, such as pre-compressed or encrypted data.
>> 
>> 
> 
> 


Re: compression

Posted by Tamar Fraenkel <ta...@tok-media.com>.
Hi!
Thanks Aaron!
Today I restarted Cassandra on that node and ran scrub again, now it is
fine.

I am worried though that if I decide to change another CF to use
compression I will have that issue again. Any clue how to avoid it?

Thanks.

*Tamar Fraenkel *
Senior Software Engineer, TOK Media


tamar@tok-media.com
Tel:   +972 2 6409736
Mob:  +972 54 8356490
Fax:   +972 2 5612956





On Wed, Sep 26, 2012 at 3:40 AM, aaron morton <aa...@thelastpickle.com>wrote:

> Check the logs on nodes 2 and 3 to see if the scrub started. The logs on
> node 1 will be a good help with that.
>
> Cheers
>
>   -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 24/09/2012, at 10:31 PM, Tamar Fraenkel <ta...@tok-media.com> wrote:
>
> Hi!
> I ran
> UPDATE COLUMN FAMILY cf_name WITH
> compression_options={sstable_compression:SnappyCompressor,
> chunk_length_kb:64};
>
> I then ran on all my nodes (3)
> sudo nodetool -h localhost scrub tok cf_name
>
> I have replication factor 3. The size of the data on disk was cut in half
> on the first node, and in JMX I can see that the compression ratio is
> indeed 0.46. But on nodes 2 and 3 nothing happened: in JMX I can see that
> the compression ratio is 0, and the size of the files on disk stayed the
> same.
>
> In cli
>
> ColumnFamily: cf_name
>       Key Validation Class: org.apache.cassandra.db.marshal.UUIDType
>       Default column value validator:
> org.apache.cassandra.db.marshal.UTF8Type
>       Columns sorted by:
> org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.UTF8Type,org.apache.cassandra.db.marshal.UTF8Type)
>       Row cache size / save period in seconds / keys to save : 0.0/0/all
>       Row Cache Provider:
> org.apache.cassandra.cache.SerializingCacheProvider
>       Key cache size / save period in seconds: 200000.0/14400
>       GC grace seconds: 864000
>       Compaction min/max thresholds: 4/32
>       Read repair chance: 1.0
>       Replicate on write: true
>       Bloom Filter FP chance: default
>       Built indexes: []
>       Compaction Strategy:
> org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy
>       Compression Options:
>         chunk_length_kb: 64
>         sstable_compression:
> org.apache.cassandra.io.compress.SnappyCompressor
>
> Can anyone help?
> Thanks
>
>  *Tamar Fraenkel *
> Senior Software Engineer, TOK Media
>
>
>
> tamar@tok-media.com
> Tel:   +972 2 6409736
> Mob:  +972 54 8356490
> Fax:   +972 2 5612956
>
>
>
>
>
> On Mon, Sep 24, 2012 at 8:37 AM, Tamar Fraenkel <ta...@tok-media.com>wrote:
>
>> Thanks all, that helps. Will start with one or two CFs and let you know
>> the effect
>>
>>
>> *Tamar Fraenkel *
>> Senior Software Engineer, TOK Media
>>
>>
>>
>> tamar@tok-media.com
>> Tel:   +972 2 6409736
>> Mob:  +972 54 8356490
>> Fax:   +972 2 5612956
>>
>>
>>
>>
>>
>> On Sun, Sep 23, 2012 at 8:21 PM, Hiller, Dean <De...@nrel.gov>wrote:
>>
>>> As well, your unlimited column names may all have the same prefix,
>>> right? Like "accounts".rowkey56, "accounts".rowkey78, etc., so the
>>> "accounts" prefix gets a ton of compression.
>>>
>>> Later,
>>> Dean
>>>
>>> From: Tyler Hobbs <ty...@datastax.com>>
>>> Reply-To: "user@cassandra.apache.org<ma...@cassandra.apache.org>"
>>> <us...@cassandra.apache.org>>
>>> Date: Sunday, September 23, 2012 11:46 AM
>>> To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <
>>> user@cassandra.apache.org<ma...@cassandra.apache.org>>
>>> Subject: Re: compression
>>>
>>>  column metadata, you're still likely to get a reasonable amount of
>>> compression.  This is especially true if there is some amount of repetition
>>> in the column names, values, or TTLs in wide rows.  Compression will almost
>>> always be beneficial unless you're already somehow CPU bound or are using
>>> large column values that are high in entropy, such as pre-compressed or
>>> encrypted data.
>>>
>>
>>
>
>

Re: compression

Posted by aaron morton <aa...@thelastpickle.com>.
Check the logs on nodes 2 and 3 to see if the scrub started. The logs on node 1 will be a good help with that. 

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 24/09/2012, at 10:31 PM, Tamar Fraenkel <ta...@tok-media.com> wrote:

> Hi!
> I ran 
> UPDATE COLUMN FAMILY cf_name WITH compression_options={sstable_compression:SnappyCompressor, chunk_length_kb:64};
> 
> I then ran on all my nodes (3)
> sudo nodetool -h localhost scrub tok cf_name
> 
> I have replication factor 3. The size of the data on disk was cut in half on the first node, and in JMX I can see that the compression ratio is indeed 0.46. But on nodes 2 and 3 nothing happened: in JMX I can see that the compression ratio is 0, and the size of the files on disk stayed the same.
> 
> In cli 
> 
> ColumnFamily: cf_name
>       Key Validation Class: org.apache.cassandra.db.marshal.UUIDType
>       Default column value validator: org.apache.cassandra.db.marshal.UTF8Type
>       Columns sorted by: org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.UTF8Type,org.apache.cassandra.db.marshal.UTF8Type)
>       Row cache size / save period in seconds / keys to save : 0.0/0/all
>       Row Cache Provider: org.apache.cassandra.cache.SerializingCacheProvider
>       Key cache size / save period in seconds: 200000.0/14400
>       GC grace seconds: 864000
>       Compaction min/max thresholds: 4/32
>       Read repair chance: 1.0
>       Replicate on write: true
>       Bloom Filter FP chance: default
>       Built indexes: []
>       Compaction Strategy: org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy
>       Compression Options:
>         chunk_length_kb: 64
>         sstable_compression: org.apache.cassandra.io.compress.SnappyCompressor
> 
> Can anyone help?
> Thanks
> 
> Tamar Fraenkel 
> Senior Software Engineer, TOK Media 
> 
> 
> tamar@tok-media.com
> Tel:   +972 2 6409736 
> Mob:  +972 54 8356490 
> Fax:   +972 2 5612956 
> 
> 
> 
> 
> 
> On Mon, Sep 24, 2012 at 8:37 AM, Tamar Fraenkel <ta...@tok-media.com> wrote:
> Thanks all, that helps. Will start with one or two CFs and let you know the effect
> 
> 
> Tamar Fraenkel 
> Senior Software Engineer, TOK Media 
> 
> 
> tamar@tok-media.com
> Tel:   +972 2 6409736 
> Mob:  +972 54 8356490 
> Fax:   +972 2 5612956 
> 
> 
> 
> 
> 
> On Sun, Sep 23, 2012 at 8:21 PM, Hiller, Dean <De...@nrel.gov> wrote:
> As well, your unlimited column names may all have the same prefix, right? Like "accounts".rowkey56, "accounts".rowkey78, etc., so the "accounts" prefix gets a ton of compression.
> 
> Later,
> Dean
> 
> From: Tyler Hobbs <ty...@datastax.com>>
> Reply-To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
> Date: Sunday, September 23, 2012 11:46 AM
> To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
> Subject: Re: compression
> 
>  column metadata, you're still likely to get a reasonable amount of compression.  This is especially true if there is some amount of repetition in the column names, values, or TTLs in wide rows.  Compression will almost always be beneficial unless you're already somehow CPU bound or are using large column values that are high in entropy, such as pre-compressed or encrypted data.
> 
> 


Re: compression

Posted by Tamar Fraenkel <ta...@tok-media.com>.
Hi!
I ran
UPDATE COLUMN FAMILY cf_name WITH
compression_options={sstable_compression:SnappyCompressor,
chunk_length_kb:64};

I then ran on all my nodes (3)
sudo nodetool -h localhost scrub tok cf_name

I have replication factor 3. The size of the data on disk was cut in half
on the first node, and in JMX I can see that the compression ratio is
indeed 0.46. But on nodes 2 and 3 nothing happened: in JMX I can see that
the compression ratio is 0, and the size of the files on disk stayed the
same.
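A version-independent way to cross-check the JMX numbers on each node is to total the on-disk size of the CF's SSTable files before and after the scrub. A small Python sketch (the 1.0-era layout `data_dir/<keyspace>/<cf>-*` and the `/var/lib/cassandra/data` path are assumptions; adjust for your install):

```python
import os

def sstable_bytes(data_dir, keyspace, cf_name):
    """Sum the sizes of one column family's SSTable files.

    Assumes the Cassandra 1.0-style layout where SSTables sit directly
    under data_dir/<keyspace>/ with names like <cf_name>-hc-1-Data.db.
    """
    ks_dir = os.path.join(data_dir, keyspace)
    return sum(
        os.path.getsize(os.path.join(ks_dir, name))
        for name in os.listdir(ks_dir)
        if name.startswith(cf_name + "-")
    )

# Hypothetical usage on a node:
# print(sstable_bytes("/var/lib/cassandra/data", "tok", "cf_name"))
```

Run it on each replica; if the scrub rewrote the SSTables with compression, the total should drop on every node, not just the first.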

In cli

ColumnFamily: cf_name
      Key Validation Class: org.apache.cassandra.db.marshal.UUIDType
      Default column value validator:
org.apache.cassandra.db.marshal.UTF8Type
      Columns sorted by:
org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.UTF8Type,org.apache.cassandra.db.marshal.UTF8Type)
      Row cache size / save period in seconds / keys to save : 0.0/0/all
      Row Cache Provider:
org.apache.cassandra.cache.SerializingCacheProvider
      Key cache size / save period in seconds: 200000.0/14400
      GC grace seconds: 864000
      Compaction min/max thresholds: 4/32
      Read repair chance: 1.0
      Replicate on write: true
      Bloom Filter FP chance: default
      Built indexes: []
      Compaction Strategy:
org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy
      Compression Options:
        chunk_length_kb: 64
        sstable_compression:
org.apache.cassandra.io.compress.SnappyCompressor

Can anyone help?
Thanks

*Tamar Fraenkel *
Senior Software Engineer, TOK Media


tamar@tok-media.com
Tel:   +972 2 6409736
Mob:  +972 54 8356490
Fax:   +972 2 5612956





On Mon, Sep 24, 2012 at 8:37 AM, Tamar Fraenkel <ta...@tok-media.com> wrote:

> Thanks all, that helps. Will start with one or two CFs and let you know the
> effect
>
>
> *Tamar Fraenkel *
> Senior Software Engineer, TOK Media
>
>
> tamar@tok-media.com
> Tel:   +972 2 6409736
> Mob:  +972 54 8356490
> Fax:   +972 2 5612956
>
>
>
>
>
> On Sun, Sep 23, 2012 at 8:21 PM, Hiller, Dean <De...@nrel.gov>wrote:
>
>> As well, your unlimited column names may all have the same prefix,
>> right? Like "accounts".rowkey56, "accounts".rowkey78, etc., so the
>> "accounts" prefix gets a ton of compression.
>>
>> Later,
>> Dean
>>
>> From: Tyler Hobbs <ty...@datastax.com>>
>> Reply-To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <
>> user@cassandra.apache.org<ma...@cassandra.apache.org>>
>> Date: Sunday, September 23, 2012 11:46 AM
>> To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <
>> user@cassandra.apache.org<ma...@cassandra.apache.org>>
>> Subject: Re: compression
>>
>>  column metadata, you're still likely to get a reasonable amount of
>> compression.  This is especially true if there is some amount of repetition
>> in the column names, values, or TTLs in wide rows.  Compression will almost
>> always be beneficial unless you're already somehow CPU bound or are using
>> large column values that are high in entropy, such as pre-compressed or
>> encrypted data.
>>
>
>

Re: compression

Posted by Tamar Fraenkel <ta...@tok-media.com>.
Thanks all, that helps. Will start with one or two CFs and let you know the
effect

*Tamar Fraenkel *
Senior Software Engineer, TOK Media


tamar@tok-media.com
Tel:   +972 2 6409736
Mob:  +972 54 8356490
Fax:   +972 2 5612956





On Sun, Sep 23, 2012 at 8:21 PM, Hiller, Dean <De...@nrel.gov> wrote:

> As well, your unlimited column names may all have the same prefix,
> right? Like "accounts".rowkey56, "accounts".rowkey78, etc., so the
> "accounts" prefix gets a ton of compression.
>
> Later,
> Dean
>
> From: Tyler Hobbs <ty...@datastax.com>>
> Reply-To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <
> user@cassandra.apache.org<ma...@cassandra.apache.org>>
> Date: Sunday, September 23, 2012 11:46 AM
> To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <
> user@cassandra.apache.org<ma...@cassandra.apache.org>>
> Subject: Re: compression
>
>  column metadata, you're still likely to get a reasonable amount of
> compression.  This is especially true if there is some amount of repetition
> in the column names, values, or TTLs in wide rows.  Compression will almost
> always be beneficial unless you're already somehow CPU bound or are using
> large column values that are high in entropy, such as pre-compressed or
> encrypted data.
>

Re: compression

Posted by "Hiller, Dean" <De...@nrel.gov>.
As well, your unlimited column names may all have the same prefix, right? Like "accounts".rowkey56, "accounts".rowkey78, etc., so the "accounts" prefix gets a ton of compression.

Later,
Dean

From: Tyler Hobbs <ty...@datastax.com>>
Reply-To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
Date: Sunday, September 23, 2012 11:46 AM
To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
Subject: Re: compression

 column metadata, you're still likely to get a reasonable amount of compression.  This is especially true if there is some amount of repetition in the column names, values, or TTLs in wide rows.  Compression will almost always be beneficial unless you're already somehow CPU bound or are using large column values that are high in entropy, such as pre-compressed or encrypted data.

Re: compression

Posted by Tyler Hobbs <ty...@datastax.com>.
Due to repetition in the column metadata, you're still likely to get a
reasonable amount of compression.  This is especially true if there is some
amount of repetition in the column names, values, or TTLs in wide rows.
Compression will almost always be beneficial unless you're already somehow
CPU bound or are using large column values that are high in entropy, such
as pre-compressed or encrypted data.
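The entropy point is easy to demonstrate with a small Python sketch (it uses zlib rather than Cassandra's Snappy, so the exact ratios differ, but the shape of the result is the same):

```python
import os
import zlib

# Repetitive data: many "rows" sharing the same column names, as in a
# users CF with username/email columns repeated across every row.
repetitive = b"".join(
    b"username:user%05d;email:user%05d@example.com;" % (i, i)
    for i in range(1000)
)

# High-entropy data: stands in for pre-compressed or encrypted values.
high_entropy = os.urandom(len(repetitive))

ratio_rep = len(zlib.compress(repetitive)) / len(repetitive)
ratio_ent = len(zlib.compress(high_entropy)) / len(high_entropy)

print("repetitive:   %.2f" % ratio_rep)  # far below 1.0
print("high entropy: %.2f" % ratio_ent)  # about 1.0, i.e. no gain
```

Snappy compresses less aggressively than zlib but is much faster, which is why it is the usual choice for sstable_compression.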

On Sun, Sep 23, 2012 at 10:29 AM, Tamar Fraenkel <ta...@tok-media.com>wrote:

> Hi!
> In datastax documentation<http://www.datastax.com/docs/1.0/ddl/column_family>there is an explanation of what CFs are a good fit for compression:
>
> When to Use Compression
>
> Compression is best suited for column families where there are many rows,
> with each row having the same columns, or at least many columns in common.
> For example, a column family containing user data such as username, email,
> etc., would be a good candidate for compression. The more similar the data
> across rows, the greater the compression ratio will be, and the larger the
> gain in read performance.
>
> Compression is not as good a fit for column families where each row has a
> different set of columns, or where there are just a few very wide rows.
> Dynamic column families such as this will not yield good compression ratios.
>
> I have many column families where rows share some of the columns and have
> varied number of unique columns per row.
> For example, I have a CF where each row has ~13 shared columns, but
> between 0 to many unique columns. Will such CF be a good fit for
> compression?
>
> More generally, is there a rule of thumb for how many shared columns (or
> percentage of columns which are shared) is considered a good fit for
> compression?
>
> Thanks,
>
> *Tamar Fraenkel *
> Senior Software Engineer, TOK Media
>
> [image: Inline image 1]
>
> tamar@tok-media.com
> Tel:   +972 2 6409736
> Mob:  +972 54 8356490
> Fax:   +972 2 5612956
>
>
>
>


-- 
Tyler Hobbs
DataStax <http://datastax.com/>