Posted to user@cassandra.apache.org by Jason Baker <ja...@apture.com> on 2011/08/11 03:18:24 UTC

Tuning a column family for archival

I have a column family that I'm using to archive records.  They're mostly
kept around for historical purposes.  Aside from that, they're mostly
considered deleted.  It's probably going to be very rare that anyone reads
from this table *ever*.  I don't really even write to it that much.

Does anyone have advice for me as far as how (or if) I should tune this
table with that in mind?  My concern is less speeding up access to this
table than it is making sure that it doesn't impact the performance of any
other column families in any way.

Here's the data from nodetool cfstats (although this table was just created a
few days ago):

Column Family: ArchivedLinks
SSTable count: 1
Space used (live): 29580801
Space used (total): 97838786
Number of Keys (estimate): 93184
Memtable Columns Count: 7497
Memtable Data Size: 3223587
Memtable Switch Count: 11
Read Count: 0
Read Latency: NaN ms.
Write Count: 139091
Write Latency: 0.007 ms.
Pending Tasks: 0
Key cache: disabled
Row cache: disabled
Compacted row minimum size: 259
Compacted row maximum size: 372
Compacted row mean size: 311

Re: Tuning a column family for archival

Posted by Jonathan Ellis <jb...@gmail.com>.
No.  He's saying that one of the points of mmapping the data files is
that the OS is free to keep only the files that are actually used in
the page cache.  Since this data is backed by an actual file, swap is
not involved.
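
To illustrate the point about file-backed mappings, here is a minimal
sketch in Python (not Cassandra's own code). The helper name
read_through_mmap and the temporary file standing in for an SSTable data
file are assumptions made for the example. Pages of a read-only mapping
like this one are backed by the file itself, so the kernel can drop them
under memory pressure and re-read them from disk later without touching
swap.

```python
import mmap
import os
import tempfile


def read_through_mmap(path, offset, length):
    """Read `length` bytes at `offset` from a file via a read-only mapping."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ makes the mapping read-only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # Only the pages touched by this slice need to be resident.
            return m[offset:offset + length]


# Hypothetical stand-in for an SSTable data file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 8192 + b"archived-row")
    path = tmp.name

chunk = read_through_mmap(path, 8192, 12)
os.remove(path)
```

Pages of the mapping that are never touched do not have to be kept in
memory, which is why an unread archival column family should cost little
page cache.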

On Thu, Aug 11, 2011 at 12:59 PM, Jason Baker <ja...@apture.com> wrote:
> On Thu, Aug 11, 2011 at 6:14 AM, Edward Capriolo <ed...@gmail.com>
> wrote:
>>
>> In many regards Cassandra automatically does the correct thing. Other than
>> the costs of the bloom filters for the table size being in ram, if you never
>> read or write to those sstables and you are not reusing the row key, the OS
>> will page out those tables and they will not take any cache space.
>
> Forgive me if I'm being dense, but the wiki says that an instance shouldn't
> be swapping at all[1].  I presently have swappiness turned down to 0.  Are
> you saying that there may be benefits to allowing some swap usage?
> [1] http://wiki.apache.org/cassandra/MemtableThresholds?highlight=%28swap%29#Virtual_Memory_and_Swap



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com

Re: Tuning a column family for archival

Posted by Jason Baker <ja...@apture.com>.
On Thu, Aug 11, 2011 at 6:14 AM, Edward Capriolo <ed...@gmail.com> wrote:
>
> In many regards Cassandra automatically does the correct thing. Other than
> the costs of the bloom filters for the table size being in ram, if you never
> read or write to those sstables and you are not reusing the row key, the OS
> will page out those tables and they will not take any cache space.
>

Forgive me if I'm being dense, but the wiki says that an instance shouldn't
be swapping at all[1].  I presently have swappiness turned down to 0.  Are
you saying that there may be benefits to allowing some swap usage?

[1]
http://wiki.apache.org/cassandra/MemtableThresholds?highlight=%28swap%29#Virtual_Memory_and_Swap

Re: Tuning a column family for archival

Posted by Edward Capriolo <ed...@gmail.com>.
On Thu, Aug 11, 2011 at 12:07 AM, aaron morton <aa...@thelastpickle.com> wrote:

> There's not much to do other than turn off the caches (which you have done)
> and leave it alone.
>
> If you want to poke around perhaps look at the compaction settings (from
> CLI help):
>
> - max_compaction_threshold: The maximum number of SSTables allowed before a
> minor compaction is forced. Default is 32, setting to 0 disables minor
> compactions.
>
> Decreasing this will cause minor compactions to start more frequently and
> be less intensive. The min_compaction_threshold and
> max_compaction_threshold
> boundaries are the number of tables Cassandra attempts to merge together at
> once.
>
> - min_compaction_threshold: The minimum number of SSTables needed
> to start a minor compaction. Default is 4, setting to 0 disables minor
> compactions.
>
> Increasing this will cause minor compactions to start less frequently and
> be more intensive. The min_compaction_threshold and
> max_compaction_threshold
> boundaries are the number of tables Cassandra attempts to merge together at
> once.
>
> You *could* disable compaction and then manually compact at the best time.
> If you are not doing many updates I'd wait and see.
>
> You could repair different CFs at different times. This would help with
> reducing the amount of data that is used to build the Merkle trees, but
> there is a bug about streaming the differences that means extra data is
> streamed (can't remember the bug number now).
>
> I'd wait to see if there is an issue first.
>
> Cheers
>
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
>
>
In many regards Cassandra automatically does the correct thing. Other than
the cost of keeping the bloom filters for the table in RAM, if you never
read or write to those sstables and you are not reusing the row key, the OS
will page out those tables and they will not take any cache space.
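
For a rough sense of the bloom filter cost, here is a back-of-the-envelope
sketch in Python. It uses the textbook sizing formula for an optimally
configured bloom filter, m = -n * ln(p) / (ln 2)^2 bits. The thread does
not say how Cassandra sizes its filters, so the 1% false-positive rate
and the function name bloom_filter_bytes are assumptions. Only the key
estimate comes from the cfstats output in the original post.

```python
import math


def bloom_filter_bytes(num_keys, false_positive_rate):
    """Textbook size in bytes of an optimal bloom filter for num_keys keys."""
    bits = -num_keys * math.log(false_positive_rate) / (math.log(2) ** 2)
    return math.ceil(bits / 8)


estimated_keys = 93184   # "Number of Keys (estimate)" from the cfstats output
assumed_fp_rate = 0.01   # assumption: not a value taken from Cassandra

size = bloom_filter_bytes(estimated_keys, assumed_fp_rate)
print(f"approx. bloom filter size: {size / 1024:.0f} KiB")
```

Under these assumptions the filter for this column family comes to
roughly a hundred kilobytes, so the fixed RAM cost is small at this size.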

Compaction is going to change a lot soon. I know one of the tickets in the
works gives SSTables a max size, so compaction should do something like a
one-to-one rewrite of these tables, which should not be very intensive.

Re: Tuning a column family for archival

Posted by aaron morton <aa...@thelastpickle.com>.
There's not much to do other than turn off the caches (which you have done) and leave it alone. 

If you want to poke around perhaps look at the compaction settings (from CLI help):

- max_compaction_threshold: The maximum number of SSTables allowed before a
minor compaction is forced. Default is 32, setting to 0 disables minor
compactions.

Decreasing this will cause minor compactions to start more frequently and
be less intensive. The min_compaction_threshold and max_compaction_threshold
boundaries are the number of tables Cassandra attempts to merge together at
once.

- min_compaction_threshold: The minimum number of SSTables needed
to start a minor compaction. Default is 4, setting to 0 disables minor
compactions.

Increasing this will cause minor compactions to start less frequently and
be more intensive. The min_compaction_threshold and max_compaction_threshold
boundaries are the number of tables Cassandra attempts to merge together at
once. 
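
As a rough illustration of how the two thresholds described above
interact, here is a simplified Python model. It is not Cassandra's
implementation: it ignores how SSTables are grouped by size and only
encodes what the help text states. The function name
sstables_to_compact is hypothetical.

```python
def sstables_to_compact(sstable_count, min_threshold=4, max_threshold=32):
    """Return how many SSTables one minor compaction would merge (0 = none).

    Simplified model of the CLI help text: a minor compaction needs at
    least min_threshold SSTables, merges at most max_threshold at once,
    and setting either threshold to 0 disables minor compactions.
    """
    if min_threshold == 0 or max_threshold == 0:
        return 0
    if sstable_count < min_threshold:
        return 0
    return min(sstable_count, max_threshold)


# With the defaults, the single SSTable reported by cfstats triggers nothing.
print(sstables_to_compact(1))
# Raising min_threshold makes compactions rarer but larger when they happen.
print(sstables_to_compact(10, min_threshold=8))
```

In this model a rarely written column family stays below the minimum
threshold for a long time, which fits the "wait and see" advice.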

You *could* disable compaction and then manually compact at the best time. If you are not doing many updates I'd wait and see. 

You could repair different CFs at different times. This would help with reducing the amount of data that is used to build the Merkle trees, but there is a bug about streaming the differences that means extra data is streamed (can't remember the bug number now).

I'd wait to see if there is an issue first. 

Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com
