Posted to user@cassandra.apache.org by Radim Kolar <hs...@sendmail.cz> on 2011/11/17 11:02:05 UTC

split large sstable

Is there some simple way to split a large sstable into several smaller
ones? I increased min_compaction_threshold (smaller tables seem to get
better file offset caching from the OS) and now I need to reshuffle the data
into smaller sstables. Running several cluster-wide repairs worked well, but
the largest table was left. I have an 80 GB sstable and need to split it
into roughly 10 GB ones.

Re: split large sstable

Posted by Edward Capriolo <ed...@gmail.com>.
On Mon, Nov 21, 2011 at 11:26 AM, sridhar basam <sr...@basam.org> wrote:

>
>
> On Mon, Nov 21, 2011 at 10:34 AM, Edward Capriolo <ed...@gmail.com>wrote:
>
>>
>>
>> On Mon, Nov 21, 2011 at 10:07 AM, Dan Hendry <da...@gmail.com>wrote:
>>
>>> Pretty sure your argument about indirect blocks making large files
>>> inefficient only pertains to ext2/3 and not ext4. It seems ext4 replaces
>>> the
>>> 'indirect block' approach with extents
>>> (
>>> http://kernelnewbies.org/Ext4#head-7c5fd53118e8b888345b95cc11756346be4268f4
>>> , http://en.wikipedia.org/wiki/Ext4#Features).
>>>
>>> I was not aware of this difference in the file systems and it seems to
>>> be a
>>> compelling reason ext4 should be chosen (over ext3) for Cassandra - at
>>> least
>>> when using size tiered compaction.
>>>
>>>
> If you are using a Redhat distribution, at least in the 5.x series, make
> sure that you pass in a '-O extent' option when you create the filesystem.
> Otherwise extents are not enabled by default.
>
>
>> IMHO there is only one good reason left to use ext3. For a 100MB /boot
>> partition since the boot loaders have an easier time with it.
>>
>> EXT4 is better then EXT3 in every way. It is the default formatting for
>> RHEL. Do not fight the future.
>>
>>
>> http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/a_great_reason_to_use
>>
>>
> I agree with ext4 being superior to ext3 but some constructive feedback
> about your graphs.
>
> You might want to add a legend or point out the before and after if you
> want to show difference between ext3 and ext4. I can kind of see that
> something might have changed on the Friday but without a legend it makes it
> hard to see the point you are trying to make.
>
>  Sridhar
>
>
To be clear, the sequence was: stop Hadoop, convert ext3 to ext4, start
Hadoop. The black represents io-wait. As you can see, after the conversion
iowait dropped significantly.

http://www.phoronix.com/scan.php?page=article&item=ext4_benchmarks&num=1

Ext3 was a fine file system, but it was not designed for today's large
disks.

In particular, something that hurts Cassandra is that ext3's design is not
great at deleting large files, and with Cassandra's write-and-compact model
you delete large files often.

So these facts:
1) major distributions now install ext4 by default
2) ext4 wins most/all benchmarks vs ext3
3) ext4 can handle bigger files and larger filesystems than ext3
4) the large-file deletion issue mentioned above
5) my graphs :)
6) you can upgrade an ext3 filesystem to ext4 without a reformat (although it
does take an unmount and an fsck)

I cannot see why anyone would run ext3 anymore.
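
Regarding point 6, here is a rough sketch of the in-place conversion
(illustrative only: the device path is a placeholder, the filesystem must be
unmounted first, and note that files written before the conversion stay
indirect-mapped until they are rewritten - which Cassandra's compaction will
do over time anyway):

    import subprocess

    DEVICE = "/dev/sdb1"  # placeholder; point this at the unmounted data volume

    def convert_ext3_to_ext4(device):
        """Enable the ext4 on-disk features on an existing ext3 filesystem,
        then run the forced fsck that the conversion requires."""
        # Turn on extents, uninitialised block groups and hashed directory
        # indexes; this is the standard ext3 -> ext4 upgrade path.
        subprocess.run(["tune2fs", "-O", "extents,uninit_bg,dir_index", device],
                       check=True)
        # Enabling uninit_bg requires a full forced filesystem check afterwards.
        subprocess.run(["e2fsck", "-fp", device], check=True)

    # convert_ext3_to_ext4(DEVICE)  # uncomment once the volume is unmounted and backed up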

Re: split large sstable

Posted by sridhar basam <sr...@basam.org>.
On Mon, Nov 21, 2011 at 10:34 AM, Edward Capriolo <ed...@gmail.com>wrote:

>
>
> On Mon, Nov 21, 2011 at 10:07 AM, Dan Hendry <da...@gmail.com>wrote:
>
>> Pretty sure your argument about indirect blocks making large files
>> inefficient only pertains to ext2/3 and not ext4. It seems ext4 replaces
>> the
>> 'indirect block' approach with extents
>> (
>> http://kernelnewbies.org/Ext4#head-7c5fd53118e8b888345b95cc11756346be4268f4
>> , http://en.wikipedia.org/wiki/Ext4#Features).
>>
>> I was not aware of this difference in the file systems and it seems to be
>> a
>> compelling reason ext4 should be chosen (over ext3) for Cassandra - at
>> least
>> when using size tiered compaction.
>>
>>
If you are using a Redhat distribution, at least in the 5.x series, make
sure that you pass in a '-O extent' option when you create the filesystem.
Otherwise extents are not enabled by default.
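
If you are not sure whether an existing filesystem already has extents
turned on, the feature list that tune2fs prints will tell you; a minimal
sketch (the device path is illustrative):

    import subprocess

    def has_extents(device):
        """Return True if tune2fs lists 'extent' among the filesystem features."""
        out = subprocess.run(["tune2fs", "-l", device],
                             capture_output=True, text=True, check=True).stdout
        for line in out.splitlines():
            if line.startswith("Filesystem features:"):
                return "extent" in line.split(":", 1)[1].split()
        return False

    # print(has_extents("/dev/sdb1"))  # illustrative device path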


> IMHO there is only one good reason left to use ext3. For a 100MB /boot
> partition since the boot loaders have an easier time with it.
>
> EXT4 is better then EXT3 in every way. It is the default formatting for
> RHEL. Do not fight the future.
>
>
> http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/a_great_reason_to_use
>
>
I agree with ext4 being superior to ext3, but here is some constructive
feedback about your graphs.

You might want to add a legend or point out the before and after if you
want to show the difference between ext3 and ext4. I can kind of see that
something might have changed on the Friday, but without a legend it is hard
to see the point you are trying to make.

 Sridhar

Re: split large sstable

Posted by Edward Capriolo <ed...@gmail.com>.
On Mon, Nov 21, 2011 at 10:07 AM, Dan Hendry <da...@gmail.com>wrote:

> Pretty sure your argument about indirect blocks making large files
> inefficient only pertains to ext2/3 and not ext4. It seems ext4 replaces
> the
> 'indirect block' approach with extents
> (
> http://kernelnewbies.org/Ext4#head-7c5fd53118e8b888345b95cc11756346be4268f4
> , http://en.wikipedia.org/wiki/Ext4#Features).
>
> I was not aware of this difference in the file systems and it seems to be a
> compelling reason ext4 should be chosen (over ext3) for Cassandra - at
> least
> when using size tiered compaction.
>
> Dan
>
> -----Original Message-----
> From: Radim Kolar [mailto:hsn@sendmail.cz]
> Sent: November-19-11 19:42
> To: user@cassandra.apache.org
> Subject: Re: split large sstable
>
> On 17.11.2011 17:42, Dan Hendry wrote:
> > What do you mean by ' better file offset caching'? Presumably you mean
> > 'better page cache hit rate'?
> fs metadata used to find blocks in smaller files are cached better.
> Large files are using indirect blocks and you need more reads to find
> correct block during seek syscall. For example if large file is using 3
> indirect levels, you need 3xdisk seek to find correct block.
>
> http://computer-forensics.sans.org/blog/2008/12/24/understanding-indirect-blocks-in-unix-file-systems/
> Metadata caching in OS is far worse then file caching - one "find /"
> will effectively nullify metadata cache.
>
> If cassandra could use raw storage. it will eliminate fs overhead and it
> could be over 100% faster on reads because fragmentation will be an
> exception - no need to design fs like FAT or UFS where designers expects
> files to be stored in non continuous area on disk.  Implementing
> something log based like - http://logfs.sourceforge.net/ will be enough.
> Cleaning will not be much needed because compaction will clean it
> naturally.
>
> > Perhaps what you are actually seeing is row fragmentation across your
> > SSTables? Easy to check with nodetool cfhistograms (SSTables column).
> i have 1.5% hitrate to 2 sstables and 3% to hit 3 sstables. Its pretty
> low with min. compaction set to 5, i will probably set it to 6.
>
> I would really like to see tests with user defined sizes and file counts
> used for tiered compaction because it work best if you do not leave
> largest file alone in bucket. If your data in cassandra are not growing,
> it can be better fine tuned. i havent done experiments with it but maybe
> max sstable size defined per cf will be enough. Lets say i have 5 GB
> data per CF - ideal setting will be max sstable size to slightly less
> then 1 GB. Cassandra will not keep old data stuck in one 4 GB compacted
> sstable waiting for other 4 GB sstables to be created before compaction
> will remove old data.
>
> > To answer your question, I know of no tools to split SSTables. If you
> want
> > to switch compaction strategies, levelled compaction (1.0.x) creates many
> > smaller sstables instead of fewer, bigger ones.
> I dont use levelled compaction, it compacts too often. It might get
> better if it can be tuned how many and how large files to use at each
> level. But i will try to switch to levelled compaction and back again it
> might do what i want.
>
>
IMHO there is only one good reason left to use ext3: a 100MB /boot
partition, since the boot loaders have an easier time with it.

EXT4 is better than EXT3 in every way. It is the default filesystem for
RHEL. Do not fight the future.

http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/a_great_reason_to_use

Re: split large sstable

Posted by Zhu Han <sc...@gmail.com>.
best regards,
Zhu Han (韩竹)

坚果铺子 <https://jianguopuzi.com>, the simplest and easiest-to-use cloud storage
Sync files, share photos, back up your documents!



On Mon, Nov 21, 2011 at 11:07 PM, Dan Hendry <da...@gmail.com>wrote:

> Pretty sure your argument about indirect blocks making large files
> inefficient only pertains to ext2/3 and not ext4. It seems ext4 replaces
> the
> 'indirect block' approach with extents
> (
> http://kernelnewbies.org/Ext4#head-7c5fd53118e8b888345b95cc11756346be4268f4
> , http://en.wikipedia.org/wiki/Ext4#Features).
>



>
> I was not aware of this difference in the file systems and it seems to be a
> compelling reason ext4 should be chosen (over ext3) for Cassandra - at
> least
> when using size tiered compaction.
>

An alternative is XFS, which is also extent based.

>
> Dan
>
> -----Original Message-----
> From: Radim Kolar [mailto:hsn@sendmail.cz]
> Sent: November-19-11 19:42
> To: user@cassandra.apache.org
> Subject: Re: split large sstable
>
> On 17.11.2011 17:42, Dan Hendry wrote:
> > What do you mean by ' better file offset caching'? Presumably you mean
> > 'better page cache hit rate'?
> fs metadata used to find blocks in smaller files are cached better.
> Large files are using indirect blocks and you need more reads to find
> correct block during seek syscall. For example if large file is using 3
> indirect levels, you need 3xdisk seek to find correct block.
>
> http://computer-forensics.sans.org/blog/2008/12/24/understanding-indirect-blocks-in-unix-file-systems/
> Metadata caching in OS is far worse then file caching - one "find /"
> will effectively nullify metadata cache.
>
> If cassandra could use raw storage. it will eliminate fs overhead and it
> could be over 100% faster on reads because fragmentation will be an
> exception - no need to design fs like FAT or UFS where designers expects
> files to be stored in non continuous area on disk.  Implementing
> something log based like - http://logfs.sourceforge.net/ will be enough.
> Cleaning will not be much needed because compaction will clean it
> naturally.
>
> > Perhaps what you are actually seeing is row fragmentation across your
> > SSTables? Easy to check with nodetool cfhistograms (SSTables column).
> i have 1.5% hitrate to 2 sstables and 3% to hit 3 sstables. Its pretty
> low with min. compaction set to 5, i will probably set it to 6.
>
> I would really like to see tests with user defined sizes and file counts
> used for tiered compaction because it work best if you do not leave
> largest file alone in bucket. If your data in cassandra are not growing,
> it can be better fine tuned. i havent done experiments with it but maybe
> max sstable size defined per cf will be enough. Lets say i have 5 GB
> data per CF - ideal setting will be max sstable size to slightly less
> then 1 GB. Cassandra will not keep old data stuck in one 4 GB compacted
> sstable waiting for other 4 GB sstables to be created before compaction
> will remove old data.
>
> > To answer your question, I know of no tools to split SSTables. If you
> want
> > to switch compaction strategies, levelled compaction (1.0.x) creates many
> > smaller sstables instead of fewer, bigger ones.
> I dont use levelled compaction, it compacts too often. It might get
> better if it can be tuned how many and how large files to use at each
> level. But i will try to switch to levelled compaction and back again it
> might do what i want.
>
>

RE: split large sstable

Posted by Dan Hendry <da...@gmail.com>.
Pretty sure your argument about indirect blocks making large files
inefficient only pertains to ext2/3 and not ext4. It seems ext4 replaces the
'indirect block' approach with extents
(http://kernelnewbies.org/Ext4#head-7c5fd53118e8b888345b95cc11756346be4268f4
, http://en.wikipedia.org/wiki/Ext4#Features). 

I was not aware of this difference in the file systems and it seems to be a
compelling reason ext4 should be chosen (over ext3) for Cassandra - at least
when using size tiered compaction. 
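
If you want to see how a particular sstable is actually laid out on disk,
filefrag reports how many extents (contiguous pieces) the file occupies; a
small sketch, with a purely illustrative data path:

    import subprocess

    def extent_count(path):
        """Parse the 'N extents found' summary that plain `filefrag` prints."""
        out = subprocess.run(["filefrag", path],
                             capture_output=True, text=True, check=True).stdout
        # output looks like "/path/to/file: 7 extents found"
        return int(out.rsplit(":", 1)[1].split()[0])

    # extent_count("/var/lib/cassandra/data/Keyspace1/Standard1-hc-42-Data.db")  # illustrative path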

Dan

-----Original Message-----
From: Radim Kolar [mailto:hsn@sendmail.cz] 
Sent: November-19-11 19:42
To: user@cassandra.apache.org
Subject: Re: split large sstable

On 17.11.2011 17:42, Dan Hendry wrote:
> What do you mean by ' better file offset caching'? Presumably you mean
> 'better page cache hit rate'?
fs metadata used to find blocks in smaller files are cached better. 
Large files are using indirect blocks and you need more reads to find 
correct block during seek syscall. For example if large file is using 3 
indirect levels, you need 3xdisk seek to find correct block. 
http://computer-forensics.sans.org/blog/2008/12/24/understanding-indirect-blocks-in-unix-file-systems/
Metadata caching in OS is far worse then file caching - one "find /" 
will effectively nullify metadata cache.

If cassandra could use raw storage. it will eliminate fs overhead and it 
could be over 100% faster on reads because fragmentation will be an 
exception - no need to design fs like FAT or UFS where designers expects 
files to be stored in non continuous area on disk.  Implementing 
something log based like - http://logfs.sourceforge.net/ will be enough. 
Cleaning will not be much needed because compaction will clean it naturally.

> Perhaps what you are actually seeing is row fragmentation across your
> SSTables? Easy to check with nodetool cfhistograms (SSTables column).
i have 1.5% hitrate to 2 sstables and 3% to hit 3 sstables. Its pretty 
low with min. compaction set to 5, i will probably set it to 6.

I would really like to see tests with user defined sizes and file counts 
used for tiered compaction because it work best if you do not leave 
largest file alone in bucket. If your data in cassandra are not growing, 
it can be better fine tuned. i havent done experiments with it but maybe 
max sstable size defined per cf will be enough. Lets say i have 5 GB 
data per CF - ideal setting will be max sstable size to slightly less 
then 1 GB. Cassandra will not keep old data stuck in one 4 GB compacted 
sstable waiting for other 4 GB sstables to be created before compaction 
will remove old data.

> To answer your question, I know of no tools to split SSTables. If you want
> to switch compaction strategies, levelled compaction (1.0.x) creates many
> smaller sstables instead of fewer, bigger ones.
I dont use levelled compaction, it compacts too often. It might get 
better if it can be tuned how many and how large files to use at each 
level. But i will try to switch to levelled compaction and back again it 
might do what i want.


Re: split large sstable

Posted by Radim Kolar <hs...@sendmail.cz>.
On 17.11.2011 17:42, Dan Hendry wrote:
> What do you mean by ' better file offset caching'? Presumably you mean
> 'better page cache hit rate'?
The fs metadata used to find blocks in smaller files is cached better.
Large files use indirect blocks, so you need more reads to find the correct
block during a seek syscall. For example, if a large file uses 3 indirect
levels, you need 3 extra disk seeks to find the correct block.
http://computer-forensics.sans.org/blog/2008/12/24/understanding-indirect-blocks-in-unix-file-systems/
Metadata caching in the OS is far worse than file caching - one "find /"
will effectively nullify the metadata cache.
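
For a rough sense of the numbers, here is a sketch assuming the classic
ext2/3 on-disk layout (4 KiB blocks, 4-byte block pointers, 12 direct
pointers in the inode); the real figures depend on the block size chosen at
mkfs time:

    BLOCK = 4096           # bytes per block
    PTRS = BLOCK // 4      # block pointers per indirect block (1024)
    DIRECT = 12            # direct block pointers in the inode

    def indirect_levels(file_size):
        """Worst-case number of indirect levels needed to reach the last block."""
        blocks = -(-file_size // BLOCK)  # ceiling division
        if blocks <= DIRECT:
            return 0
        if blocks <= DIRECT + PTRS:
            return 1
        if blocks <= DIRECT + PTRS + PTRS ** 2:
            return 2
        return 3

    GB = 2 ** 30
    for size, label in ((40 * 1024, "40 KiB"), (2 * 2 ** 20, "2 MiB"),
                        (1 * GB, "1 GiB"), (80 * GB, "80 GiB")):
        print("%7s -> %d indirect level(s)" % (label, indirect_levels(size)))
    # Files tip over into triple indirection just past ~4 GiB, so seeks near
    # the end of an 80 GB sstable can cost up to 3 extra metadata reads when
    # those indirect blocks are not cached.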

If Cassandra could use raw storage, it would eliminate fs overhead and
could be over 100% faster on reads, because fragmentation would be the
exception - there would be no need for a filesystem design like FAT or UFS,
where the designers expect files to be stored in non-contiguous areas on
disk. Implementing something log-based like http://logfs.sourceforge.net/
would be enough. Cleaning would not be much needed because compaction would
clean it naturally.

> Perhaps what you are actually seeing is row fragmentation across your
> SSTables? Easy to check with nodetool cfhistograms (SSTables column).
I have a 1.5% hit rate to 2 sstables and 3% hitting 3 sstables. That is
pretty low with min. compaction set to 5; I will probably set it to 6.
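
Plugging those percentages in (and assuming the remaining ~95.5% of reads
touch a single sstable, which is an inference rather than something stated
above), the weighted average works out to roughly 1.08 sstables per read:

    # fractions of reads touching 1, 2 and 3 sstables, from the figures above
    hit_fractions = {1: 0.955, 2: 0.015, 3: 0.03}
    avg = sum(n * f for n, f in hit_fractions.items())
    print("average sstables per read ~ %.3f" % avg)  # ~1.075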

I would really like to see tests with user-defined sizes and file counts
used for tiered compaction, because it works best if you do not leave the
largest file alone in its bucket. If your data in Cassandra is not growing,
it can be fine-tuned better. I haven't done experiments with it, but maybe a
max sstable size defined per CF would be enough. Let's say I have 5 GB of
data per CF - the ideal setting would be a max sstable size of slightly less
than 1 GB. Cassandra would then not keep old data stuck in one 4 GB
compacted sstable, waiting for other 4 GB sstables to be created before
compaction removes the old data.
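
To make the "stranded big sstable" behaviour concrete, here is a rough
sketch of how size-tiered compaction groups files into buckets (an
approximation using the commonly cited defaults, not Cassandra's actual
code): a file joins a bucket when its size is within about 0.5x-1.5x of the
bucket's average, and a bucket is only compacted once it holds
min_compaction_threshold files, so a lone 80 GB sstable just sits there
until peers of similar size appear.

    GB = 2 ** 30

    def bucket_sstables(sizes, bucket_low=0.5, bucket_high=1.5):
        """Group sstable sizes into buckets of 'similar enough' files
        (approximation of size-tiered compaction's grouping rule)."""
        buckets = []  # list of [average_size, [member_sizes]]
        for size in sorted(sizes):
            for bucket in buckets:
                avg, members = bucket
                if bucket_low * avg <= size <= bucket_high * avg:
                    members.append(size)
                    bucket[0] = sum(members) / len(members)
                    break
            else:
                buckets.append([size, [size]])
        return [members for _, members in buckets]

    # One 80 GB sstable plus a few ~10 GB ones: the 80 GB file ends up alone
    # in its bucket, so with min_compaction_threshold=4 it is never picked up.
    sizes = [80 * GB, 10 * GB, 9 * GB, 11 * GB, 2 * GB]
    for members in bucket_sstables(sizes):
        print([round(s / GB, 1) for s in members])
    # prints: [2.0], [9.0, 10.0, 11.0], [80.0]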

> To answer your question, I know of no tools to split SSTables. If you want
> to switch compaction strategies, levelled compaction (1.0.x) creates many
> smaller sstables instead of fewer, bigger ones.
I don't use levelled compaction; it compacts too often. It might get
better if you could tune how many and how large the files are at each
level. But I will try switching to levelled compaction and back again - it
might do what I want.

RE: split large sstable

Posted by Dan Hendry <da...@gmail.com>.
What do you mean by 'better file offset caching'? Presumably you mean
'better page cache hit rate'? Out of curiosity, why do you think this? What
data are you seeing which makes you think it's better? I am certainly not
even close to a virtual memory or page caching expert, but I am pretty sure
file size does not matter (assuming file sizes are significantly greater
than the page size, which I believe is 4k).

Perhaps what you are actually seeing is row fragmentation across your
SSTables? Easy to check with nodetool cfhistograms (SSTables column).

To answer your question, I know of no tools to split SSTables. If you want
to switch compaction strategies, levelled compaction (1.0.x) creates many
smaller sstables instead of fewer, bigger ones. Although it is workload
dependent, increasing min_compaction_threshold for size-tiered compaction is
probably a bad idea, since it will increase row fragmentation across
SSTables and therefore increase the io/seeking requirements for reads
(particularly for column-range or non-named-column queries). The only reason
to do so is to reduce the frequency of compaction (disk io considerations).

Dan

-----Original Message-----
From: Radim Kolar [mailto:hsn@sendmail.cz] 
Sent: November-17-11 5:02
To: user@cassandra.apache.org
Subject: split large sstable

Is there some simple way how to split large sstable into several smaller 
ones? I increased  min_compaction_threshold (smaller tables seems to get 
better file offset caching from OS) and now i need to reshuffle data to 
smaller sstables, running several cluster wide repairs worked well just 
largest table was left. I have 80 GB sstable and need to split it to 
about 10 GB ones.