Posted to user@cassandra.apache.org by Sheng Chen <ch...@gmail.com> on 2011/03/29 10:55:46 UTC

Compaction doubles disk space

I use 'nodetool compact' command to start a compaction.
I can understand that extra disk space is required during the compaction,
but after the compaction, the extra space is not released.

Before compaction:
SSTable count: 10
space used (live): 19G
space used (total): 21G

After compaction:
sstable count: 1
space used (live): 19G
space used (total): 42G


BTW, given that compaction requires double the disk space, does it mean that I
should never fill more than half of my total disk space?
e.g. if I have 505GB of data on a 1TB disk, I cannot even delete any data at all.

Re: Compaction doubles disk space

Posted by Sheng Chen <ch...@gmail.com>.
It really helps. Thank you very much.

Sheng

2011/3/30 aaron morton <aa...@thelastpickle.com>

> When a compaction needs to write a file, Cassandra will try to find a place
> to put the new file, based on an estimate of its size. If it cannot find
> enough space it will trigger a GC, which will delete any previously compacted
> (and so unneeded) SSTables. The same thing will happen when a new SSTable
> needs to be written to disk.
>
> Minor compaction groups the SSTables on disk into buckets of similar sizes
> (http://wiki.apache.org/cassandra/MemtableSSTable); each bucket is
> processed in its own compaction task. Under 0.7, compaction is single
> threaded, and when each compaction task starts it will try to find space on
> disk and, if necessary, trigger a GC to free space.
>
> SSTables are immutable on disk; compaction cannot delete data from them, as
> they are also used to serve read requests at the same time. To do so would
> require locking around (regions of) the file.
>
> Also, as far as I understand, we cannot immediately delete files because
> other operations (including repair) may be using them. The data in the
> pre-compacted files is just as correct as the data in the compacted file;
> the latter is just more compact. So the easiest thing to do is let the JVM
> sort out if anything else is using them.
>
> Perhaps it could be improved by actively tracking which files are in use so
> they can be deleted sooner. But right now, so long as unused space is freed
> when needed, it's working as designed AFAIK.
>
> That's my understanding; hope it helps explain why it works that way.
> Aaron
>
> On 30 Mar 2011, at 13:32, Sheng Chen wrote:
>
> Yes.
> I think at least we can remove the tombstones for each sstable first, and
> then do the merge.
>
> 2011/3/29 Karl Hiramoto <ka...@hiramoto.org>
>
>> Would it be possible to improve the current compaction disk space issue by
>> compacting only a few SSTables at a time and then immediately deleting the
>> old ones?  Looking at the logs, it seems like deletions of old SSTables are
>> taking longer than necessary.
>>
>> --
>> Karl
>>
>
>
>

Re: Compaction doubles disk space

Posted by Karl Hiramoto <ka...@hiramoto.org>.
On 3/30/2011 12:39 PM, aaron morton wrote:
> Checked the code again, got it a bit wrong. When getting a path to 
> flush a memtable (and to write an incoming stream) via 
> cfs.getFlushPath() the code does not invoke GC if there is not enough 
> space.
>
> One reason for not doing this could be that when we do it during 
> compaction we wait for 20 seconds before checking disk space again. 
> However the write happens on a separate flusher pool.
>
> created https://issues.apache.org/jira/browse/CASSANDRA-2404 to ask if 
> we can/should reclaim space during flush.
>
> Karl, what version are you using and have you altered the compaction 
> thresholds ?
>

0.7.4, and no.
This behavior has been there through all of the 0.7.x series.

Thanks
--
Karl

Re: Compaction doubles disk space

Posted by aaron morton <aa...@thelastpickle.com>.
Checked the code again, got it a bit wrong. When getting a path to flush a memtable (and to write an incoming stream) via cfs.getFlushPath() the code does not invoke GC if there is not enough space. 

One reason for not doing this could be that when we do it during compaction we wait for 20 seconds before checking disk space again. However the write happens on a separate flusher pool.

created https://issues.apache.org/jira/browse/CASSANDRA-2404 to ask if we can/should reclaim space during flush. 

Karl, what version are you using and have you altered the compaction thresholds ? 

Aaron

On 30 Mar 2011, at 19:46, Karl Hiramoto wrote:

> On 30/03/2011 09:08, aaron morton wrote:
>> Also, as far as I understand, we cannot immediately delete files because other operations (including repair) may be using them. The data in the pre-compacted files is just as correct as the data in the compacted file; the latter is just more compact. So the easiest thing to do is let the JVM sort out if anything else is using them.
>> 
>> Perhaps it could be improved by actively tracking which files are in use so they can be deleted sooner. But right now, so long as unused space is freed when needed, it's working as designed AFAIK.
>> 
>> 
> 
> I've run out of space on multiple occasions, and we have Nagios alarms going off frequently when disk usage is over 90%. I check Cassandra and the data/ directory is 2x to 4x bigger than it needs to be, and no compaction or repair is currently running. If I restart the Cassandra process or force a GC, it deletes a lot of old SSTables and the data/ directory goes down to 1/2 to 1/4 of the size it was a few minutes ago.
> 
> Under lots of disk pressure here.
> 
> --
> Karl
> 


Re: Compaction doubles disk space

Posted by Karl Hiramoto <ka...@hiramoto.org>.
On 30/03/2011 09:08, aaron morton wrote:
> Also, as far as I understand, we cannot immediately delete files because
> other operations (including repair) may be using them. The data in the
> pre-compacted files is just as correct as the data in the compacted
> file; the latter is just more compact. So the easiest thing to do is let the
> JVM sort out if anything else is using them.
>
> Perhaps it could be improved by actively tracking which files are in
> use so they can be deleted sooner. But right now, so long as unused space
> is freed when needed, it's working as designed AFAIK.
>
>

I've run out of space on multiple occasions, and we have Nagios alarms
going off frequently when disk usage is over 90%. I check Cassandra
and the data/ directory is 2x to 4x bigger than it needs to be, and no
compaction or repair is currently running. If I restart the Cassandra
process or force a GC, it deletes a lot of old SSTables and the data/
directory goes down to 1/2 to 1/4 of the size it was a few minutes ago.

Under lots of disk pressure here.
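
For what it's worth, the same full GC can be requested over JMX without
restarting the node. A small sketch (it assumes JMX is reachable; the port
shown is the 0.7-era default and may differ in your setup):

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    // Ask a remote Cassandra JVM for a full GC via the standard
    // java.lang:type=Memory MBean; finalization can then delete obsolete
    // SSTables. Host/port defaults here are assumptions for illustration.
    public class RemoteForceGc {
        public static void main(String[] args) throws Exception {
            String host = args.length > 0 ? args[0] : "localhost";
            String port = args.length > 1 ? args[1] : "8080"; // 0.7-era default JMX port
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi");
            JMXConnector jmxc = JMXConnectorFactory.connect(url, null);
            try {
                MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
                mbs.invoke(new ObjectName("java.lang:type=Memory"), "gc", null, null);
                System.out.println("Requested GC on " + host + ":" + port);
            } finally {
                jmxc.close();
            }
        }
    }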

--
Karl


Re: Compaction doubles disk space

Posted by aaron morton <aa...@thelastpickle.com>.
When a compaction needs to write a file, Cassandra will try to find a place to put the new file, based on an estimate of its size. If it cannot find enough space it will trigger a GC, which will delete any previously compacted (and so unneeded) SSTables. The same thing will happen when a new SSTable needs to be written to disk.
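
In pseudo-Java, that behaviour looks roughly like this (a simplified sketch
only, not the actual Cassandra code; the method and variable names are
illustrative, and the 20 second wait between checks is the one mentioned
elsewhere in this thread):

    import java.io.File;

    // Sketch of "estimate the size, find a data directory with enough free
    // space, otherwise trigger a GC and retry". Not the real implementation.
    public class CompactionSpaceCheck {
        public static File directoryFor(long estimatedBytes, File[] dataDirs)
                throws InterruptedException {
            for (int attempt = 0; attempt < 2; attempt++) {
                for (File dir : dataDirs) {
                    if (dir.getUsableSpace() > estimatedBytes)
                        return dir;
                }
                // No room anywhere: ask the JVM to collect, which lets
                // finalization delete obsolete (already compacted) SSTables,
                // then wait before checking the disk again.
                System.gc();
                Thread.sleep(20000);
            }
            return null; // still no space; the compaction task would fail here
        }
    }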

Minor compaction groups the SSTables on disk into buckets of similar sizes (http://wiki.apache.org/cassandra/MemtableSSTable); each bucket is processed in its own compaction task. Under 0.7, compaction is single threaded, and when each compaction task starts it will try to find space on disk and, if necessary, trigger a GC to free space.
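
The bucketing idea itself is simple; a minimal sketch (the 0.5x-1.5x
thresholds below are illustrative, not necessarily the exact values
Cassandra uses):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    // Group SSTable sizes (in bytes) into buckets of similar size; each
    // bucket would then be compacted in its own task.
    public class SimilarSizeBuckets {
        public static List<List<Long>> bucket(List<Long> sstableSizes) {
            List<Long> sorted = new ArrayList<Long>(sstableSizes);
            Collections.sort(sorted);
            List<List<Long>> buckets = new ArrayList<List<Long>>();
            for (long size : sorted) {
                List<Long> target = null;
                for (List<Long> b : buckets) {
                    long avg = average(b);
                    // join a bucket whose average size is within roughly 0.5x-1.5x
                    if (size >= avg / 2 && size <= (avg * 3) / 2) { target = b; break; }
                }
                if (target == null) { target = new ArrayList<Long>(); buckets.add(target); }
                target.add(size);
            }
            return buckets;
        }

        private static long average(List<Long> b) {
            long sum = 0;
            for (long s : b) sum += s;
            return sum / b.size();
        }
    }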
 
SSTables are immutable on disk; compaction cannot delete data from them, as they are also used to serve read requests at the same time. To do so would require locking around (regions of) the file.

Also, as far as I understand, we cannot immediately delete files because other operations (including repair) may be using them. The data in the pre-compacted files is just as correct as the data in the compacted file; the latter is just more compact. So the easiest thing to do is let the JVM sort out if anything else is using them.

Perhaps it could be improved by actively tracking which files are in use so they can be deleted sooner. But right now, so long as unused space is freed when needed, it's working as designed AFAIK.
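
"Actively tracking which files are in use" would essentially mean reference
counting the readers of each SSTable; a rough sketch of that idea
(illustrative only, not how 0.7 actually works):

    import java.io.File;
    import java.util.concurrent.atomic.AtomicInteger;

    // Sketch: reference-count an SSTable so its file can be deleted as soon
    // as the last reader (read, repair, stream) finishes, instead of waiting
    // for a JVM GC to run finalizers.
    public class TrackedSSTable {
        private final File dataFile;
        private final AtomicInteger refs = new AtomicInteger(1); // the "live" reference
        private volatile boolean obsolete = false;               // set once compacted away

        public TrackedSSTable(File dataFile) { this.dataFile = dataFile; }

        public void acquire() { refs.incrementAndGet(); }         // a reader starts using it

        public void release() {                                    // a reader finishes
            if (refs.decrementAndGet() == 0 && obsolete)
                dataFile.delete();
        }

        public void markObsolete() {                               // called after compaction
            obsolete = true;
            release();                                             // drop the "live" reference
        }
    }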

That's my understanding; hope it helps explain why it works that way.
Aaron

On 30 Mar 2011, at 13:32, Sheng Chen wrote:

> Yes.
> I think at least we can remove the tombstones for each sstable first, and then do the merge.
> 
> 2011/3/29 Karl Hiramoto <ka...@hiramoto.org>
> Would it be possible to improve the current compaction disk space issue by compacting only a few SSTables at a time and then immediately deleting the old ones?  Looking at the logs, it seems like deletions of old SSTables are taking longer than necessary.
> 
> --
> Karl
> 


Re: Compaction doubles disk space

Posted by Sheng Chen <ch...@gmail.com>.
Yes.
I think at least we can remove the tombstones for each sstable first, and
then do the merge.
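
For what that could look like: a tombstone can only be dropped from a single
SSTable once gc_grace_seconds has passed and the deleted row does not exist in
any other SSTable, otherwise the shadowed data would come back to life. A tiny
sketch of that check (names are illustrative, not Cassandra API):

    import java.util.Set;

    // Illustrative check for purging a tombstone while rewriting one SSTable.
    public class TombstonePurge {
        public static boolean canDrop(String key, long deletionTimeSeconds,
                                      long nowSeconds, long gcGraceSeconds,
                                      Set<String> keysInOtherSSTables) {
            // must be older than gc_grace AND absent from every other SSTable
            return nowSeconds - deletionTimeSeconds > gcGraceSeconds
                    && !keysInOtherSSTables.contains(key);
        }
    }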

2011/3/29 Karl Hiramoto <ka...@hiramoto.org>

> Would it be possible to improve the current compaction disk space issue by
> compacting only a few SSTables at a time and then immediately deleting the
> old ones?  Looking at the logs, it seems like deletions of old SSTables are
> taking longer than necessary.
>
> --
> Karl
>

Re: Compaction doubles disk space

Posted by Karl Hiramoto <ka...@hiramoto.org>.
Would it be possible to improve the current compaction disk space issue
by compacting only a few SSTables at a time and then immediately
deleting the old ones?  Looking at the logs, it seems like deletions of
old SSTables are taking longer than necessary.

--
Karl

Re: Compaction doubles disk space

Posted by Sylvain Lebresne <sy...@datastax.com>.
> BTW, given that compaction requires double the disk space, does it mean that I
> should never fill more than half of my total disk space?
> e.g. if I have 505GB of data on a 1TB disk, I cannot even delete any data at all.

It is not so black and white. What is true is that, in practice, reaching
half the disk should be a first alert, from which you should start to
monitor things more carefully to avoid problems.

There are two kinds of compaction, major and minor. Major compactions are
the ones that compact all the sstables for a given column family. Minor
compactions are the ones that are triggered automatically and regularly.
By definition they don't compact everything and thus don't need half your
disk space. Note however that over time, even a minor compaction can
require a fair amount of disk space and could very well require as much as
half the disk, but in practice that won't happen all the time.

The other thing is that even a major compaction only has to be applied to
one column family at a time. So unless you have only one CF, or 90% of your
data in one CF (and for the record, there's nothing wrong with that, it's
just not necessarily your case), you won't need exactly half your disk for
a compaction.
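
To make that concrete with the 505GB-on-1TB example from above: if, purely
for illustration, that data were split into a 400GB CF and a 105GB CF, a
major compaction of the larger CF needs roughly another 400GB free, and
505GB + 400GB = 905GB still fits on the 1TB disk. If instead all 505GB sat
in a single CF, a major compaction would need about 505GB more, and
505GB + 505GB = 1010GB does not fit.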

All this is to say that it is not as simple as: once you've reached half
your disk space you are necessarily doomed. Chances are you won't hit any
problem until you're, say, 70% full (or more). But there is no foolproof
number here, so as I said earlier, hitting 50% should be a first sign that
you may need a plan for the future.

--
Sylvain

Re: Compaction doubles disk space

Posted by Sheng Chen <ch...@gmail.com>.
From a previous thread on the same topic, I used a forced GC and the extra
space was released.

What about my second question?




2011/3/29 Sheng Chen <ch...@gmail.com>

> I use 'nodetool compact' command to start a compaction.
> I can understand that extra disk space is required during the compaction,
> but after the compaction, the extra space is not released.
>
> Before compaction:
> SSTable count: 10
> space used (live): 19G
> space used (total): 21G
>
> After compaction:
> sstable count: 1
> space used (live): 19G
> space used (total): 42G
>
>
> BTW, given that compaction requires double the disk space, does it mean that I
> should never fill more than half of my total disk space?
> e.g. if I have 505GB of data on a 1TB disk, I cannot even delete any data at
> all.
>
>
>
>
>