Posted to user@cassandra.apache.org by Anthony Molinaro <an...@alumni.caltech.edu> on 2010/04/10 21:24:38 UTC

Recovery from botched compaction

Hi,

  This is sort of a pre-emptive question, as the compaction I'm doing hasn't
failed yet but I expect it to any time now.  I have a cluster which has been
storing user profile data for a client.  Recently I've had to go back and
reload all the data again.  I wasn't watching disk space, and on one of the
nodes it went above 50% (which I recall was bad), to somewhere around 70%.
I expected to get most of it back with a compaction (as most of the data was
the same, so a compaction should remove the old copies), and went ahead and
started one with nodeprobe compact (using 0.5.0 on this cluster).  However,
I can see that the disk usage is growing (it's at 91% now).

So when the disk fills up and this compaction crashes, what can I do?
I assume getting a bigger disk, shutting down the node, moving the data,
and restarting will work, but do I have other options?
Which files can I ignore (i.e., can I skip moving the *-tmp-* files)?
Will my system be in a corrupt state?

This machine is one in a set of 6, and since I didn't choose tokens
initially, they are very lopsided (i.e., some use 20% of their disk, others
60-70%).  If I were to start moving tokens around, would the machines short
of space be able to anti-compact without filling up?  Or does anti-compaction,
like compaction, require 2x disk space?

Thanks,

-Anthony

-- 
------------------------------------------------------------------------
Anthony Molinaro                           <an...@alumni.caltech.edu>

Re: Recovery from botched compaction

Posted by Jonathan Ellis <jb...@gmail.com>.
On Tue, Apr 13, 2010 at 3:59 PM, Anthony Molinaro
<an...@alumni.caltech.edu> wrote:
> I actually got lucky: while it hovered at 91-95% full, the compaction
> finished, and it's now at 60%.  However, I still have around a dozen or so
> data files.  I thought 'nodeprobe compact' did a major compaction, and
> that a major compaction would shrink to one file?

Two possibilities, probably both of which are affecting you:

1. If there isn't enough disk space to compact everything, Cassandra
will remove files from the to-compact list until it has room to do
what you asked it to do; see the sketch below.  (But you can still
run out of space if you write enough data while the compaction
happens.)

2. 0.5's minor compactions don't combine as many sstables as they
should automatically.  This is fixed in 0.6.
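
Here's a rough sketch of that selection logic in Python, purely to
illustrate the idea (this is not the actual Cassandra code, and the
worst-case output-size estimate is an assumption):

    def files_that_fit(sstable_sizes, free_bytes):
        # Worst case, the compacted output is as large as the sum of
        # its inputs, so discard the largest candidate sstable until
        # the remaining set fits in the available space.
        candidates = sorted(sstable_sizes)  # sizes in bytes, ascending
        while candidates and sum(candidates) > free_bytes:
            candidates.pop()                # drop the largest
        return candidates

    # e.g. files_that_fit([10, 40, 300], 100) -> [10, 40]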

> Okay, sounds good, I may leave it for the moment, as last time I tried
> any sort of move/decommission with 0.5.x I was unable to figure out if
> anything was happening, so I may just wait and revisit when I upgrade.

Yes, 0.5 sucks there.  0.6 is still a little opaque, but you can at
least see what is happening if you know where to look:
http://wiki.apache.org/cassandra/Streaming

-Jonathan

Re: Recovery from botched compaction

Posted by Anthony Molinaro <an...@alumni.caltech.edu>.
On Tue, Apr 13, 2010 at 10:54:51AM -0500, Jonathan Ellis wrote:
> On Sat, Apr 10, 2010 at 2:24 PM, Anthony Molinaro
> <an...@alumni.caltech.edu> wrote:
> >  This is sort of a pre-emptive question, as the compaction I'm doing hasn't
> > failed yet but I expect it to any time now.  I have a cluster which has been
> > storing user profile data for a client.  Recently I've had to go back and
> > reload all the data again.  I wasn't watching disk space, and on one of the
> > nodes it went above 50% (which I recall was bad), to somewhere around 70%.
> > I expected to get most of it back with a compaction (as most of the data
> > was the same, so a compaction should remove the old copies), and went ahead
> > and started one with nodeprobe compact (using 0.5.0 on this cluster).
> > However, I can see that the disk usage is growing (it's at 91% now).
> 
> Right, it can't remove any old data until the compacted version is written.
> 
> (This is where the 50% recommendation comes from: worst case, the
> compacted version will take up exactly as much space as it did before,
> if there were no deletes or overwrites.)

I actually got lucky: while it hovered at 91-95% full, the compaction
finished, and it's now at 60%.  However, I still have around a dozen or so
data files.  I thought 'nodeprobe compact' did a major compaction, and
that a major compaction would shrink to one file?

> > So when the disk fills up and this compaction crashes, what can I do?
> > I assume getting a bigger disk, shutting down the node, moving the data,
> > and restarting will work, but do I have other options?
> > Which files can I ignore (i.e., can I skip moving the *-tmp-* files)?
> > Will my system be in a corrupt state?
> 
> It won't corrupt itself, and it will automatically remove tmp files when
> it starts up.
> 
> If the disk fills up entirely, the node will become unresponsive
> even for reads, which is something we plan to fix.
> (https://issues.apache.org/jira/browse/CASSANDRA-809)
> 
> Otherwise there isn't a whole lot you can do about the "I need to put
> more data on my machine than I have room for" scenario.

Got it, I already went ahead and added a few EBS volumes, RAID-0'd them, and
transferred the data over to them.  I was happy to recall that if I turned
off writes (not hard, as writes are all bulk on this cluster), the disk
files never change, so I was able to rsync while serving reads :)
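
In case it helps anyone else, the copy boiled down to something like this
(a sketch; /var/lib/cassandra/data is the default data directory, and the
EBS mount point here is hypothetical):

    import subprocess

    SRC = "/var/lib/cassandra/data/"  # default data directory
    DST = "/mnt/ebs/cassandra/data/"  # hypothetical RAID-0 EBS mount

    # With writes off, no new sstables appear and the existing ones
    # never change, so one rsync taken while serving reads is enough.
    # The *-tmp-* files get discarded at startup, so skip them.
    subprocess.check_call(
        ["rsync", "-a", "--exclude", "*-tmp-*", SRC, DST])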

> > This machine is one in a set of 6, and since I didn't choose tokens
> > initially, they are very lopsided (i.e., some use 20% of their disk, others
> > 60-70%).  If I were to start moving tokens around, would the machines short
> > of space be able to anti-compact without filling up?  Or does
> > anti-compaction, like compaction, require 2x disk space?
> 
> Anticompaction requires as much space as the data being transferred,
> so the worst case of transferring 100% off would require 2x.
> 
> https://issues.apache.org/jira/browse/CASSANDRA-579 will fix this for
> the anticompaction case.

Okay, sounds good, I may leave it for the moment, as last time I tried
any sort of move/decommission with 0.5.x I was unable to figure out if
anything was happening, so I may just wait and revisit when I upgrade.

Thanks for the answers,

-Anthony

-- 
------------------------------------------------------------------------
Anthony Molinaro                           <an...@alumni.caltech.edu>

Re: Recovery from botched compaction

Posted by Jonathan Ellis <jb...@gmail.com>.
On Sat, Apr 10, 2010 at 2:24 PM, Anthony Molinaro
<an...@alumni.caltech.edu> wrote:
>  This is sort of a pre-emptive question, as the compaction I'm doing hasn't
> failed yet but I expect it to any time now.  I have a cluster which has been
> storing user profile data for a client.  Recently I've had to go back and
> reload all the data again.  I wasn't watching disk space, and on one of the
> nodes it went above 50% (which I recall was bad), to somewhere around 70%.
> I expected to get most of it back with a compaction (as most of the data was
> the same, so a compaction should remove the old copies), and went ahead and
> started one with nodeprobe compact (using 0.5.0 on this cluster).  However,
> I can see that the disk usage is growing (it's at 91% now).

Right, it can't remove any old data until the compacted version is written.

(This is where the 50% recommendation comes from: worst case, the
compacted version will take up exactly as much space as it did before,
if there were no deletes or overwrites.)
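
If you want to sanity-check that headroom before kicking off a major
compaction, something like this is enough (a sketch; assumes a single
data directory at the default /var/lib/cassandra/data):

    import os

    DATA_DIR = "/var/lib/cassandra/data"  # assumed default location

    def headroom_ok(data_dir=DATA_DIR):
        # Worst case, a major compaction rewrites every byte before
        # the old files are deleted, so free space should cover the
        # current on-disk data size.
        live = sum(os.path.getsize(os.path.join(root, name))
                   for root, _, names in os.walk(data_dir)
                   for name in names)
        stats = os.statvfs(data_dir)
        return stats.f_bavail * stats.f_frsize >= live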

> So when the disk fills up and this compaction crashes, what can I do?
> I assume getting a bigger disk, shutting down the node, moving the data,
> and restarting will work, but do I have other options?
> Which files can I ignore (i.e., can I skip moving the *-tmp-* files)?
> Will my system be in a corrupt state?

It won't corrupt itself, and it will automatically remove tmp files when
it starts up.
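
So if you're relocating data by hand, you can leave the tmp files behind;
a quick way to see what's skippable (a sketch; assumes the default data
directory with one subdirectory per keyspace):

    import glob
    import os

    DATA_DIR = "/var/lib/cassandra/data"  # assumed default location

    # *-tmp-* files are partially written compaction output; Cassandra
    # deletes them at startup, so they never need to be copied.
    for path in glob.glob(os.path.join(DATA_DIR, "*", "*-tmp-*")):
        print("safe to skip:", path)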

If the disk fills up entirely, the node will become unresponsive
even for reads, which is something we plan to fix.
(https://issues.apache.org/jira/browse/CASSANDRA-809)

Otherwise there isn't a whole lot you can do about the "I need to put
more data on my machine than I have room for" scenario.

> This machine is one in a set of 6, and since I didn't choose tokens
> initially, they are very lopsided (i.e., some use 20% of their disk, others
> 60-70%).  If I were to start moving tokens around, would the machines short
> of space be able to anti-compact without filling up?  Or does
> anti-compaction, like compaction, require 2x disk space?

Anticompaction requires as much space as the data being transferred,
so the worst case of transferring 100% off would require 2x.
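
To put hypothetical numbers on it:

    node_data_gb = 300  # data currently on the node
    moving_gb = 300     # worst case: the token move transfers all of it
    peak_gb = node_data_gb + moving_gb  # original + anticompacted copy
    print(peak_gb)      # 600, i.e. 2x the starting footprint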

https://issues.apache.org/jira/browse/CASSANDRA-579 will fix this for
the anticompaction case.

-Jonathan