Posted to user@cassandra.apache.org by Roman Tkachenko <ro...@mailgunhq.com> on 2015/04/11 01:00:18 UTC

Moving SSTables from one disk to another

Hey guys,

We're running Cassandra with two data directories, let's say
/data/sstables1 and /data/sstables2, which are in fact two separate (but
identical) disks. The problem is that the disk where "sstables2" is mounted
is running out of space and large SSTables stored there cannot be compacted.

So I have two questions:

* Can I just move some SSTables data files from "sstables2" to "sstables1"
which has much more free disk space? Will Cassandra start fine after that
and not lose any data?

* Provided multiple data dirs, should Cassandra distribute data equally
between them? In what I'm observing this is almost always not true. On that
particular node I mentioned above the difference is huge: 4% occupied disk
space for "sstables1" and 87% for "sstables2"; on other nodes the situation
is a little better but still not 50/50.
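
A minimal sketch for putting numbers on that imbalance, assuming each data
directory sits on its own filesystem. The function names are hypothetical
and the paths are the ones used as examples in this message; it only reads
filesystem statistics and does not touch Cassandra.

```python
import shutil


def percent_used(total_bytes, used_bytes):
    """Return used space as a percentage of total, rounded to one decimal."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return round(100.0 * used_bytes / total_bytes, 1)


def report(data_dirs):
    """Map each data directory to the usage percentage of its filesystem."""
    result = {}
    for d in data_dirs:
        usage = shutil.disk_usage(d)  # named tuple: total, used, free
        result[d] = percent_used(usage.total, usage.used)
    return result
```

Calling report(["/data/sstables1", "/data/sstables2"]) on each node gives
one percentage per directory, which makes the skew easy to compare across
the cluster.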

Thanks!

Roman

Re: Moving SSTables from one disk to another

Posted by Robert Coli <rc...@eventbrite.com>.

On Mon, Nov 30, 2015 at 11:29 AM, S C <as...@outlook.com> wrote:

> It is inevitable that the repairs are needed to keep consistency
> guarantees. Is it worthwhile to consider RAID-0 as we get more storage? One
> can treat loss of disk as loss of node and rebuild the node and repair. Any
> other suggestions are most welcome.
>
It depends on whether you consider decreasing "unique replica count" to be
acceptable.

Rebuilding node C from the contents of A and B by definition loses any data
that was only successfully written to C.

In practice, the Coli Conjecture suggests you probably don't care about
decreasing unique replica count if you're, for example, using
ConsistencyLevel.ONE...

=Rob

Re: Moving SSTables from one disk to another

Posted by S C <as...@outlook.com>.

Rob,


It is inevitable that the repairs are needed to keep consistency guarantees. Is it worthwhile to consider RAID-0 as we get more storage? One can treat loss of disk as loss of node and rebuild the node and repair. Any other suggestions are most welcome.


-Sri
________________________________
From: Robert Coli <rc...@eventbrite.com>
Sent: Friday, April 10, 2015 6:51 PM
To: user@cassandra.apache.org
Subject: Re: Moving SSTables from one disk to another

On Fri, Apr 10, 2015 at 4:30 PM, Jonathan Haddad <jo...@jonhaddad.com> wrote:

> However, it was pointed out to me that
> https://issues.apache.org/jira/browse/CASSANDRA-6696 will be a better
> solution in a lot of cases.

Thank you for the interesting link about a theoretical usage which would make JBOD worth using.

But I really don't understand why we consider the use of the current JBOD ok, when:

"In JBOD, when someone gets a bad drive, the bad drive is replaced with a new empty one and repair is run. This can cause deleted data to come back in some cases."

This class of issue is permanently fatal to consistency for the affected data.

Why are we encouraging people to expose themselves to this class of issue? What benefit do they get from current JBOD implementation that is worth this risk to consistency?

Yes, it's true that if an operator in this case never creates tombstones or never runs repair after losing only one disk, they're not exposed to the risk. But when they configure JBOD, the entire point is that they hope to run repair after losing only one disk, instead of rebuilding the entire node. The status quo seems to set up operators for failure when they attempt to do what the feature claims to be useful for.

I don't get "features" like this: questionable benefit, measurable risk, known serious issues and yet they sit there in the product for years on end, daring someone to use them...

=Rob

Re: Moving SSTables from one disk to another

Posted by Robert Coli <rc...@eventbrite.com>.

On Fri, Apr 10, 2015 at 4:30 PM, Jonathan Haddad <jo...@jonhaddad.com> wrote:

> However, it was pointed out to me that
> https://issues.apache.org/jira/browse/CASSANDRA-6696 will be a better
> solution in a lot of cases.

Thank you for the interesting link about a theoretical usage which would
make JBOD worth using.

But I really don't understand why we consider the use of the current JBOD
ok, when:

"In JBOD, when someone gets a bad drive, the bad drive is replaced with a
new empty one and repair is run. This can cause deleted data to come back
in some cases."

This class of issue is permanently fatal to consistency for the affected
data.
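
To make the quoted failure mode concrete, here is a toy model of it. This
is an illustration under stated assumptions, not Cassandra's actual storage
or repair code: each disk is a dict of key -> (timestamp, value), the
newest timestamp wins, and "repair" simply gives every replica the newest
cell seen anywhere. All names are hypothetical. The scenario assumes the
write and the later delete landed on different disks of node C, and that
replicas A and B have already purged both after gc_grace.

```python
TOMBSTONE = object()  # marker for a deleted cell


def node_view(disks):
    """Merge all disks of one node: per key, the newest timestamp wins."""
    merged = {}
    for disk in disks:
        for key, (ts, value) in disk.items():
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)
    return merged


def repair(views):
    """Toy repair: every replica ends up with the newest cell on any replica."""
    return node_view(views)


def read(view, key):
    """Return the live value for key, or None if absent or deleted."""
    cell = view.get(key)
    if cell is None or cell[1] is TOMBSTONE:
        return None
    return cell[1]


# Node C (JBOD): the value is on disk1, the later tombstone on disk2.
c_disk1 = {"k": (1, "v")}
c_disk2 = {"k": (2, TOMBSTONE)}
# Replicas A and B no longer hold the value or the tombstone.
a_view, b_view = {}, {}

# While both disks are healthy, the tombstone shadows the old value.
before = repair([a_view, b_view, node_view([c_disk1, c_disk2])])
# disk2 fails, is replaced with an empty disk, and repair is run.
after = repair([a_view, b_view, node_view([c_disk1, {}])])
```

In this model read(before, "k") returns None, while read(after, "k")
returns "v": the only copy of the tombstone went away with the failed
disk, so repair spreads the deleted value back to every replica.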

Why are we encouraging people to expose themselves to this class of issue?
What benefit do they get from current JBOD implementation that is worth
this risk to consistency?

Yes, it's true that if an operator in this case never creates tombstones or
never runs repair after losing only one disk, they're not exposed to the
risk. But when they configure JBOD, the entire point is that they hope to
run repair after losing only one disk, instead of rebuilding the entire
node. The status quo seems to set up operators for failure when they
attempt to do what the feature claims to be useful for.

I don't get "features" like this: questionable benefit, measurable risk,
known serious issues and yet they sit there in the product for years on
end, daring someone to use them...

=Rob

Re: Moving SSTables from one disk to another

Posted by Jonathan Haddad <jo...@jonhaddad.com>.

I had submitted this issue which could have had (in theory) some
serious performance benefit when using JBOD:
https://issues.apache.org/jira/browse/CASSANDRA-8868

However, it was pointed out to me that
https://issues.apache.org/jira/browse/CASSANDRA-6696 will be a better
solution in a lot of cases.

On Fri, Apr 10, 2015 at 4:13 PM, Robert Coli <rc...@eventbrite.com> wrote:
> On Fri, Apr 10, 2015 at 4:00 PM, Roman Tkachenko <ro...@mailgunhq.com>
> wrote:
>>
>> * Can I just move some SSTables data files from "sstables2" to "sstables1"
>> which has much more free disk space? Will Cassandra start fine after that
>> and not lose any data?
>
>
> Cassandra generally discovers files in its data directories and treats them
> as legitimate files. I do not have specific knowledge of JBOD behavior here,
> but I would presume it would be the same.
>
>>
>> * Provided multiple data dirs, should Cassandra distribute data equally
>> between them? In what I'm observing this is almost always not true. On that
>> particular node I mentioned above the difference is huge: 4% occupied disk
>> space for "sstables1" and 87% for "sstables2"; on other nodes the situation
>> is a little better but still not 50/50.
>
>
> No, and especially not when using Size Tiered Compaction.
>
> I honestly wonder why people think JBOD is a useful feature for Cassandra.
> You don't really want to continue to operate a node that has lost half of
> its data, and managing multiple data directories seems relatively likely to
> be more trouble than it's worth. You have a distributed, replicated
> database... just replace nodes when they fail. Anyone care to set me
> straight about the amazing benefits they see which make the costs
> worthwhile?
>
> =Rob
>



-- 
Jon Haddad
http://www.rustyrazorblade.com
twitter: rustyrazorblade

Re: Moving SSTables from one disk to another

Posted by Robert Coli <rc...@eventbrite.com>.

On Fri, Apr 10, 2015 at 4:00 PM, Roman Tkachenko <ro...@mailgunhq.com>
wrote:

> * Can I just move some SSTables data files from "sstables2" to "sstables1"
> which has much more free disk space? Will Cassandra start fine after that
> and not lose any data?
>

Cassandra generally discovers files in its data directories and treats them
as legitimate files. I do not have specific knowledge of JBOD behavior
here, but I would presume it would be the same.
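
If one does try the move, a cautious approach might look like the sketch
below. It rests on several assumptions that are not confirmed anywhere in
this thread: the node is stopped first, the destination is the matching
keyspace/table subdirectory under the other data directory, and every
component of an SSTable (Data.db, Index.db, and so on) shares the same
file name up to the last "-". The helper names are hypothetical, and the
default is a dry run that only returns the plan.

```python
import shutil
from collections import defaultdict
from pathlib import Path


def group_components(table_dir):
    """Group files by SSTable base name (everything before the last '-')."""
    groups = defaultdict(list)
    for f in sorted(Path(table_dir).iterdir()):
        if f.is_file() and "-" in f.name:
            base, _component = f.name.rsplit("-", 1)
            groups[base].append(f)
    return dict(groups)


def move_sstable(base, src_dir, dst_dir, dry_run=True):
    """Move every component of one SSTable from src_dir to dst_dir.

    Returns the list of (source, destination) pairs. Nothing is moved
    when dry_run is True.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    components = group_components(src_dir).get(base, [])
    if not any(f.name.endswith("-Data.db") for f in components):
        raise ValueError(f"no Data.db component found for {base!r}")
    plan = [(f, dst_dir / f.name) for f in components]
    for _src, dst in plan:
        if dst.exists():
            raise FileExistsError(f"{dst} already exists; refusing to overwrite")
    if not dry_run:
        dst_dir.mkdir(parents=True, exist_ok=True)
        for src, dst in plan:
            shutil.move(str(src), str(dst))
    return plan
```

Moving all components together matters because a Data.db file without its
companions is not a complete SSTable. Checking the returned plan from a dry
run before passing dry_run=False keeps the operation reviewable.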

> * Provided multiple data dirs, should Cassandra distribute data equally
> between them? In what I'm observing this is almost always not true. On that
> particular node I mentioned above the difference is huge: 4% occupied disk
> space for "sstables1" and 87% for "sstables2"; on other nodes the situation
> is a little better but still not 50/50.
>

No, and especially not when using Size Tiered Compaction.

I honestly wonder why people think JBOD is a useful feature for Cassandra.
You don't really want to continue to operate a node that has lost half of
its data, and managing multiple data directories seems relatively likely to
be more trouble than it's worth. You have a distributed, replicated
database... just replace nodes when they fail. Anyone care to set me
straight about the amazing benefits they see which make the costs
worthwhile?

=Rob