Posted to user@cassandra.apache.org by Anand Somani <me...@gmail.com> on 2011/02/07 20:33:32 UTC

Best way to detect/fix bitrot today?

Hi,

Our application space is such that there is data that might not be read for
a long time. The data is mostly immutable. How should I approach
detecting/solving the bitrot problem? One approach is read data and let read
repair do the detection, but given the size of data, that does not look very
efficient.

Has anybody solved/workaround this or has any other suggestions to detect
and fix bitrot?


Thanks
Anand

Re: Best way to detect/fix bitrot today?

Posted by Peter Schuller <pe...@infidyne.com>.
> Some RAID storage might do it, potentially more efficiently!!

People keep claiming that but I have yet to confirm that a hardware
raid does actual checksumming as opposed to just healing bad blocks.
But yes, they might :)

> Food for thought, or wild imagination ?

That was my intent. Checksumming at the sstable level would allow
detection of corruption, and regular repair and read-repair would
provide self-healing.
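
To illustrate the idea, here is a minimal sketch (not Cassandra's actual format; the block size and names are made up) of what sstable-level checksumming amounts to: store a digest with each block on write, verify it on read, so corruption is detected rather than silently served:

```python
import hashlib

BLOCK_SIZE = 65536  # hypothetical block size; real sstable formats differ

def write_blocks(data: bytes) -> list:
    """Split data into blocks, storing a SHA-256 digest in front of each."""
    blocks = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        blocks.append(hashlib.sha256(block).digest() + block)
    return blocks

def read_block(stored: bytes) -> bytes:
    """Verify a stored block against its digest before returning it."""
    digest, block = stored[:32], stored[32:]
    if hashlib.sha256(block).digest() != digest:
        raise IOError("checksum mismatch: block is corrupt")
    return block
```

On a checksum mismatch the node knows the local copy is bad, and repair can fetch an intact replica.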

-- 
/ Peter Schuller

Re: Best way to detect/fix bitrot today?

Posted by Anthony John <ch...@gmail.com>.
Some RAID storage might do it, potentially more efficiently!!

Rhetorical question - Does Cassandra's architecture of reconciling reads
over multiple copies of the same data provide an even more interesting
answer? I submit - YES!

All bitrot protection mechanisms involve some element of redundant storage -
to verify and reconstruct any rot. Cassandra can do this on JBODs with the
appropriate Replication Factor (say > 3). Granted, the total storage in
terms of number of disks might exceed the other alternatives, but at the
lowest tier, using JBODs, the cost might actually be lower.

Food for thought, or wild imagination ?

-JA

On Mon, Feb 7, 2011 at 2:09 PM, Peter Schuller
<pe...@infidyne.com>wrote:

> > Our application space is such that there is data that might not be read
> for
> > a long time. The data is mostly immutable. How should I approach
> > detecting/solving the bitrot problem? One approach is read data and let
> read
> > repair do the detection, but given the size of data, that does not look
> very
> > efficient.
>
> Note that read-repair is not really intended to repair arbitrary
> corruption. Unless I'm mistaken, when arbitrary corruption does not
> trigger a serialization failure that causes row skipping, it's a
> toss-up which version of the data is retained (or both, if the
> corruption is in the key). Given the same key and column timestamp,
> the tie breaker is the column value. So depending on whether
> corruption results in a "lesser" or "greater" value, you might get the
> corrupt or the non-corrupt data.
>
> > Has anybody solved/workaround this or has any other suggestions to detect
> > and fix bitrot?
>
> My feel/tentative opinion is that the clean fix is for Cassandra to
> support strong checksumming at the sstable level.
>
> Deploying on e.g. ZFS would help a lot with this, but that's a problem
> for deployment on Linux (which is the recommended platform for
> Cassandra).
>
> --
> / Peter Schuller
>

Re: Best way to detect/fix bitrot today?

Posted by Peter Schuller <pe...@infidyne.com>.
> I should have clarified we have 3 copies, so in that case as long as 2 match
> we should be ok?

As far as I can think of, no. Whatever the reconciliation of two
columns results in is what the cluster is expected to converge to. So
in the case of identical keys and mismatched values, tie breaking is
the deciding factor. There is no "global" comparison/voting process
between all nodes in the replica set for the row.
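
A rough sketch of that reconciliation rule (a simplification, not Cassandra's actual code) shows why three copies don't vote: reconciliation is pairwise, so a corrupt value that happens to compare "greater" can win the tie even when the other two copies agree:

```python
from functools import reduce

def reconcile(a, b):
    """Pairwise reconciliation of (timestamp, value) columns: the highest
    timestamp wins; on a timestamp tie, the lexically greater value wins
    (a simplification of Cassandra's tie breaker)."""
    (ts_a, val_a), (ts_b, val_b) = a, b
    if ts_a != ts_b:
        return a if ts_a > ts_b else b
    return a if val_a >= val_b else b

# Three replicas, same timestamp; one copy has rotted to a "greater" value.
replicas = [(1, b"good"), (1, b"good"), (1, b"zzz-corrupt")]

# There is no 2-of-3 vote: the corrupt copy wins each pairwise tie break.
winner = reduce(reconcile, replicas)
```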

> Even if there were checksumming at the SSTable level, I assume it has to
> check and report these errors on compaction (or node repair)?

I believe it would minimally work to just skip the corrupt data
(regular repair would do the rest). That said, there may be reasons to
do more. For example, if you have a cluster that relies on QUORUM
consistency, then silently dropping data actively discovered to be
corrupt could be considered a consistency violation.

> I have seen some open JIRA issues on this (47 and 1717), but if I need
> something today, is a read repair (or a node repair) the only viable
> option?

repair is needed anyway (unless your use case is very unusual: no
deletes, no updates to pre-existing rows). But again, to be clear,
neither repair nor read repair is primarily intended to address
arbitrary data corruption; rather, they exist to reach eventual
consistency in the cluster (after writes were dropped, a node went
down, etc.).

-- 
/ Peter Schuller

Re: Best way to detect/fix bitrot today?

Posted by Peter Schuller <pe...@infidyne.com>.
> One thing that we're doing for (guaranteed) immutable data is to use MD5
> signatures as keys... this will also prevent duplication, and it will make
> detection (if not correction) of bitrot easy at the app level.

Yes. Another option is to checksum keys and/or values themselves by
effectively encoding each in a self-verifying format. But that makes
the data a lot more opaque to tools/humans.
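
As a sketch of that self-verifying-format idea (a hypothetical encoding, not anything Cassandra provides): prepend a CRC32 to each value and verify it on every decode. The opacity trade-off mentioned above is visible here, since the stored bytes are no longer the plain value:

```python
import struct
import zlib

def encode(value: bytes) -> bytes:
    """Wrap a value in a self-verifying envelope: 4-byte CRC32 + payload."""
    return struct.pack(">I", zlib.crc32(value)) + value

def decode(blob: bytes) -> bytes:
    """Unwrap a value, raising if the payload no longer matches its CRC."""
    crc = struct.unpack(">I", blob[:4])[0]
    payload = blob[4:]
    if zlib.crc32(payload) != crc:
        raise ValueError("value failed checksum: possible bitrot")
    return payload
```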

Also consider that arbitrary data corruption could have effects other
than modifying a value or a key. I'm not sure the row skipping on
deserialization failures is good enough to handle absolutely arbitrary
corruption (anyone?).

-- 
/ Peter Schuller

Re: Best way to detect/fix bitrot today?

Posted by Shaun Cutts <sh...@cuttshome.net>.
One thing that we're doing for (guaranteed) immutable data is to use MD5 signatures as keys... this will also prevent duplication, and it will make detection (if not correction) of bitrot easy at the app level.
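
A minimal sketch of that content-addressing approach (function names are illustrative): the key is the MD5 of the immutable value, so a reader can always recompute the digest and compare, and identical blobs collapse to one key, giving the deduplication for free:

```python
import hashlib

def content_key(value: bytes) -> str:
    """Derive the row key from the (immutable) content itself."""
    return hashlib.md5(value).hexdigest()

def verify(key: str, value: bytes) -> bool:
    """On read, recompute the digest; a mismatch means the stored value
    has rotted relative to the content its key was derived from."""
    return hashlib.md5(value).hexdigest() == key
```

Note this is detection only: fixing a bad copy still requires an intact replica, as discussed elsewhere in the thread.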

On Feb 8, 2011, at 9:23 AM, Anand Somani wrote:

> I should have clarified: we have 3 copies, so in that case, as long as 2 match we should be ok?
> 
> Even if there were checksumming at the SSTable level, I assume it would have to check and report these errors on compaction (or node repair)?
> 
> I have seen some open JIRA issues on this (47 and 1717), but if I need something today, is a read repair (or a node repair) the only viable option?
> 
>  
> 
> On Mon, Feb 7, 2011 at 12:09 PM, Peter Schuller <pe...@infidyne.com> wrote:
> > Our application space is such that there is data that might not be read for
> > a long time. The data is mostly immutable. How should I approach
> > detecting/solving the bitrot problem? One approach is read data and let read
> > repair do the detection, but given the size of data, that does not look very
> > efficient.
> 
> Note that read-repair is not really intended to repair arbitrary
> corruption. Unless I'm mistaken, when arbitrary corruption does not
> trigger a serialization failure that causes row skipping, it's a
> toss-up which version of the data is retained (or both, if the
> corruption is in the key). Given the same key and column timestamp,
> the tie breaker is the column value. So depending on whether
> corruption results in a "lesser" or "greater" value, you might get the
> corrupt or the non-corrupt data.
> 
> > Has anybody solved/workaround this or has any other suggestions to detect
> > and fix bitrot?
> 
> My feel/tentative opinion is that the clean fix is for Cassandra to
> support strong checksumming at the sstable level.
> 
> Deploying on e.g. ZFS would help a lot with this, but that's a problem
> for deployment on Linux (which is the recommended platform for
> Cassandra).
> 
> --
> / Peter Schuller
> 


Re: Best way to detect/fix bitrot today?

Posted by Anand Somani <me...@gmail.com>.
I should have clarified: we have 3 copies, so in that case, as long as 2
match we should be ok?

Even if there were checksumming at the SSTable level, I assume it would
have to check and report these errors on compaction (or node repair)?

I have seen some open JIRA issues on this (47 and 1717), but if I need
something today, is a read repair (or a node repair) the only viable
option?



On Mon, Feb 7, 2011 at 12:09 PM, Peter Schuller <peter.schuller@infidyne.com
> wrote:

> > Our application space is such that there is data that might not be read
> for
> > a long time. The data is mostly immutable. How should I approach
> > detecting/solving the bitrot problem? One approach is read data and let
> read
> > repair do the detection, but given the size of data, that does not look
> very
> > efficient.
>
> Note that read-repair is not really intended to repair arbitrary
> corruption. Unless I'm mistaken, when arbitrary corruption does not
> trigger a serialization failure that causes row skipping, it's a
> toss-up which version of the data is retained (or both, if the
> corruption is in the key). Given the same key and column timestamp,
> the tie breaker is the column value. So depending on whether
> corruption results in a "lesser" or "greater" value, you might get the
> corrupt or the non-corrupt data.
>
> > Has anybody solved/workaround this or has any other suggestions to detect
> > and fix bitrot?
>
> My feel/tentative opinion is that the clean fix is for Cassandra to
> support strong checksumming at the sstable level.
>
> Deploying on e.g. ZFS would help a lot with this, but that's a problem
> for deployment on Linux (which is the recommended platform for
> Cassandra).
>
> --
> / Peter Schuller
>

Re: Best way to detect/fix bitrot today?

Posted by Peter Schuller <pe...@infidyne.com>.
> Our application space is such that there is data that might not be read for
> a long time. The data is mostly immutable. How should I approach
> detecting/solving the bitrot problem? One approach is read data and let read
> repair do the detection, but given the size of data, that does not look very
> efficient.

Note that read-repair is not really intended to repair arbitrary
corruption. Unless I'm mistaken, when arbitrary corruption does not
trigger a serialization failure that causes row skipping, it's a
toss-up which version of the data is retained (or both, if the
corruption is in the key). Given the same key and column timestamp,
the tie breaker is the column value. So depending on whether
corruption results in a "lesser" or "greater" value, you might get the
corrupt or the non-corrupt data.

> Has anybody solved/workaround this or has any other suggestions to detect
> and fix bitrot?

My feel/tentative opinion is that the clean fix is for Cassandra to
support strong checksumming at the sstable level.

Deploying on e.g. ZFS would help a lot with this, but that's a problem
for deployment on Linux (which is the recommended platform for
Cassandra).

-- 
/ Peter Schuller