You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@cassandra.apache.org by "Jonathan Ellis (JIRA)" <ji...@apache.org> on 2009/04/17 21:14:14 UTC
[jira] Issue Comment Edited: (CASSANDRA-33) Bugs in tombstone handling in remove code

    [ https://issues.apache.org/jira/browse/CASSANDRA-33?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12700289#action_12700289 ] 

Jonathan Ellis edited comment on CASSANDRA-33 at 4/17/09 12:12 PM:
-------------------------------------------------------------------

fixed tests to make it more obvious what should be happening (in 4-and-5-v2).

      was (Author: jbellis):
    fixed tests to make it more obvious what should be happening.
  
> Bugs in tombstone handling in remove code
> -----------------------------------------
>
>                 Key: CASSANDRA-33
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-33
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>             Fix For: 0.3
>
>         Attachments: 0001-preserve-tombstones-until-a-GC-grace-period-has-elap.patch, 0002-omit-tombstones-from-column_t-and-supercolumn_t-retu.patch, 0003-make-GC_GRACE_IN_SECONDS-customizable-in-storage.con.patch, 0004-and-5-v2.patch, 0004_expose_remove_bug.patch, 0005_fix_exposed_remove_bug.patch
>
>
> [copied from dev list]
> Avinash pointed out two bugs in my remove code.  One is easy to fix,
> the other is tougher.
> The easy one is that my code removes tombstones (deletion markers) at
> the ColumnFamilyStore level, so when CassandraServer does read repair
> it will not know about the tombstones and they will not be replicated
> correctly.  This can be fixed by simply moving the removeDeleted call
> up to just before CassandraServer's final return-to-client.
> The hard one is that tombstones are problematic on GC (that is, major
> compaction of SSTables, to use the Bigtable paper terminology).
> One failure scenario: Node A, B, and C replicate some data.  C goes
> down.  The data is deleted.  A and B delete it and later GC it.  C
> comes back up.  C now has the only copy of the data so on read repair
> the stale data will be sent to A and B.
> A solution: pick a number N such that we are confident that no node
> will be down (and catch up on hinted handoffs) for longer than N days.
>  (Default value: 10?)  Then, no node may GC tombstones before N days
> have elapsed.  Also, after N days, tombstones will no longer be read
> repaired.  (This prevents a node which has not yet GC'd from sending a
> new tombstone copy to a node that has already GC'd.)
> Implementation detail: we'll need to add a 32-bit "time of tombstone"
> to ColumnFamily and SuperColumn.  (For Column we can stick it in the
> byte[] value, since we already have an unambiguous way to know if the
> Column is in a deleted state.)  We only need 32 bits since the time
> frame here is sufficiently granular that we don't need ms.  Also, we
> will use the system clock for these values, not the client timestamp,
> since we don't know what the source of the client timestamps is.
> Admittedly this is suboptimal compared to being able to GC immediately
> but it has the virtue of being (a) easily implemented, (b) with no
> extra components such as a coordination protocol, and (c) better than
> not GCing tombstones at all (the other easy way to ensure
> correctness).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.