Posted to commits@cassandra.apache.org by "Jonathan Ellis (JIRA)" <ji...@apache.org> on 2009/04/17 22:10:16 UTC
[jira] Resolved: (CASSANDRA-33) Bugs in tombstone handling in remove code
[ https://issues.apache.org/jira/browse/CASSANDRA-33?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jonathan Ellis resolved CASSANDRA-33.
-------------------------------------
Resolution: Fixed
committed
> Bugs in tombstone handling in remove code
> -----------------------------------------
>
> Key: CASSANDRA-33
> URL: https://issues.apache.org/jira/browse/CASSANDRA-33
> Project: Cassandra
> Issue Type: Bug
> Reporter: Jonathan Ellis
> Assignee: Jonathan Ellis
> Fix For: 0.3
>
> Attachments: 0001-preserve-tombstones-until-a-GC-grace-period-has-elap.patch, 0002-omit-tombstones-from-column_t-and-supercolumn_t-retu.patch, 0003-make-GC_GRACE_IN_SECONDS-customizable-in-storage.con.patch, 0004-and-5-v2.patch, 0004_expose_remove_bug.patch, 0005_fix_exposed_remove_bug.patch
>
>
> [copied from dev list]
> Avinash pointed out two bugs in my remove code. One is easy to fix,
> the other is tougher.
> The easy one is that my code removes tombstones (deletion markers) at
> the ColumnFamilyStore level, so when CassandraServer does read repair
> it will not know about the tombstones and they will not be replicated
> correctly. This can be fixed by simply moving the removeDeleted call
> up to just before CassandraServer's final return-to-client.
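
To illustrate the ordering described above, here is a minimal sketch. SketchColumn, ReadPathSketch, readFromStore and readForClient are hypothetical names invented for this example, and removeDeleted is a simplified stand-in that only borrows the name used in the description. None of this is Cassandra's actual code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified column; not Cassandra's actual Column class.
final class SketchColumn {
    final String name;
    final String value;     // null for a tombstone
    final long timestamp;   // client-supplied write timestamp
    final boolean deleted;  // true marks a tombstone (deletion marker)

    SketchColumn(String name, String value, long timestamp, boolean deleted) {
        this.name = name;
        this.value = value;
        this.timestamp = timestamp;
        this.deleted = deleted;
    }
}

final class ReadPathSketch {
    // Store-level read: returns live columns AND tombstones, unfiltered,
    // so that replicas can still be compared on deletions.
    static List<SketchColumn> readFromStore(List<SketchColumn> stored) {
        return new ArrayList<>(stored);
    }

    // Simplified stand-in for removeDeleted: drops every tombstone.
    static List<SketchColumn> removeDeleted(List<SketchColumn> columns) {
        List<SketchColumn> live = new ArrayList<>();
        for (SketchColumn c : columns) {
            if (!c.deleted) {
                live.add(c);
            }
        }
        return live;
    }

    // Server-level read: tombstones are stripped only as the last step.
    static List<SketchColumn> readForClient(List<SketchColumn> stored) {
        List<SketchColumn> raw = readFromStore(stored);
        // Read repair would compare `raw` against the other replicas here,
        // while the tombstones are still visible.
        return removeDeleted(raw);
    }
}
```

The point of the sketch is only the position of the filter: anything that runs before the final return still sees the deletion markers.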
> The hard one is that tombstones are problematic on GC (that is, major
> compaction of SSTables, to use the Bigtable paper terminology).
> One failure scenario: Nodes A, B, and C replicate some data. C goes
> down. The data is deleted: A and B write tombstones and later GC
> them. C comes back up. C now has the only copy of the data, so on
> read repair the stale data will be sent back to A and B.
> A solution: pick a number N such that we are confident that no node
> will take longer than N days to come back up and catch up on hinted
> handoffs.
> (Default value: 10?) Then, no node may GC tombstones before N days
> have elapsed. Also, after N days, tombstones will no longer be read
> repaired. (This prevents a node which has not yet GC'd from sending a
> new tombstone copy to a node that has already GC'd.)
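
The two rules above can be sketched as a pair of predicates. GcGraceSketch and its method names are hypothetical. GC_GRACE_IN_SECONDS borrows its name from the attached patch, but the 10-day value is only the tentative default suggested above. Treating "exactly N days" as elapsed is an assumption of this sketch, since the description does not fix the boundary.

```java
// Hypothetical sketch of the grace-period rules; not the committed patch.
final class GcGraceSketch {
    // Assumed value: the tentative 10-day default, expressed in seconds.
    static final int GC_GRACE_IN_SECONDS = 10 * 24 * 60 * 60;

    // True once at least gcGraceSeconds have passed since the deletion.
    // Subtraction is done in long arithmetic to avoid int overflow.
    static boolean graceElapsed(int deletedAtSeconds, int nowSeconds, int gcGraceSeconds) {
        return (long) nowSeconds - (long) deletedAtSeconds >= gcGraceSeconds;
    }

    // Rule 1: a tombstone may be GC'd only after the grace period.
    static boolean mayCollect(int deletedAtSeconds, int nowSeconds, int gcGraceSeconds) {
        return graceElapsed(deletedAtSeconds, nowSeconds, gcGraceSeconds);
    }

    // Rule 2: a tombstone is read repaired only during the grace period,
    // so a node that has not yet GC'd cannot re-send it to one that has.
    static boolean mayReadRepair(int deletedAtSeconds, int nowSeconds, int gcGraceSeconds) {
        return !graceElapsed(deletedAtSeconds, nowSeconds, gcGraceSeconds);
    }
}
```

In this sketch the two predicates are exact complements, so at any instant a given tombstone is either eligible for read repair or eligible for GC, never both.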
> Implementation detail: we'll need to add a 32-bit "time of tombstone"
> to ColumnFamily and SuperColumn. (For Column we can stick it in the
> byte[] value, since we already have an unambiguous way to know if the
> Column is in a deleted state.) We only need 32 bits since the time
> frame here is coarse enough that we don't need ms resolution. Also, we
> will use the system clock for these values, not the client timestamp,
> since we don't know what the source of the client timestamps is.
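
A minimal sketch of such a 32-bit "time of tombstone", assuming it is stored as whole seconds since the Unix epoch taken from the server's clock. TombstoneTimeSketch and all of its method names are hypothetical, and the 4-byte big-endian layout for the Column value is one possible encoding, not necessarily the one in the patch.

```java
import java.nio.ByteBuffer;

// Hypothetical helpers for a 32-bit deletion time; not Cassandra's code.
final class TombstoneTimeSketch {
    // Converts a millisecond clock reading to whole seconds in an int.
    // A signed 32-bit count of epoch seconds is valid until January 2038.
    static int toLocalDeletionTime(long systemMillis) {
        return (int) (systemMillis / 1000L);
    }

    // Reads the server's system clock, not a client-supplied timestamp.
    static int nowInSeconds() {
        return toLocalDeletionTime(System.currentTimeMillis());
    }

    // For a deleted Column: pack the deletion time into the byte[] value.
    static byte[] encode(int deletionTimeSeconds) {
        return ByteBuffer.allocate(4).putInt(deletionTimeSeconds).array();
    }

    // Inverse of encode; expects the 4-byte array produced above.
    static int decode(byte[] value) {
        return ByteBuffer.wrap(value).getInt();
    }
}
```

Keeping the clock reading separate from the conversion makes the conversion testable with a fixed input.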
> Admittedly this is suboptimal compared to being able to GC immediately,
> but it has the virtues of (a) being easily implemented, (b) requiring
> no extra components such as a coordination protocol, and (c) being
> better than not GCing tombstones at all (the other easy way to ensure
> correctness).
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.