Posted to commits@cassandra.apache.org by "Jonathan Ellis (JIRA)" <ji...@apache.org> on 2009/04/01 15:49:14 UTC

[jira] Created: (CASSANDRA-33) Bugs in tombstone handling in remove code

Bugs in tombstone handling in remove code
-----------------------------------------

                 Key: CASSANDRA-33
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-33
             Project: Cassandra
          Issue Type: Bug
            Reporter: Jonathan Ellis
            Assignee: Jonathan Ellis


[copied from dev list]

Avinash pointed out two bugs in my remove code.  One is easy to fix,
the other is tougher.

The easy one is that my code removes tombstones (deletion markers) at
the ColumnFamilyStore level, so when CassandraServer does read repair
it will not know about the tombstones and they will not be replicated
correctly.  This can be fixed by simply moving the removeDeleted call
up to just before CassandraServer's final return-to-client.

The hard one is that tombstones are problematic on GC (that is, major
compaction of SSTables, to use the Bigtable paper terminology).

One failure scenario: Node A, B, and C replicate some data.  C goes
down.  The data is deleted.  A and B delete it and later GC it.  C
comes back up.  C now has the only copy of the data so on read repair
the stale data will be sent to A and B.
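
The scenario above can be modeled in a few lines.  This is a minimal
sketch, not Cassandra's read repair code: SketchCell and
ReadRepairSketch.reconcile are hypothetical names, and it assumes that
reconciliation simply keeps the version with the highest timestamp,
with null standing for a replica that has nothing stored.

```java
// Hypothetical model of one replicated cell; not a Cassandra class.
final class SketchCell {
    final String value;    // null means this cell is a tombstone
    final long timestamp;  // write (or delete) timestamp

    SketchCell(String value, long timestamp) {
        this.value = value;
        this.timestamp = timestamp;
    }

    boolean isTombstone() {
        return value == null;
    }
}

final class ReadRepairSketch {
    // Assumed rule: the version with the highest timestamp wins.
    // A null entry stands for a replica that holds nothing for this cell,
    // for example because it already GC'd its tombstone.
    static SketchCell reconcile(SketchCell... replicas) {
        SketchCell winner = null;
        for (SketchCell c : replicas) {
            if (c != null && (winner == null || c.timestamp > winner.timestamp)) {
                winner = c;
            }
        }
        return winner;
    }
}
```

With data = new SketchCell("v", 1) and tombstone = new SketchCell(null, 2),
reconcile(null, null, data) returns the stale data (A and B have GC'd),
whereas reconcile(tombstone, tombstone, data) returns the tombstone.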

A solution: pick a number N such that we are confident that no node
will take longer than N days to come back up and catch up on hinted handoffs.
 (Default value: 10?)  Then, no node may GC tombstones before N days
have elapsed.  Also, after N days, tombstones will no longer be read
repaired.  (This prevents a node which has not yet GC'd from sending a
new tombstone copy to a node that has already GC'd.)
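
The two rules can be sketched as a pair of predicates.  This is an
illustration under stated assumptions, not the committed code: the
class and method names are hypothetical, times are in seconds, and
treating a tombstone that is exactly N days old as already past the
grace period is my choice, since the text does not pin down that
boundary.

```java
// Hypothetical sketch of the grace-period rules; names are illustrative.
final class TombstoneGrace {
    // N = 10 days, the tentative default suggested above, in seconds.
    static final int GC_GRACE_IN_SECONDS = 10 * 24 * 60 * 60;

    // A tombstone may be GC'd only once the grace period has elapsed.
    // The subtraction is done in long arithmetic to avoid int overflow.
    static boolean canPurge(int localDeletionTime, int nowInSeconds, int gcGraceInSeconds) {
        return (long) nowInSeconds - localDeletionTime >= gcGraceInSeconds;
    }

    // After the grace period a tombstone is no longer read repaired, so a
    // node that has not yet GC'd cannot hand it back to one that has.
    static boolean shouldReadRepair(int localDeletionTime, int nowInSeconds, int gcGraceInSeconds) {
        return !canPurge(localDeletionTime, nowInSeconds, gcGraceInSeconds);
    }
}
```

Making the two predicates exact complements means there is no point
in time at which a tombstone is both purgeable and still being repaired.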

Implementation detail: we'll need to add a 32-bit "time of tombstone"
to ColumnFamily and SuperColumn.  (For Column we can stick it in the
byte[] value, since we already have an unambiguous way to know if the
Column is in a deleted state.)  We only need 32 bits since the time
frame here is sufficiently coarse that we don't need ms.  Also, we
will use the system clock for these values, not the client timestamp,
since we don't know what the source of the client timestamps is.
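
As a rough illustration of the 32-bit "time of tombstone", the sketch
below truncates the system clock to whole seconds and packs the result
into four bytes, as one might for a Column's byte[] value.  The class
and method names are hypothetical, and the big-endian byte order is an
assumption; the text does not specify an encoding.

```java
// Hypothetical helper; not part of Cassandra.
final class TombstoneClock {
    // Truncate a millisecond clock reading to whole seconds in 32 bits.
    static int secondsSinceEpoch(long currentTimeMillis) {
        return (int) (currentTimeMillis / 1000L);
    }

    // Read the local system clock, not a client-supplied timestamp.
    static int nowInSeconds() {
        return secondsSinceEpoch(System.currentTimeMillis());
    }

    // Pack the 32-bit value into 4 bytes (big-endian, by assumption).
    static byte[] toBytes(int seconds) {
        return new byte[] {
            (byte) (seconds >>> 24),
            (byte) (seconds >>> 16),
            (byte) (seconds >>> 8),
            (byte) seconds
        };
    }

    // Inverse of toBytes.
    static int fromBytes(byte[] b) {
        return ((b[0] & 0xFF) << 24)
             | ((b[1] & 0xFF) << 16)
             | ((b[2] & 0xFF) << 8)
             | (b[3] & 0xFF);
    }
}
```

A signed 32-bit count of seconds since the Unix epoch runs out in
January 2038, which is the price of the smaller field.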

Admittedly this is suboptimal compared to being able to GC immediately
but it has the virtue of being (a) easily implemented, (b) with no
extra components such as a coordination protocol, and (c) better than
not GCing tombstones at all (the other easy way to ensure
correctness).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (CASSANDRA-33) Bugs in tombstone handling in remove code

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-33?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-33:
------------------------------------

    Attachment: 0001-preserve-tombstones-until-a-GC-grace-period-has-elap.patch

> Bugs in tombstone handling in remove code
> -----------------------------------------
>
>                 Key: CASSANDRA-33
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-33
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>             Fix For: 0.3
>
>         Attachments: 0001-preserve-tombstones-until-a-GC-grace-period-has-elap.patch, 0002-omit-tombstones-from-column_t-and-supercolumn_t-retu.patch

[jira] Issue Comment Edited: (CASSANDRA-33) Bugs in tombstone handling in remove code

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-33?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12700289#action_12700289 ] 

Jonathan Ellis edited comment on CASSANDRA-33 at 4/17/09 12:12 PM:
-------------------------------------------------------------------

fixed tests to make it more obvious what should be happening (in 4-and-5-v2).

      was (Author: jbellis):
    fixed tests to make it more obvious what should be happening.
  
> Bugs in tombstone handling in remove code
> -----------------------------------------
>
>                 Key: CASSANDRA-33
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-33
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>             Fix For: 0.3
>
>         Attachments: 0001-preserve-tombstones-until-a-GC-grace-period-has-elap.patch, 0002-omit-tombstones-from-column_t-and-supercolumn_t-retu.patch, 0003-make-GC_GRACE_IN_SECONDS-customizable-in-storage.con.patch, 0004-and-5-v2.patch, 0004_expose_remove_bug.patch, 0005_fix_exposed_remove_bug.patch

[jira] Resolved: (CASSANDRA-33) Bugs in tombstone handling in remove code

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-33?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis resolved CASSANDRA-33.
-------------------------------------

    Resolution: Fixed

committed

> Bugs in tombstone handling in remove code
> -----------------------------------------
>
>                 Key: CASSANDRA-33
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-33
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>             Fix For: 0.3
>
>         Attachments: 0001-preserve-tombstones-until-a-GC-grace-period-has-elap.patch, 0002-omit-tombstones-from-column_t-and-supercolumn_t-retu.patch, 0003-make-GC_GRACE_IN_SECONDS-customizable-in-storage.con.patch, 0004-and-5-v2.patch, 0004_expose_remove_bug.patch, 0005_fix_exposed_remove_bug.patch

[jira] Updated: (CASSANDRA-33) Bugs in tombstone handling in remove code

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-33?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-33:
------------------------------------

    Attachment: 0003-make-GC_GRACE_IN_SECONDS-customizable-in-storage.con.patch

> Bugs in tombstone handling in remove code
> -----------------------------------------
>
>                 Key: CASSANDRA-33
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-33
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>             Fix For: 0.3
>
>         Attachments: 0001-preserve-tombstones-until-a-GC-grace-period-has-elap.patch, 0002-omit-tombstones-from-column_t-and-supercolumn_t-retu.patch, 0003-make-GC_GRACE_IN_SECONDS-customizable-in-storage.con.patch

[jira] Updated: (CASSANDRA-33) Bugs in tombstone handling in remove code

Posted by "Jun Rao (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-33?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jun Rao updated CASSANDRA-33:
-----------------------------

    Attachment: 0006_fix_sequencefile_bug.patch

There is a serious bug in SequenceFile.java with the current fix. I have attached a patch as 0006.

Unfortunately, none of the unit tests catches the fact that many fields, such as localDeletionTime, markedForDeleteAt, and totalNumOfCols, were read incorrectly. The existing test cases pass only by luck.

> Bugs in tombstone handling in remove code
> -----------------------------------------
>
>                 Key: CASSANDRA-33
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-33
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>             Fix For: 0.3
>
>         Attachments: 0001-preserve-tombstones-until-a-GC-grace-period-has-elap.patch, 0002-omit-tombstones-from-column_t-and-supercolumn_t-retu.patch, 0003-make-GC_GRACE_IN_SECONDS-customizable-in-storage.con.patch, 0004-and-5-v2.patch, 0004_expose_remove_bug.patch, 0005_fix_exposed_remove_bug.patch, 0006_fix_sequencefile_bug.patch

[jira] Commented: (CASSANDRA-33) Bugs in tombstone handling in remove code

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-33?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12700410#action_12700410 ] 

Jonathan Ellis commented on CASSANDRA-33:
-----------------------------------------

created CASSANDRA-89 to remind us to add a test covering these code paths

> Bugs in tombstone handling in remove code
> -----------------------------------------
>
>                 Key: CASSANDRA-33
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-33
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>             Fix For: 0.3
>
>         Attachments: 0001-preserve-tombstones-until-a-GC-grace-period-has-elap.patch, 0002-omit-tombstones-from-column_t-and-supercolumn_t-retu.patch, 0003-make-GC_GRACE_IN_SECONDS-customizable-in-storage.con.patch, 0004-and-5-v2.patch, 0004_expose_remove_bug.patch, 0005_fix_exposed_remove_bug.patch, 0006_fix_sequencefile_bug.patch

[jira] Commented: (CASSANDRA-33) Bugs in tombstone handling in remove code

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-33?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12700217#action_12700217 ] 

Jonathan Ellis commented on CASSANDRA-33:
-----------------------------------------

added configuration patch.

> Bugs in tombstone handling in remove code
> -----------------------------------------
>
>                 Key: CASSANDRA-33
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-33
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>             Fix For: 0.3
>
>         Attachments: 0001-preserve-tombstones-until-a-GC-grace-period-has-elap.patch, 0002-omit-tombstones-from-column_t-and-supercolumn_t-retu.patch, 0003-make-GC_GRACE_IN_SECONDS-customizable-in-storage.con.patch

[jira] Reopened: (CASSANDRA-33) Bugs in tombstone handling in remove code

Posted by "Jun Rao (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-33?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jun Rao reopened CASSANDRA-33:
------------------------------


Reopening the issue because of the newly found bug.

> Bugs in tombstone handling in remove code
> -----------------------------------------
>
>                 Key: CASSANDRA-33
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-33
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>             Fix For: 0.3
>
>         Attachments: 0001-preserve-tombstones-until-a-GC-grace-period-has-elap.patch, 0002-omit-tombstones-from-column_t-and-supercolumn_t-retu.patch, 0003-make-GC_GRACE_IN_SECONDS-customizable-in-storage.con.patch, 0004-and-5-v2.patch, 0004_expose_remove_bug.patch, 0005_fix_exposed_remove_bug.patch, 0006_fix_sequencefile_bug.patch

[jira] Resolved: (CASSANDRA-33) Bugs in tombstone handling in remove code

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-33?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis resolved CASSANDRA-33.
-------------------------------------

    Resolution: Fixed

applied, thanks for catching that

> Bugs in tombstone handling in remove code
> -----------------------------------------
>
>                 Key: CASSANDRA-33
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-33
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>             Fix For: 0.3
>
>         Attachments: 0001-preserve-tombstones-until-a-GC-grace-period-has-elap.patch, 0002-omit-tombstones-from-column_t-and-supercolumn_t-retu.patch, 0003-make-GC_GRACE_IN_SECONDS-customizable-in-storage.con.patch, 0004-and-5-v2.patch, 0004_expose_remove_bug.patch, 0005_fix_exposed_remove_bug.patch, 0006_fix_sequencefile_bug.patch

[jira] Updated: (CASSANDRA-33) Bugs in tombstone handling in remove code

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-33?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-33:
------------------------------------

    Fix Version/s: 0.3

> Bugs in tombstone handling in remove code
> -----------------------------------------
>
>                 Key: CASSANDRA-33
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-33
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>             Fix For: 0.3

[jira] Resolved: (CASSANDRA-33) Bugs in tombstone handling in remove code

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-33?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis resolved CASSANDRA-33.
-------------------------------------

    Resolution: Fixed

committed.

> We should probably open another issue to clean up the code such that the row key and row size are not written to outBuf. 

it's not clear to me what the right cleanup is for this specific piece since the context is so messy. :(

> Bugs in tombstone handling in remove code
> -----------------------------------------
>
>                 Key: CASSANDRA-33
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-33
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>             Fix For: 0.3
>
>         Attachments: 0001-preserve-tombstones-until-a-GC-grace-period-has-elap.patch, 0002-omit-tombstones-from-column_t-and-supercolumn_t-retu.patch, 0003-make-GC_GRACE_IN_SECONDS-customizable-in-storage.con.patch, 0004-and-5-v2.patch, 0004_expose_remove_bug.patch, 0005_fix_exposed_remove_bug.patch, 0006_fix_sequencefile_bug.patch, 0007_fix_another_sequencefile_bug.patch

[jira] Updated: (CASSANDRA-33) Bugs in tombstone handling in remove code

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-33?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-33:
------------------------------------

    Attachment: 0002-omit-tombstones-from-column_t-and-supercolumn_t-retu.patch

> Bugs in tombstone handling in remove code
> -----------------------------------------
>
>                 Key: CASSANDRA-33
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-33
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>             Fix For: 0.3
>
>         Attachments: 0001-preserve-tombstones-until-a-GC-grace-period-has-elap.patch, 0002-omit-tombstones-from-column_t-and-supercolumn_t-retu.patch

[jira] Updated: (CASSANDRA-33) Bugs in tombstone handling in remove code

Posted by "Jun Rao (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-33?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jun Rao updated CASSANDRA-33:
-----------------------------

    Attachment: 0004_expose_remove_bug.patch

Haven't looked at the patch in detail, but I found another related remove bug. Patch 0004 adds two test cases; one of them, testRemoveColumn1(), exposes the problem and will fail.


> Bugs in tombstone handling in remove code
> -----------------------------------------
>
>                 Key: CASSANDRA-33
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-33
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>             Fix For: 0.3
>
>         Attachments: 0001-preserve-tombstones-until-a-GC-grace-period-has-elap.patch, 0002-omit-tombstones-from-column_t-and-supercolumn_t-retu.patch, 0003-make-GC_GRACE_IN_SECONDS-customizable-in-storage.con.patch, 0004_expose_remove_bug.patch, 0005_fix_exposed_remove_bug.patch

[jira] Commented: (CASSANDRA-33) Bugs in tombstone handling in remove code

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-33?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12700208#action_12700208 ] 

Jonathan Ellis commented on CASSANDRA-33:
-----------------------------------------

MultiGet doesn't seem to be coming any time soon.  Guess we'll just have to deal with conflict resolution when it does.

Re the patches provided, they follow the outline above except that it turns out we can just use a single removeDeleted to handle both tombstones (which still suppress old data "below" them in the tree, e.g., a deleted supercolumn does not need to keep its subcolumn data around) and GC.  So CFS and compaction can still call removeDeleted, and then CassandraServer just has to remove the tombstones themselves in thriftifyColumns and thriftifySuperColumns.

Patch to make GC_GRACE_IN_SECONDS configurable forthcoming.
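
To make the two-stage handling concrete, here is a rough sketch of the idea, not the code from the attached patches. SketchColumn, TombstoneSketch, liveColumnNames and the gcGraceSeconds parameter are hypothetical names made up for illustration; the only things taken from this ticket are the removeDeleted name, the grace period, and the use of a seconds-resolution system-clock deletion time. Treating a tombstone as collectable once the elapsed time is greater than or equal to the grace period is also an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for a column; not Cassandra's actual Column class.
final class SketchColumn {
    final String name;
    final boolean deleted;        // true when this entry is a tombstone
    final int localDeletionTime;  // seconds, taken from the node's system clock

    SketchColumn(String name, boolean deleted, int localDeletionTime) {
        this.name = name;
        this.deleted = deleted;
        this.localDeletionTime = localDeletionTime;
    }
}

final class TombstoneSketch {
    // Storage-level pass (CFS and compaction): drop only tombstones whose grace
    // period has elapsed. Younger tombstones are kept so read repair can still see them.
    static List<SketchColumn> removeDeleted(List<SketchColumn> columns,
                                            int nowSeconds,
                                            int gcGraceSeconds) {
        List<SketchColumn> kept = new ArrayList<>();
        for (SketchColumn c : columns) {
            // long arithmetic avoids int overflow on the subtraction
            long elapsed = (long) nowSeconds - c.localDeletionTime;
            boolean collectable = c.deleted && elapsed >= gcGraceSeconds;
            if (!collectable) {
                kept.add(c);
            }
        }
        return kept;
    }

    // Client-facing pass: strip the remaining tombstones just before returning
    // results, so clients never see deletion markers.
    static List<String> liveColumnNames(List<SketchColumn> columns) {
        List<String> names = new ArrayList<>();
        for (SketchColumn c : columns) {
            if (!c.deleted) {
                names.add(c.name);
            }
        }
        return names;
    }
}
```

The point of the split is that anything doing read repair works on the output of the first pass, tombstones included, while only the final return-to-client path applies the second.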

> Bugs in tombstone handling in remove code
> -----------------------------------------
>
>                 Key: CASSANDRA-33
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-33
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>             Fix For: 0.3
>
>         Attachments: 0001-preserve-tombstones-until-a-GC-grace-period-has-elap.patch, 0002-omit-tombstones-from-column_t-and-supercolumn_t-retu.patch

[jira] Commented: (CASSANDRA-33) Bugs in tombstone handling in remove code

Posted by "Jun Rao (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-33?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12700283#action_12700283 ] 

Jun Rao commented on CASSANDRA-33:
----------------------------------

Another issue is that if you look at testRemoveColumn1() and testRemoveColumn2(), retrieved.getColumn("Column1") returns null. I am wondering how that affects read repairs.

Also, you need to patch test/conf/storage-conf.xml for GC_GRACE_IN_SECONDS.


> Bugs in tombstone handling in remove code
> -----------------------------------------
>
>                 Key: CASSANDRA-33
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-33
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>             Fix For: 0.3
>
>         Attachments: 0001-preserve-tombstones-until-a-GC-grace-period-has-elap.patch, 0002-omit-tombstones-from-column_t-and-supercolumn_t-retu.patch, 0003-make-GC_GRACE_IN_SECONDS-customizable-in-storage.con.patch, 0004_expose_remove_bug.patch, 0005_fix_exposed_remove_bug.patch

[jira] Commented: (CASSANDRA-33) Bugs in tombstone handling in remove code

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-33?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12700290#action_12700290 ] 

Jonathan Ellis commented on CASSANDRA-33:
-----------------------------------------

We do need the ability to read-repair CF and SC tombstones.  I'll open a separate ticket for that.

The behavior retrieved.getColumn("Column1") == null is correct because, if we've deleted the CF (more recently than the column!), then we don't care what data used to be there; it should just be gone.  What we need to RR is the CF tombstone.

Of course, if there is a column tombstone _more_ recent than the CF tombstone, then it should be preserved.
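
A small sketch of that rule, for illustration only. CfDeleteSketch, Col and applyCfTombstone are hypothetical names, not classes from the codebase or the patches, and treating a timestamp tie as "CF delete wins" is an assumption. Any column whose timestamp is not newer than the CF deletion is dropped, which is why a lookup returns null; a column tombstone written after the CF delete survives.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class CfDeleteSketch {
    // Hypothetical column record: a timestamp plus a flag marking it as a tombstone.
    static final class Col {
        final long timestamp;
        final boolean tombstone;

        Col(long timestamp, boolean tombstone) {
            this.timestamp = timestamp;
            this.tombstone = tombstone;
        }
    }

    // Keep only columns (live or tombstone) written strictly after the CF was deleted.
    // Everything at or before cfDeletedAt is suppressed by the CF tombstone.
    static Map<String, Col> applyCfTombstone(Map<String, Col> columns, long cfDeletedAt) {
        Map<String, Col> survivors = new LinkedHashMap<>();
        for (Map.Entry<String, Col> e : columns.entrySet()) {
            if (e.getValue().timestamp > cfDeletedAt) {
                survivors.put(e.getKey(), e.getValue());
            }
        }
        return survivors;
    }
}
```

With a CF deleted at timestamp 3, a live Column1 written at 1 disappears from the result, while a Column2 tombstone written at 5 is retained.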

> Bugs in tombstone handling in remove code
> -----------------------------------------
>
>                 Key: CASSANDRA-33
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-33
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>             Fix For: 0.3
>
>         Attachments: 0001-preserve-tombstones-until-a-GC-grace-period-has-elap.patch, 0002-omit-tombstones-from-column_t-and-supercolumn_t-retu.patch, 0003-make-GC_GRACE_IN_SECONDS-customizable-in-storage.con.patch, 0004-and-5-v2.patch, 0004_expose_remove_bug.patch, 0005_fix_exposed_remove_bug.patch

[jira] Reopened: (CASSANDRA-33) Bugs in tombstone handling in remove code

Posted by "Jun Rao (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-33?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jun Rao reopened CASSANDRA-33:
------------------------------


Reopening the issue because of the newly found bug.

> Bugs in tombstone handling in remove code
> -----------------------------------------
>
>                 Key: CASSANDRA-33
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-33
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>             Fix For: 0.3
>
>         Attachments: 0001-preserve-tombstones-until-a-GC-grace-period-has-elap.patch, 0002-omit-tombstones-from-column_t-and-supercolumn_t-retu.patch, 0003-make-GC_GRACE_IN_SECONDS-customizable-in-storage.con.patch, 0004-and-5-v2.patch, 0004_expose_remove_bug.patch, 0005_fix_exposed_remove_bug.patch, 0006_fix_sequencefile_bug.patch, 0007_fix_another_sequencefile_bug.patch

[jira] Commented: (CASSANDRA-33) Bugs in tombstone handling in remove code

Posted by "Eric Evans (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-33?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12700294#action_12700294 ] 

Eric Evans commented on CASSANDRA-33:
-------------------------------------

+1 to 0001-0003 and 0004-and-5-v2.patch

> Bugs in tombstone handling in remove code
> -----------------------------------------
>
>                 Key: CASSANDRA-33
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-33
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>             Fix For: 0.3
>
>         Attachments: 0001-preserve-tombstones-until-a-GC-grace-period-has-elap.patch, 0002-omit-tombstones-from-column_t-and-supercolumn_t-retu.patch, 0003-make-GC_GRACE_IN_SECONDS-customizable-in-storage.con.patch, 0004-and-5-v2.patch, 0004_expose_remove_bug.patch, 0005_fix_exposed_remove_bug.patch

[jira] Commented: (CASSANDRA-33) Bugs in tombstone handling in remove code

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-33?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12700285#action_12700285 ] 

Jonathan Ellis commented on CASSANDRA-33:
-----------------------------------------

+1 for Jun's patches.

IMO test/conf is fine using the default value.

> Bugs in tombstone handling in remove code
> -----------------------------------------
>
>                 Key: CASSANDRA-33
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-33
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>             Fix For: 0.3
>
>         Attachments: 0001-preserve-tombstones-until-a-GC-grace-period-has-elap.patch, 0002-omit-tombstones-from-column_t-and-supercolumn_t-retu.patch, 0003-make-GC_GRACE_IN_SECONDS-customizable-in-storage.con.patch, 0004_expose_remove_bug.patch, 0005_fix_exposed_remove_bug.patch


[jira] Updated: (CASSANDRA-33) Bugs in tombstone handling in remove code

Posted by "Jun Rao (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-33?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jun Rao updated CASSANDRA-33:
-----------------------------

    Attachment: 0007_fix_another_sequencefile_bug.patch

Include a patch for another bug in SequenceFile.java, where the size of the row is not calculated correctly. At the moment this bug is not exposed, since the row key and the row size written to outBuf are simply read and discarded in SSTable.next().

We should probably open another issue to clean up the code such that the row key and row size are not written to outBuf.


> Bugs in tombstone handling in remove code
> -----------------------------------------
>
>                 Key: CASSANDRA-33
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-33
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>             Fix For: 0.3
>
>         Attachments: 0001-preserve-tombstones-until-a-GC-grace-period-has-elap.patch, 0002-omit-tombstones-from-column_t-and-supercolumn_t-retu.patch, 0003-make-GC_GRACE_IN_SECONDS-customizable-in-storage.con.patch, 0004-and-5-v2.patch, 0004_expose_remove_bug.patch, 0005_fix_exposed_remove_bug.patch, 0006_fix_sequencefile_bug.patch, 0007_fix_another_sequencefile_bug.patch


[jira] Updated: (CASSANDRA-33) Bugs in tombstone handling in remove code

Posted by "Jun Rao (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-33?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jun Rao updated CASSANDRA-33:
-----------------------------

    Attachment: 0005_fix_exposed_remove_bug.patch

Attached a patch that fixes the problem exposed by the 0004 patch.

> Bugs in tombstone handling in remove code
> -----------------------------------------
>
>                 Key: CASSANDRA-33
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-33
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>             Fix For: 0.3
>
>         Attachments: 0001-preserve-tombstones-until-a-GC-grace-period-has-elap.patch, 0002-omit-tombstones-from-column_t-and-supercolumn_t-retu.patch, 0003-make-GC_GRACE_IN_SECONDS-customizable-in-storage.con.patch, 0004_expose_remove_bug.patch, 0005_fix_exposed_remove_bug.patch


[jira] Updated: (CASSANDRA-33) Bugs in tombstone handling in remove code

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-33?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-33:
------------------------------------

    Attachment: 0004-and-5-v2.patch

Fixed the tests to make it more obvious what should be happening.

> Bugs in tombstone handling in remove code
> -----------------------------------------
>
>                 Key: CASSANDRA-33
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-33
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>             Fix For: 0.3
>
>         Attachments: 0001-preserve-tombstones-until-a-GC-grace-period-has-elap.patch, 0002-omit-tombstones-from-column_t-and-supercolumn_t-retu.patch, 0003-make-GC_GRACE_IN_SECONDS-customizable-in-storage.con.patch, 0004-and-5-v2.patch, 0004_expose_remove_bug.patch, 0005_fix_exposed_remove_bug.patch
