Posted to user@cassandra.apache.org by aaron morton <aa...@thelastpickle.com> on 2012/11/01 00:56:03 UTC

Re: repair, compaction, and tombstone rows

> Is this a feature or a bug?  
Yes :)

You are probably hitting a bit of an edge case. 

Maybe purgeable tombstones could be ignored in the Merkle tree calculation and skipped during streaming? (I have not checked the code to see whether they already are.)
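
In the meantime, a possible workaround is to force a major compaction on every replica before running the repair, so that all nodes purge the purgeable tombstones and their Merkle trees agree. A rough sketch (host, keyspace and CF names below are made up):

  # compact each replica first so purgeable tombstones are collected everywhere
  for host in node1 node2 node3; do
      nodetool -h "$host" compact my_keyspace my_cf
  done
  # then run the repair; the replicas should now agree those rows are gone
  nodetool -h node1 repair my_keyspace my_cf

Bear in mind a major compaction under size tiered compaction leaves you with one big SSTable, so this has its own costs.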

Can you create a ticket on https://issues.apache.org/jira/browse/CASSANDRA and describe the problem? 

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 1/11/2012, at 8:04 AM, Bryan Talbot <bt...@aeriagames.com> wrote:

> I've been seeing an undesirable behavior that seems like a bug and causes a large amount of wasted work.
> 
> I have a CF where all columns have a TTL, are generally all inserted within a very short period of time (less than a second), and are never overwritten or explicitly deleted.  Eventually one node will run a compaction and remove rows containing only tombstones older than gc_grace_seconds, which is expected.  
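> 
> For concreteness, here is roughly how the CF is created and written (the keyspace, CF, and values below are made up; the point is that every column gets a TTL at insert time):
> 
> # create the CF and insert a column with a TTL, via cassandra-cli
> cassandra-cli -h localhost <<'EOF'
> use my_keyspace;
> create column family session_data
>     with comparator = UTF8Type
>     and default_validation_class = UTF8Type
>     and key_validation_class = UTF8Type
>     and gc_grace = 864000;
> set session_data['abc123']['payload'] = 'hello' with ttl = 86400;
> EOF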
> 
> The problem comes up when a repair is run.  During the repair, the other nodes that haven't yet run a compaction and still hold the tombstoned rows "fix" the inconsistency by streaming those rows (each containing only a tombstone more than gc_grace_seconds old) back to the node that had compacted them away.  This happens over and over, and uses a lot of time, storage, and bandwidth to keep repairing rows that are intentionally missing.
> 
> I think the issue stems from how compaction of TTL'd rows interacts with repair.  Tombstone collection during compaction is a node-local event, so tombstoned rows eventually disappear from whichever node compacts them first and then get "repaired" back from the replicas later.  I guess this could happen for explicitly deleted rows as well.
> 
> Is this a feature or a bug?  How can I avoid repairing rows that were correctly removed via compaction on one node but not yet on its replicas, simply because compactions run independently on each node?  Every repair ends up streaming tens of gigabytes of "missing" rows to and from replicas.
> 
> Cassandra 1.1.5 with the size-tiered compaction strategy and RF=3
> 
> -Bryan
> 