Posted to commits@cassandra.apache.org by "Benjamin Roth (JIRA)" <ji...@apache.org> on 2016/12/05 08:01:11 UTC
[jira] [Updated] (CASSANDRA-12991) Inter-node race condition in validation compaction
[ https://issues.apache.org/jira/browse/CASSANDRA-12991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Benjamin Roth updated CASSANDRA-12991:
--------------------------------------
Description:
Problem:
When a repair triggers a validation compaction, in-flight mutations can cause the Merkle trees to differ even though the data is actually consistent.
Example:
t = 10000:
Repair starts validation
Node A starts validation
t = 10001:
Mutation arrives at Node A
t = 10002:
Mutation arrives at Node B
t = 10003:
Node B starts validation
The hashes of nodes A and B will differ, but the data is consistent when viewed as of t = 10000 (think of it as a snapshot taken at that time).
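The timeline above can be replayed as a small sketch. This is not Cassandra code: the `digest` function is a hypothetical stand-in for a Merkle tree root, the cell tuples are made up, and the mutation's write timestamp is assumed to be 10001.

```python
import hashlib


def digest(cells):
    """Hash cells in a deterministic order (stand-in for a Merkle tree root)."""
    h = hashlib.sha256()
    for name, value, write_ts in sorted(cells):
        h.update(f"{name}={value}@{write_ts};".encode())
    return h.hexdigest()


# Both replicas start with identical data: (name, value, write timestamp).
base = [("k1", "v1", 9000)]
mutation = ("k2", "v2", 10001)  # assumed write timestamp, after repair start

node_a = list(base)
node_b = list(base)

hash_a = digest(node_a)   # t = 10000: node A starts validation
node_a.append(mutation)   # t = 10001: mutation arrives at node A
node_b.append(mutation)   # t = 10002: mutation arrives at node B
hash_b = digest(node_b)   # t = 10003: node B starts validation

hashes_differ = hash_a != hash_b
replicas_identical = sorted(node_a) == sorted(node_b)
print(hashes_differ, replicas_identical)  # prints "True True"
```

Both replicas end up with the same cells, yet the two digests disagree because node A hashed before the mutation arrived and node B hashed after.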
Impact:
Unnecessary streaming happens. The impact may be small for low-traffic CFs and partitions, but for high-traffic CFs and possibly very big partitions it can be considerably larger, and it is a waste of resources.
Possible solution:
Build hashes based upon a snapshot timestamp.
This requires SSTables created after that timestamp to be filtered when doing a validation compaction:
- Cells with timestamp > snapshot time have to be removed
- Tombstone range markers have to be handled:
  - Bounds have to be removed if delete timestamp > snapshot time
  - Boundary markers have to be either changed to a bound or completely removed, depending on whether the start, the end, or both are affected
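The filtering rules in the list above could be sketched roughly as follows. This is only an illustration under stated assumptions: `Cell`, `Bound`, `Boundary`, `filter_atom` and `filter_partition` are hypothetical names and not Cassandra classes, and a boundary marker is modelled as closing one deletion and opening another at the same position.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass(frozen=True)
class Cell:
    name: str
    value: str
    timestamp: int


@dataclass(frozen=True)
class Bound:
    clustering: str
    is_open: bool        # True = opens a range deletion, False = closes one
    deletion_ts: int


@dataclass(frozen=True)
class Boundary:
    clustering: str
    close_deletion_ts: int   # deletion that ends at this position
    open_deletion_ts: int    # deletion that starts at this position


Atom = Union[Cell, Bound, Boundary]


def filter_atom(atom: Atom, snapshot_ts: int) -> Optional[Atom]:
    """Return the atom as seen at snapshot_ts, or None if it must be dropped."""
    if isinstance(atom, Cell):
        return atom if atom.timestamp <= snapshot_ts else None
    if isinstance(atom, Bound):
        return atom if atom.deletion_ts <= snapshot_ts else None
    close_ok = atom.close_deletion_ts <= snapshot_ts
    open_ok = atom.open_deletion_ts <= snapshot_ts
    if close_ok and open_ok:
        return atom                       # neither side affected: keep as is
    if close_ok:
        return Bound(atom.clustering, False, atom.close_deletion_ts)
    if open_ok:
        return Bound(atom.clustering, True, atom.open_deletion_ts)
    return None                           # both sides affected: remove


def filter_partition(atoms: List[Atom], snapshot_ts: int) -> List[Atom]:
    """Apply filter_atom to every atom and drop the removed ones."""
    kept = (filter_atom(a, snapshot_ts) for a in atoms)
    return [a for a in kept if a is not None]


snapshot_ts = 10000
partition = [
    Cell("a", "old", 9000),
    Cell("b", "new", 10001),
    Bound("c", True, 9500),
    Boundary("d", 9500, 10002),
    Bound("e", False, 10002),
]
filtered = filter_partition(partition, snapshot_ts)
for atom in filtered:
    print(atom)
```

In this made-up partition the cell written at 10001 and the close bound at "e" are dropped, and the boundary at "d" becomes a plain close bound for the older deletion, so what remains is the state as of t = 10000.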
This is probably a known behaviour. Have there been any discussions about this in the past? I did not find a matching issue, so I created this one.
I am happy about any feedback whatsoever.
was:
Problem:
When a validation compaction is triggered by a repair, it may happen that, due to in-flight mutations, the Merkle trees differ but the data is not consistent.
Example:
t = 10000:
Repair starts validation
Node A starts validation
t = 10001:
Mutation arrives at Node A
t = 10002:
Mutation arrives at Node B
t = 10003:
Node B starts validation
The hashes of nodes A and B will differ, but the data is consistent when viewed as of t = 10000 (think of it as a snapshot taken at that time).
Impact:
Unnecessary streaming happens. The impact may be small for low-traffic CFs and partitions, but for high-traffic CFs and possibly very big partitions it can be considerably larger, and it is a waste of resources.
Possible solution:
Build hashes based upon a snapshot timestamp.
This requires SSTables created after that timestamp to be filtered when doing a validation compaction:
- Cells with timestamp > snapshot time have to be removed
- Tombstone range markers have to be handled:
  - Bounds have to be removed if delete timestamp > snapshot time
  - Boundary markers have to be either changed to a bound or completely removed, depending on whether the start, the end, or both are affected
This is probably a known behaviour. Have there been any discussions about this in the past? I did not find a matching issue, so I created this one.
I am happy about any feedback whatsoever.
> Inter-node race condition in validation compaction
> --------------------------------------------------
>
> Key: CASSANDRA-12991
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12991
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Benjamin Roth
> Priority: Minor
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)