Posted to commits@cassandra.apache.org by "Nadav Har'El (JIRA)" <ji...@apache.org> on 2015/11/18 15:53:11 UTC

[jira] [Commented] (CASSANDRA-10728) Hash used in repair does not include partition key

    [ https://issues.apache.org/jira/browse/CASSANDRA-10728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15011118#comment-15011118 ] 

Nadav Har'El commented on CASSANDRA-10728:
------------------------------------------

What if one replica has partition 'a' with value 1 and some timestamp, and a second replica has partition 'b' with value 1 and the same timestamp - and 'a' and 'b' happen to be close enough in their tokens to fall in the same Merkle tree range?

I realize this is a very unlikely case (especially considering the need for the timestamps to be identical, which I hadn't considered before). But it seems possible... For example, consider a contrived use case which uses Cassandra to store a large set of keys, where the value of each key is always set to "1" (or whatever). Now, at exactly the same time (at millisecond resolution, which is Cassandra's default), two servers want to write two different keys "a" and "b" - and because of a network partition in the cluster, "a" ends up on one machine and "b" on another - and both writes carry the same timestamp (at millisecond resolution, that's not completely improbable).
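To make this concrete, here is a toy sketch (plain Java, not Cassandra code) of the failure mode: if the per-partition digest covers only the cell value and timestamp, the two writes above produce identical digests for different keys, so once the token is dropped from the range hash the two replicas compare equal:

    import java.nio.ByteBuffer;
    import java.security.MessageDigest;
    import java.util.Arrays;

    public class DigestCollision {
        // Stand-in for the per-partition digest: hashes value + timestamp,
        // deliberately omitting the partition key (as described in the
        // issue below).
        static byte[] partitionDigest(long value, long timestampMillis) throws Exception {
            MessageDigest md = MessageDigest.getInstance("MD5");
            ByteBuffer buf = ByteBuffer.allocate(16);
            buf.putLong(value).putLong(timestampMillis);
            return md.digest(buf.array());
        }

        public static void main(String[] args) throws Exception {
            long ts = 1447854791000L;                  // same millisecond on both replicas
            byte[] replica1 = partitionDigest(1L, ts); // partition 'a' holds value 1
            byte[] replica2 = partitionDigest(1L, ts); // partition 'b' holds value 1
            // The key never enters the hash, so the digests agree and repair
            // would see the two replicas as consistent.
            System.out.println(Arrays.equals(replica1, replica2)); // prints true
        }
    }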

> Hash used in repair does not include partition key
> --------------------------------------------------
>
>                 Key: CASSANDRA-10728
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10728
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Nadav Har'El
>            Priority: Minor
>
> When the repair code builds the Merkle tree, it appears to use AbstractCompactedRow.update() to calculate a partition's hash. This method's documentation states that it calculates a "digest with the data bytes of the row (not including row key or row size)". The code itself seems to agree with this comment.
> However, I believe that not including the row (actually, partition) key in the hash function is a mistake: it means that if two nodes have the same data under different keys, repair would not notice this discrepancy. Moreover, if two different keys have their data switched - or have the same data - again this would not be noticed by repair. Actually running into this problem in a real repair is not very likely, but I can imagine seeing it easily in a hypothetical use case where all partitions have exactly the same data and only the partition key matters.
> I am sorry if I'm mistaken and the partition key is actually taken into account in the Merkle tree, but I tried to find evidence that it is, and failed. Glancing over the code, it almost seems that the key is used: Validator.add() calculates rowHash(), which includes both the digest (without the partition key) *and* the key's token. But then the code calls MerkleTree.TreeRange.addHash() on that tuple, and that function conspicuously ignores the token and uses only the digest.
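> To illustrate this reading of the code path (a simplified sketch; the class names follow the ones above, but the bodies are my paraphrase, not the actual implementation):
>
>     // RowHash pairs the key's token with a digest of the data bytes.
>     class RowHash {
>         final long token;     // derived from the partition key
>         final byte[] digest;  // covers only the partition's data bytes
>         RowHash(long token, byte[] digest) { this.token = token; this.digest = digest; }
>     }
>
>     class TreeRange {
>         private byte[] rangeDigest;
>
>         // The suspected problem: only hash.digest is folded into the range's
>         // running digest; hash.token - the key's only contribution - is dropped.
>         void addHash(RowHash hash) {
>             rangeDigest = xor(rangeDigest, hash.digest);
>         }
>
>         // Assumes digests are combined by XOR; the exact mixing function
>         // doesn't matter for the argument, only that the token is unused.
>         private static byte[] xor(byte[] acc, byte[] d) {
>             if (acc == null) return d.clone();
>             byte[] out = new byte[acc.length];
>             for (int i = 0; i < acc.length; i++) out[i] = (byte) (acc[i] ^ d[i]);
>             return out;
>         }
>     }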



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)