Posted to commits@cassandra.apache.org by "Christopher Smith (JIRA)" <ji...@apache.org> on 2013/09/27 05:36:03 UTC

[jira] [Comment Edited] (CASSANDRA-6106) QueryState.getTimestamp() & FBUtilities.timestampMicros() reads current timestamp with System.currentTimeMillis() * 1000 instead of System.nanoTime() / 1000

    [ https://issues.apache.org/jira/browse/CASSANDRA-6106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13779623#comment-13779623 ] 

Christopher Smith edited comment on CASSANDRA-6106 at 9/27/13 3:35 AM:
-----------------------------------------------------------------------

Look at the above description and also look at the article. LWT doesn't fix this. You could use a vector clock, but then you have all the hell that comes with that.

I agree the "still possible" is really dumb and a violation of the guarantees that Cassandra documents. As long as Cassandra has this mechanism though, we should make the probabilities way, way lower. With this change the probability of a collision gets to around the kind of odds as UUID collisions (clarification: the odds are still much higher than type 1 UUID's as they are 128-bit, but at least it is as good as you can do with 64b-bit... note that the mechanism employed is quite similar to how type 1's try to avoid collisions), which I think for practical purposes is "good enough".

Note that the current "+1" trick also creates potential backwards-ordering problems (if you write twice in one millisecond to node A and once in the same millisecond to node B, the second write to node A is treated as having been last, even if it happened 999 microseconds before the write to node B).
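
(For context, the "+1" trick referred to above is, roughly, a clamp of the following shape. This is a simplified sketch of the behaviour being described, not the actual Cassandra source.)

    import java.util.concurrent.atomic.AtomicLong;

    // Simplified sketch of the "+1" clamp: the generator never re-issues a timestamp at
    // or below the last one it handed out, so a burst of writes inside one millisecond
    // gets stamped T*1000, T*1000 + 1, T*1000 + 2, ...
    public final class PlusOneClamp
    {
        private final AtomicLong lastMicros = new AtomicLong(Long.MIN_VALUE);

        public long next()
        {
            while (true)
            {
                long now = System.currentTimeMillis() * 1000;   // only millisecond precision
                long last = lastMicros.get();
                long next = now > last ? now : last + 1;        // the "+1" trick
                if (lastMicros.compareAndSet(last, next))
                    return next;
            }
        }
    }

    // Within the same millisecond T:
    //   node A, write 1: T*1000
    //   node A, write 2: T*1000 + 1   (sorts last...)
    //   node B, write 1: T*1000       (...even if it actually happened after A's second write)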

Cassandra should use a different mechanism to resolve concurrent writes with the same timestamp. I would propose something more like this (a rough sketch in Java follows the list):

If two nodes have different values for a cell, but have the same timestamp for the cell:

1) Compute the "token" for the record.
2) Compute replicas 1 to N for that token and assign each replica node in the datacenter its position 1 to N.
3) If there is a tie, the win goes to the replica on the node with the highest value from step 2.
4) If two datacenters are tied, each with the same highest value from step 2 (note this favours datacenters with higher replication factors, which seems... good to me), resolve in favour of the datacenter whose name alphasorts lowest.
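
(A rough sketch of steps 1-4. Everything here is hypothetical and for illustration only: Candidate, replicaRank and so on are placeholder names, not Cassandra's real classes, and the token/replica computation of steps 1-2 is assumed to have already produced each node's rank.)

    // Hypothetical sketch of the proposed tie-break for two conflicting cell values
    // that carry identical timestamps.
    final class TimestampTieBreaker
    {
        static final class Candidate
        {
            final String datacenter;   // name of the datacenter the replica lives in
            final int replicaRank;     // step 2: the node's position 1..N in the replica list for the token
            final Object value;        // the conflicting cell value

            Candidate(String datacenter, int replicaRank, Object value)
            {
                this.datacenter = datacenter;
                this.replicaRank = replicaRank;
                this.value = value;
            }
        }

        static Candidate resolve(Candidate a, Candidate b)
        {
            // Step 3: the highest replica rank wins.
            if (a.replicaRank != b.replicaRank)
                return a.replicaRank > b.replicaRank ? a : b;
            // Step 4: ranks tie across datacenters -> the datacenter whose name alphasorts lowest wins.
            return a.datacenter.compareTo(b.datacenter) <= 0 ? a : b;
        }
    }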

                
> QueryState.getTimestamp() & FBUtilities.timestampMicros() reads current timestamp with System.currentTimeMillis() * 1000 instead of System.nanoTime() / 1000
> ------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-6106
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6106
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: DSE Cassandra 3.1, but also HEAD
>            Reporter: Christopher Smith
>            Priority: Minor
>              Labels: collision, conflict, timestamp
>         Attachments: microtimstamp.patch
>
>
> I noticed this blog post: http://aphyr.com/posts/294-call-me-maybe-cassandra mentions issues with millisecond rounding in timestamps, and I was able to reproduce the issue. If I specify a timestamp in a mutating query, I get microsecond precision, but if I don't, I get timestamps rounded to the nearest millisecond, at least for my first query on a given connection, which substantially increases the probability of collision.
> I believe I found the offending code, though I am by no means sure this is comprehensive. I think we probably need a fairly comprehensive replacement of all uses of System.currentTimeMillis() with System.nanoTime().
> There seems to be some confusion here, so I'd like to clarify: the purpose of this patch is NOT to improve the precision of ordering guarantees for concurrent writes to cells. The purpose of this patch is to reduce the probability that concurrent writes to cells are deemed as having occurred at *the same time*, which is when Cassandra violates its atomicity guarantee.
> To clarify the failure scenario: Cassandra promises that writes to the same record are "atomic", so if you do something like:
> create table foo (
>     i int PRIMARY KEY,
>     x int,
>     y int
> );
> and then send these two queries concurrently:
> insert into foo (i, x, y) values (1, 8, -8);
> insert into foo (i, x, y) values (1, -8, 8);
> you can't be quite sure which of the two writes will be the "last" one, but you do know that if you do:
> select x, y from foo where i = 1;
> you don't know if x is "8" or "-8".
> you don't know if y is "-8" or "8".
> YOU DO KNOW: x + y will equal 0.
> EXCEPT... if the timestamps assigned to the two queries are *exactly* the same, in which case x + y = 16. :-( Now your writes are not atomic.
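
(To make the quoted failure mode concrete: a minimal sketch of per-cell last-write-wins reconciliation, where a plain integer comparison stands in for the byte-wise value comparison that actually breaks timestamp ties. This is an illustration, not Cassandra's reconciliation code. Because each cell is reconciled independently, a timestamp tie can merge cells taken from two different writes into one row.)

    // Minimal sketch: per-cell last-write-wins, with a value comparison breaking
    // timestamp ties. Each cell resolves on its own, so a tie can mix the two writes.
    final class CellReconcileDemo
    {
        static final class Cell
        {
            final long timestamp;
            final int value;
            Cell(long timestamp, int value) { this.timestamp = timestamp; this.value = value; }
        }

        static Cell reconcile(Cell a, Cell b)
        {
            if (a.timestamp != b.timestamp)
                return a.timestamp > b.timestamp ? a : b;    // normal last-write-wins
            return a.value >= b.value ? a : b;               // timestamp tie: value breaks it
        }

        public static void main(String[] args)
        {
            long t = 1234567890123456L;                      // both writes got the same timestamp
            // write 1: (x=8, y=-8)      write 2: (x=-8, y=8)
            Cell x = reconcile(new Cell(t, 8), new Cell(t, -8));   // x resolves to write 1's value: 8
            Cell y = reconcile(new Cell(t, -8), new Cell(t, 8));   // y resolves to write 2's value: 8
            System.out.println("x + y = " + (x.value + y.value)); // prints 16, not 0
        }
    }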

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira