Posted to commits@cassandra.apache.org by "Jonathan Ellis (JIRA)" <ji...@apache.org> on 2010/08/23 17:21:16 UTC

[jira] Created: (CASSANDRA-1421) An eventually consistent approach to counting

An eventually consistent approach to counting
---------------------------------------------

                 Key: CASSANDRA-1421
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1421
             Project: Cassandra
          Issue Type: New Feature
          Components: Core
            Reporter: Jonathan Ellis
             Fix For: 0.7.0


Counters may be implemented as multiple rows in a column family; that is, counters will have a configurable shard parameter, so a shard factor of 128 would give each counter 128 rows.

An increment will be a (uuid, count) name/value tuple.  The row shard will be uuid % shardfactor.  Timestamp is ignored.  This could be implemented with the existing Thrift write API, or we could add a special-case method for it.  Either is fine; the main advantage of the former is that it lets increments be included in batch mutations.

(Decrements we get for free as simply negative values.)
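
Here is a rough sketch of that write path in Java, modeling a row as an in-memory map (names are illustrative only, not from any patch; a real implementation would go through the normal Thrift insert/batch_mutate path):

    import java.util.Map;
    import java.util.UUID;
    import java.util.concurrent.ConcurrentHashMap;

    public class ShardedCounterWrite {
        static final long SHARD_FACTOR = 128;
        // rowKey -> (column uuid -> delta); each counter owns SHARD_FACTOR rows
        static final Map<String, Map<UUID, Long>> rows = new ConcurrentHashMap<>();

        static void increment(String counter, long delta) {
            UUID opId = UUID.randomUUID();  // one (uuid, count) column per increment
            long shard = Math.floorMod(opId.getLeastSignificantBits(), SHARD_FACTOR);  // "uuid % shardfactor"
            rows.computeIfAbsent(counter + ":" + shard, k -> new ConcurrentHashMap<>())
                .put(opId, delta);  // timestamp is ignored by this design
        }

        public static void main(String[] args) {
            increment("pageviews", 1);
            increment("pageviews", -1);  // decrement = negative value, as noted above
            System.out.println(rows.keySet());
        }
    }

Since each increment is keyed by its uuid, a client that retries a timed-out write with the same uuid simply overwrites the same column instead of double-counting; that is where the write idempotency claimed below comes from.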

Each node will be responsible for aggregating *the rows replicated to it* after GCGraceSeconds have elapsed.  Count aggregation will be a scheduled task on each machine.  This will require a per-shard mutex against both writes and reads.
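
A minimal model of that aggregation pass, assuming each increment column also carries its insertion time so the GCGraceSeconds cutoff can be applied (again illustrative; the real task would operate on the columns of a shard row, not a map):

    import java.util.Iterator;
    import java.util.Map;
    import java.util.UUID;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.TimeUnit;

    public class ShardAggregation {
        static final long GC_GRACE_MILLIS = TimeUnit.DAYS.toMillis(10);
        // one shard row: column uuid -> { delta, insertion time in millis }
        static final Map<UUID, long[]> shardRow = new ConcurrentHashMap<>();
        static final Object shardLock = new Object();  // the per-shard mutex
        static long aggregate = 0;  // the shard's single aggregation column

        static void aggregateShard() {
            long cutoff = System.currentTimeMillis() - GC_GRACE_MILLIS;
            synchronized (shardLock) {  // readers and writers take the same lock
                Iterator<Map.Entry<UUID, long[]>> it = shardRow.entrySet().iterator();
                while (it.hasNext()) {
                    Map.Entry<UUID, long[]> e = it.next();
                    if (e.getValue()[1] < cutoff) {  // only columns older than GCGraceSeconds
                        aggregate += e.getValue()[0];  // fold into the aggregation column
                        it.remove();  // and drop the individual increment
                    }
                }
            }
        }

        public static void main(String[] args) {
            shardRow.put(UUID.randomUUID(), new long[] { 5, 0L });  // a long-expired increment
            aggregateShard();
            System.out.println(aggregate);  // prints 5
        }
    }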

This will not have the conflict resolution problem of CASSANDRA-580, or the write fragility of CASSANDRA-1072.  Normal CL will apply on both read and write.  Write idempotency is preserved.  I expect writes will be faster than either, since no reads are required at all on the write path.  Reads will be slower, but the read overhead can be reduced by lowering GCGraceSeconds to below your repair frequency if you are okay with the durability tradeoff there (it will not be worse than CASSANDRA-1072, for instance).  More disk space will be used by this approach, but that is the cheapest resource we have.

The special-case code required will be much less than for either the 580 or 1072 approach -- primarily some code in StorageProxy to combine the uuid slices with their aggregation columns and sum them across all the shards, the local aggregation code, and minor changes to the read/write path to add the mutex against aggregation.
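
The read side, sketched with hypothetical structures (in reality StorageProxy would assemble these slices from the shard rows): a counter's value is the sum, across all shards, of the shard's aggregation column plus any not-yet-aggregated uuid columns.

    import java.util.Arrays;
    import java.util.Collections;
    import java.util.List;
    import java.util.Map;
    import java.util.UUID;

    public class CounterRead {
        static long readCounter(List<Long> shardAggregates, List<Map<UUID, Long>> shardIncrements) {
            long total = 0;
            for (int shard = 0; shard < shardAggregates.size(); shard++) {
                total += shardAggregates.get(shard);  // the already-aggregated portion
                for (long delta : shardIncrements.get(shard).values())
                    total += delta;  // plus the unaggregated (uuid, count) columns
            }
            return total;
        }

        public static void main(String[] args) {
            List<Long> aggregates = Arrays.asList(10L, 3L);
            List<Map<UUID, Long>> increments = Arrays.asList(
                    Collections.singletonMap(UUID.randomUUID(), 2L),
                    Collections.<UUID, Long>emptyMap());
            System.out.println(readCounter(aggregates, increments));  // prints 15
        }
    }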
 
We could also get rid of the Clock change and go back to i64 timestamps; if we're not going to use Clocks for increments, I don't think they have much raison d'être.  (For those of you just joining us, see http://pl.atyp.us/wordpress/?p=2601 for background.)  The CASSANDRA-1072 approach doesn't use Clocks either -- or rather, it uses Clocks but not a byte[] value, which really means the Clock is unnecessary.


[jira] Commented: (CASSANDRA-1421) An eventually consistent approach to counting

Posted by "Kevin Weil (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905092#action_12905092 ] 

Kevin Weil commented on CASSANDRA-1421:
---------------------------------------

Jonathan, "too slow" is actually going to make it completely unworkable for many folks, while the CL issues only rule out certain classes of applications.  It means you wouldn't use counters as a system of record for an advertising system counting dollars and cents, for example.  But for the applications folks have mentioned here, and the applications we're using it for at Twitter -- realtime analytics, high-volume system performance monitoring, etc. -- we're completely happy to sacrifice a few counts in a failure scenario in order to get the perf/scalability that 1072 offers.


[jira] Commented: (CASSANDRA-1421) An eventually consistent approach to counting

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901446#action_12901446 ] 

Jonathan Ellis commented on CASSANDRA-1421:
-------------------------------------------

Stu points out that sharding the rows is not inherent to the design and could be an optional part 2.


[jira] Commented: (CASSANDRA-1421) An eventually consistent approach to counting

Posted by "Johan Oskarsson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904772#action_12904772 ] 

Johan Oskarsson commented on CASSANDRA-1421:
--------------------------------------------

I'd have to agree with Sylvain: read performance and resource usage would suffer when many increments are made to the same counter. Looking at how our current production load would behave under this approach, the amount of data that would have to be read back for each counter is substantial.

Estimates show that under current load, with default settings, a row could grow to contain a few megabytes of data, and that is expected to grow along with our application load in the future.

Turning down GCGraceSeconds to a level where the reads are reasonably sized would make even a short node downtime an operational issue, due to the need to rebuild the node afterwards, per http://wiki.apache.org/cassandra/DistributedDeletes


[jira] Commented: (CASSANDRA-1421) An eventually consistent approach to counting

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902803#action_12902803 ] 

Sylvain Lebresne commented on CASSANDRA-1421:
---------------------------------------------

On the technical side of this approach, maybe I have misunderstood some of the proposal, but there is one point I'm not sure how to deal with, and it has to do with conflicts. Each replica will periodically aggregate the columns in counter rows. But what happens to the aggregate? That is, say you need to repair (read-repair, for instance) another node that hasn't aggregated the same part yet: how do you know what to keep from the replica that node sends you versus the individual columns you have?

More generally, on the approach, I'm a bit afraid of the read performance. Sure, there will be tunables, but it still puts quite a burden on reads. This will also add a new scheduled operation (which could, I'm sure, easily get CPU-intensive), one more thing that may plague a node and that people must be aware of (granted, I'm playing devil's advocate here, but I'm just trying to voice as many concerns as possible to help us make the best decision).

But hey, I agree that the CASSANDRA-1072 approach, even though it avoids the concerns I have with this particular approach, is broken because of non-idempotent writes (and I, for one, don't know how to fix that). So this proposal is probably better than nothing.

As for getting rid of Clocks: if we don't use them for increments, I suppose it depends on whether we want generic vector clocks (or version vectors). But those are only worth having if we have at least one really good use case for them, and I don't personally have one, nor can I think of one offhand.


[jira] Commented: (CASSANDRA-1421) An eventually consistent approach to counting

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904942#action_12904942 ] 

Jonathan Ellis commented on CASSANDRA-1421:
-------------------------------------------

I am open to suggestions for improving either this approach or the CASSANDRA-1072 one.

But "1421 is slow for workload X" scares me a lot less than "1072 does not allow higher write CL than ONE or idempotent retries for any workload."


[jira] Commented: (CASSANDRA-1421) An eventually consistent approach to counting

Posted by "Chris Goffinet (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905673#action_12905673 ] 

Chris Goffinet commented on CASSANDRA-1421:
-------------------------------------------

I have to agree with Kevin as well on this. Digg is in the exact same position, needing perf/scalability. We can afford to drop some counts in a failure. The compromise by Johan Oskarsson on 1072 seems like a reasonable solution IMHO.
