Posted to commits@cassandra.apache.org by "Ariel Weisberg (JIRA)" <ji...@apache.org> on 2014/12/10 22:31:13 UTC
[jira] [Comment Edited] (CASSANDRA-6060) Remove internal use of Strings for ks/cf names

    [ https://issues.apache.org/jira/browse/CASSANDRA-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241364#comment-14241364 ] 

Ariel Weisberg edited comment on CASSANDRA-6060 at 12/10/14 9:30 PM:
---------------------------------------------------------------------

I am still digging but I am not sure there is much value here.

For prepared statements between client and server there are no ks/cf names.

Here is the breakdown for a minimum size mutation inside the cluster

Size of Ethernet frame - 24 Bytes
Size of IPv4 Header (without any options) - 20 bytes
Size of TCP Header (without any options) - 20 Bytes

4-bytes protocol magic
4-bytes version
4-bytes timestamp
4-bytes verb
4-bytes parameter count
4-bytes payload length prefix
No keyspace name in current versions
2-byte key length
key say 10 bytes
4-byte mutation count

1-byte boolean
16-byte cf id
4-byte count of columns

Per column
2-byte column name length prefix
column name say 8 bytes
1-byte serialization flags
8-byte timestamp
4-byte length prefix
column value say 8 bytes

Total is 158 bytes. Saving 12 bytes on the CF uuid (a 16-byte UUID replaced by a 4-byte int) would be about 7.5%.
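
The arithmetic behind that estimate can be sketched as follows. This is only an illustration: the field names are labels made up for the sketch, the sizes are copied from the breakdown above, and the 4-byte int id is an assumption taken from the "int ids" wording in the issue description. Summing the sizes exactly as listed gives 156 bytes, 2 fewer than the 158 quoted; the sketch does not try to guess where the difference comes from, and the saving lands at roughly 7.5-7.7% either way.

```python
# Sizes in bytes, copied from the breakdown in the comment above.
# The dictionary keys are illustrative labels, not Cassandra identifiers.
FIXED = {
    "ethernet_frame": 24,
    "ipv4_header": 20,
    "tcp_header": 20,
    "protocol_magic": 4,
    "version": 4,
    "timestamp": 4,
    "verb": 4,
    "parameter_count": 4,
    "payload_length_prefix": 4,
    "key_length": 2,
    "key": 10,
    "mutation_count": 4,
}
PER_CF = {"boolean": 1, "cf_id": 16, "column_count": 4}
PER_COLUMN = {
    "name_length_prefix": 2,
    "name": 8,
    "serialization_flags": 1,
    "timestamp": 8,
    "value_length_prefix": 4,
    "value": 8,
}

UUID_BYTES = 16
INT_ID_BYTES = 4  # assumption: "int ids" means 4-byte ints


def minimal_mutation_size(cf_id_bytes=UUID_BYTES):
    """Bytes on the wire for one key, one CF, one column."""
    per_cf = sum(PER_CF.values()) - PER_CF["cf_id"] + cf_id_bytes
    return sum(FIXED.values()) + per_cf + sum(PER_COLUMN.values())


if __name__ == "__main__":
    with_uuid = minimal_mutation_size()
    with_int = minimal_mutation_size(INT_ID_BYTES)
    saved = with_uuid - with_int
    print(f"uuid: {with_uuid} B, int id: {with_int} B, "
          f"saved: {saved} B ({saved / with_uuid:.1%})")
```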

For single CF mutations this is not a win. Loading data points 16 bytes at a time isn't going to work so hot anyways so people might look into batching at that point.

The UUID is not repeated for each cell so it is a one time cost for workloads that modify multiple cells per CF. The one case where the 12-bytes becomes significant is single cell updates to multiple CFs in one mutation. There the 12-byte overhead converges on 23%.
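
The 23% figure can be checked with a small model, again only a sketch under the sizes listed in the breakdown above. It assumes that each additional CF in a mutation repeats the 1-byte boolean, the 16-byte cf id, the 4-byte column count and one column, while the network headers, message header, key and mutation count are paid once. The function name and constants are invented for the illustration.

```python
# Paid once per message: network headers, six 4-byte header fields,
# key length prefix, key, mutation count.
FIXED_BYTES = 24 + 20 + 20 + 6 * 4 + 2 + 10 + 4
# Paid once per CF: boolean, cf id (UUID), column count.
PER_CF_BYTES = 1 + 16 + 4
# Paid once per cell: name prefix, name, flags, timestamp, value prefix, value.
PER_CELL_BYTES = 2 + 8 + 1 + 8 + 4 + 8
# Assumption: a 16-byte UUID replaced by a 4-byte int id.
SAVED_PER_CF = 16 - 4


def saving_fraction(num_cfs, cells_per_cf=1):
    """Fraction of the message saved by shrinking every cf id to 4 bytes."""
    total = FIXED_BYTES + num_cfs * (PER_CF_BYTES + cells_per_cf * PER_CELL_BYTES)
    return num_cfs * SAVED_PER_CF / total


if __name__ == "__main__":
    for n in (1, 2, 10, 100, 1000):
        print(f"{n:>5} CFs, 1 cell each: {saving_fraction(n):.1%}")
    print(f"    1 CF, 100 cells:     {saving_fraction(1, 100):.1%}")
```

As the number of single-cell CFs grows, the fixed part is amortized away and the ratio approaches 12 / 52, which is where the roughly 23% limit comes from. With many cells in one CF the saving shrinks toward zero instead.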

I am going to look at the read path next, but I kind of expect to find something similar. A read is going to have key overhead and possibly overhead for all the other query parameters that should match the simple single cell mutation case.


> Remove internal use of Strings for ks/cf names
> ----------------------------------------------
>
>                 Key: CASSANDRA-6060
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6060
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Ariel Weisberg
>              Labels: performance
>
> We toss a lot of Strings around internally, including across the network.  Once a request has been Prepared, we ought to be able to encode these as int ids.
> Unfortunately, we moved from int to uuid in CASSANDRA-3794, which was a reasonable move at the time, but a uuid is a lot bigger than an int.  Now that we have CAS we can allow concurrent schema updates while still using sequential int IDs.


