Posted to commits@cassandra.apache.org by "Sylvain Lebresne (JIRA)" <ji...@apache.org> on 2013/11/13 18:11:22 UTC

[jira] [Comment Edited] (CASSANDRA-5417) Push composites support in the storage engine

    [ https://issues.apache.org/jira/browse/CASSANDRA-5417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13821243#comment-13821243 ] 

Sylvain Lebresne edited comment on CASSANDRA-5417 at 11/13/13 5:11 PM:
-----------------------------------------------------------------------

I've pushed a rebased/heavily modified version at https://github.com/pcmanus/cassandra/commits/5417-v2.

Compared to the first version, this new version cleans things up a bit but also tries to be a lot more careful about not allocating too many objects, for performance reasons. Still, the main principle is the same: make cell names custom objects (of type CellName in the patch) rather than opaque ByteBuffers, with 3 main objectives:
# to abstract away the details of the cell name layouts. With CQL3 we have 4 different layouts of cell names, depending on 1) whether the underlying comparator is a CompositeType or not and 2) whether one of the components of the cell name is used to store a CQL3 column name (and if so, which one). Currently, there is a lot of ad hoc and error-prone code all over the place to deal with those differences manually. The CellName interface of this patch encapsulates those differences a lot better imo (in the different implementations of that interface).
# to keep the components of the cell name separated in memory. Currently, because a cell name is an opaque buffer, any code concerned with only one component of the cell name has to manually re-split the byte buffer. On top of the ad hoc concern described above (every time you split a cell name, you need to worry about the different layouts), the other concern is that we may split the same cell name over and over, which has a cost. Typically, each time we compare two composite cell names we have to re-decode each component one by one. And during reads, on top of all the cell name comparisons, we also have to split the cell name to 1) count results in SliceQueryFilter and 2) process the storage engine result into a ResultSet. By keeping components separated behind the CellName interface, the attached patch avoids duplicating this work.
# keeping the components of composite names separated should make component-specific optimizations a lot easier. Typically, possible optimizations like sharing common prefixes, using ids for CQL3 column names (CASSANDRA-4175) or even, why not, per-component-type implementations that use native Java types to save allocations (use an int when the component is Int32Type), are likely to be easier to do if we can deal with individual components directly rather than with opaque ByteBuffers. Note that I'm *not* claiming that this patch does those optimizations yet, just that it very likely makes them a lot easier to implement properly.
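To give a rough idea of the interface shape described above, here is a toy sketch (names and signatures are illustrative only, not the actual patch API): components are kept separated in an array, so per-component access never has to re-split a packed buffer.
{noformat}
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch of the idea: keep the components of a cell name
// separated in memory instead of packing them into one opaque ByteBuffer.
interface SketchCellName {
    int size();            // number of components
    ByteBuffer get(int i); // i-th component, no re-splitting needed
}

// A simple composite implementation: components are stored as an array, so
// comparisons and per-component access don't re-decode a packed buffer.
class SketchCompositeName implements SketchCellName {
    private final ByteBuffer[] components;

    SketchCompositeName(ByteBuffer... components) {
        this.components = Arrays.copyOf(components, components.length);
    }

    public int size() { return components.length; }
    public ByteBuffer get(int i) { return components[i].duplicate(); }
}

public class CellNameSketch {
    static ByteBuffer bb(String s) {
        return ByteBuffer.wrap(s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // A name with two clustering components plus a CQL3 column name.
        SketchCellName name = new SketchCompositeName(bb("ck1-value"), bb("ck2-value"), bb("val"));
        // Direct access to a single component, no manual splitting.
        ByteBuffer columnName = name.get(2);
        byte[] out = new byte[columnName.remaining()];
        columnName.get(out);
        System.out.println(new String(out, StandardCharsets.UTF_8)); // prints "val"
        System.out.println(name.size()); // prints 3
    }
}
{noformat}
The real CellName implementations obviously carry more (layout knowledge, comparator integration), but the point is the same: the split happens once, not at every access.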

The patch itself is pretty big, but most of the interesting bits are in the newly added {{db.composites}} package. The rest of the patch is mainly updating the code to use those new interfaces, plus some related simplification of the CQL3 code. One interesting bit though is the new AtomDeserializer class and its use in the sstable iterators. The reason is that for composite cell names, this patch moves the work of splitting the names read from disk into their components from every time we access a component to once at deserialization time. Which is better, except that when we deserialize an index block from disk, the current code fully deserializes atoms even if they are not used afterwards by the query (either because it's a query by name and we only care about a handful of names in the block, or because the name is before the beginning of the slice we're looking for). In that case, the first version of this patch was actually doing more work, splitting/allocating all the components for names of cells the query doesn't care about, which is why Jake's test of the first version was showing slower reads. So this new version introduces the AtomDeserializer class, which is basically a lazy deserializer of OnDiskAtom (and their cell names) that makes sure we don't do unnecessary work for atoms we don't care about.
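The laziness idea can be sketched with toy code (hypothetical, much simplified; the real AtomDeserializer deals with the actual sstable format and comparators): peek at each name cheaply, and only do the expensive materialization for atoms the query actually wants.
{noformat}
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Toy sketch of lazy deserialization: skip over atoms the query doesn't
// care about without allocating/splitting their contents.
public class LazyAtomSketch {
    // Each toy "atom" on disk: a name (UTF string) followed by a value (UTF string).
    static byte[] serialize(String[][] atoms) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        for (String[] atom : atoms) { out.writeUTF(atom[0]); out.writeUTF(atom[1]); }
        return bos.toByteArray();
    }

    // Scan a block, fully deserializing only atoms whose name is >= start.
    static int countFrom(byte[] block, String start) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(block));
        int count = 0;
        while (in.available() > 0) {
            String name = in.readUTF();           // cheap look at the name
            if (name.compareTo(start) < 0) {
                int len = in.readUnsignedShort(); // before the slice start:
                in.skipBytes(len);                // skip the value, no allocation
            } else {
                in.readUTF();                     // only now do the full work
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        byte[] block = serialize(new String[][]{{"a", "1"}, {"b", "2"}, {"c", "3"}});
        System.out.println(countFrom(block, "b")); // prints 2
    }
}
{noformat}
In the patch the skipped work is the component splitting/allocation for cell names, which is exactly what made the first version slower on Jake's test.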

On the performance side I did 2 simple tests:
# I checked basic stress writes and reads (with 5M keys but default parameters otherwise). On writes, the patch is slower than trunk by about 10%. I've YourKit-ed it but I'm still not entirely sure why tbh (at least why the difference is that big). Some of my tests suggest that the MemoryMeter is taking more time and that this impacts overall performance, but that could be a red herring. For reads however, there is no noticeable difference.
# Then I did a read test similar to Jake's test above (I wasn't able to reuse his sstables because it seems trunk can't read them).  I used the following CQL3 table:
{noformat}
CREATE TABLE timeline (
    pk text,
    ck1 text,
    ck2 int,
    val bigint,
    PRIMARY KEY (pk, ck1, ck2)
)
{noformat}
I inserted 5 different {{pk}}, each having 10 values for {{ck1}} and 5K different {{ck2}} values for each {{ck1}}. The read test then requested slices of 50 rows (with random values for {{pk}}, {{ck1}} and the start of the slice ({{ck2}})). Long story short, this tests slices of wide rows with composite cell names. For that test, the attached patch outperforms trunk by more than 45% in my test.
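For reference, the slice query in that test would look something like this (reconstructed from the description above, so the exact form is an assumption):
{noformat}
SELECT * FROM timeline WHERE pk = ? AND ck1 = ? AND ck2 >= ? LIMIT 50;
{noformat}
Each such query has to compare and split many composite cell names server-side, which is where keeping the components separated pays off.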



> Push composites support in the storage engine
> ---------------------------------------------
>
>                 Key: CASSANDRA-5417
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5417
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Sylvain Lebresne
>            Assignee: Sylvain Lebresne
>              Labels: performance
>             Fix For: 2.1
>
>
> CompositeType happens to be very useful and is now widely used: CQL3 relies heavily on it, and super columns now use it internally too. Besides, CompositeType has been advised as a replacement for super columns on the thrift side for a while, so it's safe to assume it's generally used there too.
> CompositeType was initially introduced as just another AbstractType, meaning that the storage engine has no knowledge whatsoever of composites being, well, composite. This has the following drawbacks:
> * Because internally a composite value is handled as just a ByteBuffer, we end up doing a lot of extra work. Typically, each time we compare 2 composite values, we end up "deserializing" the components (which, while it doesn't copy data per se because we just slice the global ByteBuffer, still wastes some cpu cycles and allocates a bunch of ByteBuffer objects). And since compare can be called *a lot*, this is likely not negligible.
> * This makes CQL3 code uglier than necessary. Basically, CQL3 makes extensive use of composites, and since it gets back ByteBuffers from the internal columns, it always has to check whether it's actually a CompositeType or not, and then split it and pick out the different parts it needs. It's only an API problem, but having things exposed as composites directly would definitely make things cleaner. In particular, in most cases, CQL3 doesn't care whether it has a composite with only one component or a not-really-composite value, but we still always distinguish both cases. Also, if we do expose composites more directly internally, it's not a lot more work to better "internalize" the different parts of the cell name that CQL3 uses (what's the clustering key, what's the actual CQL3 column name, what's the collection element), making things cleaner. Last but not least, there are currently a bunch of places where methods take a ByteBuffer as argument and it's hard to know whether they expect a cell name or a CQL3 column name. This is pretty error prone.
> * It makes it hard (or impossible) to do a number of performance improvements. Consider CASSANDRA-4175: I'm not really sure how you can do it properly (in memory) if cell names are just ByteBuffers (since CQL3 column names are in general just one of the components). But we also miss opportunities for sharing prefixes. If we were able to share prefixes of composite names in memory, we would 1) lower the memory footprint and 2) potentially speed up comparison (of the prefixes) by checking reference equality first (also, doing prefix sharing on-disk, which is a separate concern btw, might be easier if we do prefix sharing in memory).
> So I suggest pushing CompositeType support into the storage engine. Concretely, that would mean changing the internal {{Column.name}} from ByteBuffer to some CellName type. API-wise, a CellName would just be a list of ByteBuffers. But in practice, we'd have a specific CellName implementation for not-really-composite names, and the truly composite implementation would allow some prefix sharing. From the external API however, nothing would change: we would pack the composite as usual before sending it back to the client, but at least internally, comparison won't have to deserialize the components every time, and the CQL3 code will be cleaner.



--
This message was sent by Atlassian JIRA
(v6.1#6144)