Posted to commits@cassandra.apache.org by "Michael Kjellman (JIRA)" <ji...@apache.org> on 2016/06/08 19:59:21 UTC

[jira] [Commented] (CASSANDRA-9754) Make index info heap friendly for large CQL partitions

    [ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15321332#comment-15321332 ] 

Michael Kjellman commented on CASSANDRA-9754:
---------------------------------------------

Alright, happy to finally be able to write this. :) I'm attaching a v1 diff containing Birch!

h4. Why is it named Birch?
B+ Tree -> Trees that start with the letter B -> Birch... get it? haha...

h4. Description
Birch is a B+ish/inspired tree aimed at improving the performance of the SSTable index in Cassandra (especially with large partitions).

The existing implementation scales poorly with the size of the index/row, as the entire index must be deserialized onto the heap even to look up a single element. This puts significant pressure on the heap, where one read of a large partition will, at minimum, cause a long and painful CMS GC pause or -- in the worst case -- an OOM.
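
To make the heap pressure concrete, here's a minimal sketch of the shape of that code path, using simplified stand-in types rather than the actual 2.1 classes: the full list of index entries for a partition is materialized on the heap before it can even be binary searched.

{code:java}
// Illustrative sketch only -- simplified stand-ins, not the actual Cassandra
// 2.1 classes. It shows the shape of the problem described above: to find a
// single column's offset, the whole serialized index for the partition is
// deserialized onto the heap first.
import java.io.DataInput;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

class IndexInfoSketch
{
    static final class IndexInfo
    {
        final ByteBuffer firstName;   // first column name covered by this block
        final ByteBuffer lastName;    // last column name covered by this block
        final long offset;            // position of the block in the data file
        final long width;             // length of the block

        IndexInfo(ByteBuffer firstName, ByteBuffer lastName, long offset, long width)
        {
            this.firstName = firstName;
            this.lastName = lastName;
            this.offset = offset;
            this.width = width;
        }
    }

    // A multi-GB partition can produce on the order of 100K of these entries;
    // all of them (and their ByteBuffers) are allocated just to locate one block.
    static List<IndexInfo> deserializeWholeIndex(DataInput in, int entries) throws IOException
    {
        List<IndexInfo> index = new ArrayList<>(entries);
        for (int i = 0; i < entries; i++)
        {
            ByteBuffer first = readWithShortLength(in);
            ByteBuffer last = readWithShortLength(in);
            long offset = in.readLong();
            long width = in.readLong();
            index.add(new IndexInfo(first, last, offset, width));
        }
        return index;   // binary search happens on this fully-materialized list
    }

    private static ByteBuffer readWithShortLength(DataInput in) throws IOException
    {
        byte[] bytes = new byte[in.readUnsignedShort()];
        in.readFully(bytes);
        return ByteBuffer.wrap(bytes);
    }
}
{code}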

The Birch implementation has a predictable, fixed cost for reads at the expense of additional on-disk overhead for the tree itself, while keeping the same O(log(n)) complexity as the existing implementation. Every row added to the SSTable is also added to the primary index; if the size of the row is greater than 64KB we build an index for it (otherwise we just encode the position of that row in the SSTable). All entries encoded into the index are page aligned and padded to the nearest boundary (4096 bytes by default), and every segment can be marked as either internally padded/aligned along a boundary or non-padded/aligned (up to 2GB). Birch indexes are aligned into 4096-byte nodes (both leaf and inner). Keys are encoded inside the node itself unless they exceed half the node size; in that case the first node-size/2 bytes of the key are encoded into the node along with the offset of the remaining bytes in the overflow page. This enables predictable, fixed performance of the tree while still accommodating variable-length keys/elements.
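
As a rough illustration of that encoding rule (a minimal sketch with hypothetical names, not the classes from the attached patch): a key is inlined up to node-size/2 bytes, and anything beyond that spills into the overflow page with only its offset stored in the node, so node size and lookup cost stay fixed.

{code:java}
// Minimal sketch of the key-encoding and alignment rules described above.
// Hypothetical names and layout -- an assumption for illustration, not the
// serialization format in the attached diff.
import java.io.IOException;
import java.nio.ByteBuffer;

final class NodeKeyEncodingSketch
{
    static final int NODE_SIZE = 4096;            // default page/node alignment
    static final int MAX_INLINE = NODE_SIZE / 2;  // keys longer than this overflow

    interface OverflowPage
    {
        long append(ByteBuffer remainder) throws IOException;   // returns offset of appended bytes
    }

    /**
     * Encode one key into a node buffer. Assumed layout:
     * [short inlineLength][byte hasOverflow][inline bytes][long overflowOffset, only if hasOverflow]
     */
    static void encodeKey(ByteBuffer node, ByteBuffer key, OverflowPage overflow) throws IOException
    {
        int inlineLength = Math.min(key.remaining(), MAX_INLINE);
        boolean overflows = key.remaining() > MAX_INLINE;

        node.putShort((short) inlineLength);
        node.put((byte) (overflows ? 1 : 0));

        // write the first inlineLength bytes of the key into the node
        ByteBuffer inline = key.duplicate();
        inline.limit(inline.position() + inlineLength);
        node.put(inline);

        if (overflows)
        {
            // the rest of the key lives in the overflow page; only its offset
            // is stored in the node, keeping the node a fixed 4096 bytes
            ByteBuffer remainder = key.duplicate();
            remainder.position(remainder.position() + inlineLength);
            node.putLong(overflow.append(remainder));
        }
    }

    // Pad the current write position up to the next NODE_SIZE boundary so every
    // node (leaf and inner) starts page-aligned.
    static long alignToBoundary(long position)
    {
        return (position + NODE_SIZE - 1) & ~(long) (NODE_SIZE - 1);
    }
}
{code}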

h4. Notes on v1 of the diff (in no particular order)
 * I broke the changes into two logical parts: the first abstracts out the existing Index implementation and adds no new logic; the second includes an IndexedEntry implementation backed by a Birch tree.
 * The attached v1 patch is written against 2.1. I have already started rebasing the patch onto trunk and hope to finish that shortly and post the trunk-based patch
 * There's some high-level Javadoc documentation in BirchWriter and PageAlignedWriter on the layout of the tree on disk, the serialization and deserialization paths, and the higher-level goals of the classes
 * The next step is to start getting feedback from reviewers and the community. I have also profiled the tree itself, but profiling the tree integrated into the full stack and optimizing non-performant code paths comes next (after the immediate task of rebasing the change onto trunk)
 * There are still a few TODOs I've left in regard to handling backwards compatibility, parts of the code I expect might be non-performant, and things I'd like to discuss concerning the "correct" implementation/behavior, etc.
 * I have a few unit tests that still don't pass and still need to be root-caused... I've taken the approach this entire time that the unit tests shouldn't be touched just to make them pass, so there are still a few behavioral regressions I've accidentally introduced. The current failing tests are: 
 ** AutoSavingCacheTest
 ** SecondaryIndexTest
 ** BatchlogManagerTest
 ** KeyCacheTest
 ** ScrubTest
 ** IndexSummaryManagerTest
 ** LegacySSTableTest
 ** MultiSliceTest
 * I still need to write a unit test that exercises reading the legacy/existing primary index implementation
 * By the nature of the index's role in the database, the unit test coverage is actually pretty extensive, as any read or write touches the index in some capacity

I'll be giving a talk at NGCC tomorrow (Thursday the 9th) to go over the high level design I ended up with and considerations I had to take into account once I actually got deep inside this part of the code.

Looking forward to feedback!

> Make index info heap friendly for large CQL partitions
> ------------------------------------------------------
>
>                 Key: CASSANDRA-9754
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9754
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: sankalp kohli
>            Assignee: Michael Kjellman
>            Priority: Minor
>
>  Looking at a heap dump of a 2.0 cluster, I found that the majority of the objects are IndexInfo and its ByteBuffers. This is especially bad on endpoints with large CQL partitions. If a CQL partition is, say, 6.4GB, it will have 100K IndexInfo objects and 200K ByteBuffers. This will create a lot of churn for GC. Can this be improved by not creating so many objects?


