Posted to commits@cassandra.apache.org by "Sylvain Lebresne (JIRA)" <ji...@apache.org> on 2015/08/31 18:35:45 UTC

[jira] [Commented] (CASSANDRA-10232) Small optimizations in index entry serialization

    [ https://issues.apache.org/jira/browse/CASSANDRA-10232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14723648#comment-14723648 ] 

Sylvain Lebresne commented on CASSANDRA-10232:
----------------------------------------------

Pushed [a branch|https://github.com/pcmanus/cassandra/commits/10232] that implements the following changes:
# The 1st commit uses vint encoding for index entries. This should give quite a few benefits in itself, since we currently store things like each index entry's width as an 8-byte long while it will almost always be around 64k, and we use a 4-byte int for the number of entries, which is likely often smallish.
# The 2nd commit gets rid of the offset in each index entry. Indeed, keeping both the offset (from the row start) and the width of each entry is a bit redundant: we can recompute one from the other. And since the width will yield a better vint encoding, the patch removes the offset. We do need to add, for each indexed partition, the size of the "partition header", but that's a single (small) value for each partition, and since we don't index unless we have at least 2 blocks, this will always be a net win.
# The 3rd commit is not a serialization improvement but just a minor cleanup that avoids re-creating serializer objects every time we need them.
# The 4th and last commit is a small improvement over the 1st one: it uses 64k as a base to delta-encode each entry width (since by definition each entry will be just slightly bigger than 64k). The patch actually hard-codes 64k even though users can theoretically change the index size, but that's because I didn't see a trivial way to save the actual index size used alongside each index file, and I want to keep the patches on this ticket simple enough to write/review so they can make it in 3.0. Happy to skip that commit if someone has an allergic reaction to the hard-coded number, however.
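
To make the size argument above concrete, here is a rough, self-contained sketch of the three serialization ideas (vint encoding, dropping the offset, and delta-encoding widths against 64k). It is not the code from the branch: the class and method names are hypothetical, the varint is a generic LEB128-style encoding with zigzag for signed deltas (which may well differ from the vint format Cassandra actually uses), and it assumes index blocks are contiguous with the first one starting right after the partition header.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Illustrative only: hypothetical names, not Cassandra's serializer or its vint format.
final class IndexEntrySketch {
    // Assumed base for delta-encoding widths: the default 64k index block size.
    static final long WIDTH_BASE = 64 * 1024;

    // Result holder: widths as stored, offsets as recomputed on read.
    static final class Decoded {
        final long[] offsets;
        final long[] widths;
        Decoded(long[] offsets, long[] widths) { this.offsets = offsets; this.widths = widths; }
    }

    // Zigzag maps small signed values (positive or negative) to small unsigned ones.
    static long zigzag(long v)   { return (v << 1) ^ (v >> 63); }
    static long unzigzag(long v) { return (v >>> 1) ^ -(v & 1); }

    // Generic varint: 7 payload bits per byte, high bit set when more bytes follow.
    static void writeUnsignedVInt(long value, ByteArrayOutputStream out) {
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
    }

    static long readUnsignedVInt(byte[] buf, int[] pos) {
        long result = 0;
        int shift = 0;
        while (true) {
            byte b = buf[pos[0]++];
            result |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) return result;
            shift += 7;
        }
    }

    // Writes: partition header size, entry count, then each width as a delta from 64k.
    // No per-entry offset is written.
    static byte[] encode(long headerSize, long[] widths) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeUnsignedVInt(headerSize, out);
        writeUnsignedVInt(widths.length, out);
        for (long w : widths) writeUnsignedVInt(zigzag(w - WIDTH_BASE), out);
        return out.toByteArray();
    }

    // Reads the widths back and recomputes each offset as a running sum,
    // starting at the partition header size (assumes contiguous blocks).
    static Decoded decode(byte[] buf) {
        int[] pos = {0};
        long offset = readUnsignedVInt(buf, pos);
        int count = (int) readUnsignedVInt(buf, pos);
        long[] offsets = new long[count];
        long[] widths = new long[count];
        for (int i = 0; i < count; i++) {
            widths[i] = unzigzag(readUnsignedVInt(buf, pos)) + WIDTH_BASE;
            offsets[i] = offset;
            offset += widths[i];
        }
        return new Decoded(offsets, widths);
    }

    public static void main(String[] args) {
        long[] widths = {65600, 65540, 65700};
        byte[] encoded = encode(30, widths);
        Decoded d = decode(encoded);
        // Fixed-width layout for comparison: 4-byte count + (8-byte offset + 8-byte width) per entry.
        int fixedSize = 4 + widths.length * 16;
        System.out.println("fixed-width bytes: " + fixedSize);
        System.out.println("sketch bytes:      " + encoded.length);
        System.out.println("offsets:           " + Arrays.toString(d.offsets));
    }
}
```

In this sketch a width just over 64k shrinks from 8 bytes to 1 or 2 bytes once it is written as a delta, whereas the same width written as a plain varint would still need 3 bytes, which is the gain the 4th commit is after.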

All those changes should be simple and quick to review, so hopefully we can get them into 3.0 quickly. At the very least, the 1st commit is trivial and there is no reason not to include it imo.

> Small optimizations in index entry serialization
> ------------------------------------------------
>
>                 Key: CASSANDRA-10232
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10232
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Sylvain Lebresne
>            Assignee: Sylvain Lebresne
>             Fix For: 3.0.0 rc1
>
>
> While we should improve the data structure we use for our on-disk index in future versions, it occurred to me that there are a few _very_ low-hanging-fruit optimizations (as in, for 3.0) we could make to the serialization of our current entries, like using vint encodings.


