Posted to commits@cassandra.apache.org by "Robert Stupp (JIRA)" <ji...@apache.org> on 2016/03/01 19:34:18 UTC

[jira] [Commented] (CASSANDRA-11206) Support large partitions on the 3.0 sstable format

    [ https://issues.apache.org/jira/browse/CASSANDRA-11206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15174150#comment-15174150 ] 

Robert Stupp commented on CASSANDRA-11206:
------------------------------------------

A brief outline of what I am planning ("full version"):

For partitions < 64k (partitions without an IndexInfo object) we could skip the RowIndexEntry indirection during reads entirely by extending IndexSummary to store the offset into the data file directly. (This also flattens the IndexedEntry vs. RowIndexEntry class hierarchy and removes some if-else constructs.) Maybe also use vint encoding in IndexSummary to save some space in memory and on disk (this looks possible from a brief look). Possibly also add the partition deletion time to the summary, if that is worth doing (not sure about this - it's in IndexedEntry but not in RowIndexEntry).
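
To illustrate why variable-length integers could save space for data-file offsets, here is a minimal sketch. It uses a generic LEB128-style unsigned varint, which is an assumption made for illustration and is not Cassandra's actual vint coding; the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;

// Illustrative only: a generic LEB128-style unsigned varint, NOT Cassandra's vint format.
final class VarintSketch {
    // Writes 7 bits per byte, lowest bits first; the high bit marks "more bytes follow".
    static void writeUnsignedVarint(long value, ByteArrayOutputStream out) {
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
    }

    // Reads one varint starting at pos[0] and advances pos[0] past it.
    static long readUnsignedVarint(byte[] buf, int[] pos) {
        long result = 0;
        int shift = 0;
        while (true) {
            byte b = buf[pos[0]++];
            result |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
            shift += 7;
        }
    }

    // Encodes a list of offsets back to back (hypothetical summary payload).
    static byte[] encodeOffsets(long[] offsets) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (long offset : offsets) {
            writeUnsignedVarint(offset, out);
        }
        return out.toByteArray();
    }
}
```

Small offsets take one or two bytes instead of a fixed eight, so the saving depends on how the offsets are distributed; whether it pays off for IndexSummary would need measuring.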

For other partitions, we use the offset information in IndexedEntry and read only those IndexInfo entries that the binary search actually needs. It doesn't really matter whether we are reading cold or hot data: cold data has to be read from disk anyway, and hot data should already be in the page cache.
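
The idea of decoding only the entries that the binary search touches can be sketched roughly as follows. The serialized layout (a name length, the last clustering name of the block, then the block's data offset), the separate offsets array, and all class and method names are assumptions made for this example, not the actual sstable index format.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Illustrative only: hypothetical entry layout [short nameLength][name bytes][long blockOffset].
final class LazyIndexSearch {
    // Counts decoded entries to show that a lookup touches about log2(N) of them.
    static int decodeCount = 0;

    // Serializes one entry per block and records where each entry starts in offsetsOut.
    static ByteBuffer serialize(String[] lastNames, long blockSize, int[] offsetsOut) {
        int size = 0;
        for (String name : lastNames) {
            size += 2 + name.getBytes(StandardCharsets.UTF_8).length + 8;
        }
        ByteBuffer buf = ByteBuffer.allocate(size);
        for (int i = 0; i < lastNames.length; i++) {
            offsetsOut[i] = buf.position();
            byte[] name = lastNames[i].getBytes(StandardCharsets.UTF_8);
            buf.putShort((short) name.length);
            buf.put(name);
            buf.putLong(i * blockSize);
        }
        return buf;
    }

    // Decodes only the last name of the entry at the given position.
    static String readLastName(ByteBuffer buf, int offset) {
        decodeCount++;
        int len = buf.getShort(offset);
        byte[] name = new byte[len];
        for (int i = 0; i < len; i++) {
            name[i] = buf.get(offset + 2 + i);
        }
        return new String(name, StandardCharsets.UTF_8);
    }

    static long readBlockOffset(ByteBuffer buf, int offset) {
        int len = buf.getShort(offset);
        return buf.getLong(offset + 2 + len);
    }

    // Returns the data offset of the first block whose last name is >= target, or -1 if none.
    static long findBlock(ByteBuffer buf, int[] offsets, String target) {
        int lo = 0;
        int hi = offsets.length - 1;
        int found = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (readLastName(buf, offsets[mid]).compareTo(target) >= 0) {
                found = mid;
                hi = mid - 1;
            } else {
                lo = mid + 1;
            }
        }
        return found < 0 ? -1 : readBlockOffset(buf, offsets[found]);
    }
}
```

The entries that the search never visits stay as raw bytes, so no objects are allocated for them.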

With the offset into the data file stored in the summary, we can remove the key cache.

Tests for CASSANDRA-9738 have shown that there is not much benefit in keeping the full IndexedEntry + IndexInfo structure in memory (off heap). So this ticket would supersede CASSANDRA-9738 and CASSANDRA-10320.

The downside of this approach is that it changes the on-disk format of IndexSummary, which might be an issue in 3.x, so there's a "plan B version":

* Leave IndexSummary untouched
* Remove IndexInfo from the key cache (not from the index file on disk, of course)
* Change IndexSummary and remove the whole key cache in a follow-up ticket for 4.x

/cc [~slebresne] [~aweisberg] [~iamaleksey] 

> Support large partitions on the 3.0 sstable format
> --------------------------------------------------
>
>                 Key: CASSANDRA-11206
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11206
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Jonathan Ellis
>            Assignee: Robert Stupp
>             Fix For: 3.x
>
>
> Cassandra saves a sample of IndexInfo objects that store the offset within each partition of every 64KB (by default) range of rows.  To find a row, we binary search this sample, then scan the partition of the appropriate range.
> The problem is that this scales poorly as partitions grow: on a cache miss, we deserialize the entire set of IndexInfo, which both creates a lot of GC overhead (as noted in CASSANDRA-9754) but is also non-negligible i/o activity (relative to reading a single 64KB row range) as partitions get truly large.
> We introduced an "offset map" in CASSANDRA-10314 that allows us to perform the IndexInfo bsearch while only deserializing IndexInfo that we need to compare against, i.e. log(N) deserializations.


