Posted to commits@cassandra.apache.org by "Stu Hood (JIRA)" <ji...@apache.org> on 2011/01/06 07:18:50 UTC

[jira] Issue Comment Edited: (CASSANDRA-674) New SSTable Format

    [ https://issues.apache.org/jira/browse/CASSANDRA-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12978155#action_12978155 ] 

Stu Hood edited comment on CASSANDRA-674 at 1/6/11 1:17 AM:
------------------------------------------------------------

>> Indexes for individual rows are gone, since the global index allows random access...
> ^ This wouldn't be useful to cache? in the situation you only want a small range of columns?
That information is outdated: it's from the original implementation. But yes... we will want to keep the index in app memory or page cache.

> Roughly how large would the actual chunk be? This is the unit of deserialization right?
The span is the unit of deserialization (made up of at most 1 chunk per level), and its size would be 100% configurable. The main question is how frequently to index the spans in the sstable index: does each span get an index entry, or only the first span of a row (the approach in the current implementation)?

EDIT: Sorry... the span is symbolic: you would deserialize the first chunk of the span (containing the keys) to decide whether to skip the rest of the chunks in the span.
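
A minimal sketch of that skip decision, written in Python for brevity rather than Cassandra's Java. The `Span` class, its `key_chunk` and `value_chunks` fields, and the `decode` callback are hypothetical names for illustration, not the patch's API. The point is only that the keys in the first chunk are enough to decide whether the remaining chunks need to be deserialized at all.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Span:
    """Hypothetical span: the first chunk holds keys, the rest hold payload."""
    key_chunk: List[str]        # sorted keys, already deserialized
    value_chunks: List[bytes]   # remaining chunks, still serialized


def read_span(span: Span, lo: str, hi: str,
              decode: Callable[[bytes], object]) -> Optional[list]:
    """Decode the payload chunks only if the span's keys overlap [lo, hi]."""
    if not span.key_chunk:
        return None
    first, last = span.key_chunk[0], span.key_chunk[-1]
    if last < lo or first > hi:
        return None  # no overlap: the remaining chunks are never decoded
    return [decode(chunk) for chunk in span.value_chunks]
```

A caller that wants keys in `[lo, hi]` pays for the key chunk of every span it touches, but only pays for the payload chunks of spans that can actually contain a match.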

> So if you are doing a range query on a very wide row how do you know when to stop processing chunks?
By looking at the global index: if all spans get entries in the index, you know the last interesting span.
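
As a rough illustration, assuming every span gets an index entry and that each entry records the first key of its span, the first and last interesting spans can be found with two binary searches. The function name and the flat list of keys are assumptions made for this sketch, not the format of the real index file.

```python
from bisect import bisect_right
from typing import List, Tuple


def spans_for_range(index_keys: List[str], start: str, end: str) -> Tuple[int, int]:
    """Return (first, last) positions of the spans that may hold keys in
    [start, end]. index_keys[i] is the first key of span i, in sorted order.
    If last < first, the range ends before the first span begins."""
    # The span that may contain `start` is the last one whose first key is <= start.
    first = max(bisect_right(index_keys, start) - 1, 0)
    # The last interesting span is the last one whose first key is <= end.
    last = bisect_right(index_keys, end) - 1
    return first, last
```

With index keys `["a", "f", "m", "t"]`, a query for `"g"` through `"n"` yields spans 1 and 2, so the reader can stop after span 2 without touching the span that starts at `"t"`.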

> Let me know if this is wrong, but this design opens the cassandra data model to contain arbitrarily nested data.
> Given the complexity we already have surrounding the supercolumn concept do you think this is the right way forward? 
The super column concept is only confusing _because_ we call them "supercolumns" rather than just calling them "compound column names". People use them, and the consensus I've heard is that they are useful.

> If we assume we keep the datamodel as is how can we simplify the open ended-ness of your design to make the approach fit our current data model.
The only difference is what you call the structures, and whether you put arbitrary limits on the nesting: I'm open to suggestions.
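
To make the "compound column names" reading concrete, here is a small sketch in which a supercolumn lookup is just a prefix scan over sorted tuples of (row key, name1, name2, ...). The tuple layout and the `slice_by_prefix` helper are assumptions for illustration only; nothing here limits the nesting depth, which is the open-endedness being discussed.

```python
from bisect import bisect_left
from typing import List, Tuple

# Hypothetical flattened layout: each entry is (row key, name1, name2, ...).
ColumnKey = Tuple[str, ...]


def slice_by_prefix(sorted_keys: List[ColumnKey], prefix: ColumnKey) -> List[ColumnKey]:
    """Return every key that starts with `prefix`; `sorted_keys` must be sorted."""
    # A prefix sorts before any longer tuple that extends it.
    start = bisect_left(sorted_keys, prefix)
    out = []
    for key in sorted_keys[start:]:
        if key[:len(prefix)] != prefix:
            break  # sorted order: no later key can match
        out.append(key)
    return out
```

Under this view, `("row1", "sc1")` selects what the current model calls a supercolumn, and `("row1",)` selects the whole row, with the same code path for both.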

> New SSTable Format
> ------------------
>
>                 Key: CASSANDRA-674
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-674
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Stu Hood
>             Fix For: 0.8
>
>         Attachments: 674-v1.diff, perf-674-v1.txt, perf-trunk-2f3d2c0e4845faf62e33c191d152cb1b3fa62806.txt
>
>
> Various tickets exist due to limitations in the SSTable file format, including #16, #47 and #328. Attached is a proposed design/implementation of a new file format for SSTables that addresses a few of these limitations. The implementation has a bunch of issues/fixmes, which I'll describe in the comments.
> The file format is described in the javadoc for the o.a.c.io.SSTableWriter class, but briefly:
>  * Blocks are opaque (except for their header) so that they can be compressed. The index file contains an entry for the first key in every Block. Blocks contain Slices.
>  * Slices are series of columns with the same parents and (deletion) metadata. They can be used to represent ColumnFamilies or SuperColumns (or a slice of columns at any other depth). A single CF can be split across multiple Slices, which can be split across multiple blocks.
>  * Neither Slices nor Blocks have a fixed size or maximum length, but they each have target lengths which can be stretched and broken by very large columns.
> The most interesting concepts from this patch are:
>  * Block compression is possible (currently using GZIP, which has one bug mentioned in the comments).
>  * Compaction involves merging intersecting Slices from input SSTables. Since large rows will be broken down into multiple slices, only the portions of rows that intersect between tables need to be deserialized/merged/held-in-memory.
>  * Indexes for individual rows are gone, since the global index allows random access to the middle of column families that span Blocks, and Slices allow batches of columns to be skipped within a Block.
>  * Bloom filters for individual rows are gone, and the global filter contains ColumnKeys instead, meaning that a query for a column that doesn't exist in a row that does will often not need to seek to the row.
>  * Metadata (deletion/gc time) and ColumnKeys (key, colname1, colname2...) for columns are defined recursively, so deeply nested slices are possible.
>  * Slices representing a single parent (CF, SC, etc) can have different Metadata, meaning that a tombstone Slice from d-f could sit between Slices containing columns a-c and g-h. This allows for eventually consistent range deletes of columns.
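
The last point can be sketched as follows, in Python rather than the patch's Java. The `Slice` class, its inclusive name bounds, the `deleted_at` field, and the rule that a column is shadowed when its timestamp is less than or equal to the newest covering tombstone are all assumptions made for illustration; the patch itself may resolve ties differently.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Slice:
    """Hypothetical slice: a column-name range sharing one deletion timestamp."""
    start: str
    end: str                                   # inclusive bounds on column names
    deleted_at: Optional[int] = None           # None means "not a tombstone"
    columns: Dict[str, Tuple[str, int]] = field(default_factory=dict)  # name -> (value, timestamp)


def read_column(slices: List[Slice], name: str) -> Optional[str]:
    """Return the live value for `name`, or None if absent or shadowed."""
    newest_delete = -1   # assumes non-negative timestamps
    newest = None        # (value, timestamp) of the newest version seen
    for s in slices:
        if not (s.start <= name <= s.end):
            continue
        if s.deleted_at is not None:
            newest_delete = max(newest_delete, s.deleted_at)
        col = s.columns.get(name)
        if col is not None and (newest is None or col[1] > newest[1]):
            newest = col
    if newest is None or newest[1] <= newest_delete:
        return None
    return newest[0]
```

Because the tombstone is just another Slice with its own range and timestamp, slices from different SSTables can be merged in any order and still agree on which columns in d-f survive.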
