Posted to commits@cassandra.apache.org by "Stu Hood (JIRA)" <ji...@apache.org> on 2010/01/06 08:33:54 UTC

[jira] Created: (CASSANDRA-674) New SSTable Format

New SSTable Format
------------------

                 Key: CASSANDRA-674
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-674
             Project: Cassandra
          Issue Type: Improvement
          Components: Core
    Affects Versions: 0.9
            Reporter: Stu Hood
            Assignee: Stu Hood
             Fix For: 0.9


Various tickets exist due to limitations in the SSTable file format, including #16, #47 and #328. Attached is a proposed design/implementation of a new file format for SSTables that addresses a few of these limitations. The implementation has a bunch of issues/fixmes, which I'll describe in the comments.

The file format is described in the javadoc for the o.a.c.io.SSTableWriter class, but briefly:
 * Blocks are opaque (except for their header) so that they can be compressed. The index file contains an entry for the first key in every Block. Blocks contain Slices.
 * Slices are series of columns with the same parents and (deletion) metadata. They can be used to represent ColumnFamilies or SuperColumns (or a slice of columns at any other depth). A single CF can be split across multiple Slices, which can be split across multiple blocks.
 * Neither Slices nor Blocks have a fixed size or maximum length, but they each have target lengths which can be stretched and broken by very large columns.
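As a rough illustration of the "one index entry per Block" idea above, here is a minimal sketch. The index contents, the offsets, and the name find_block_offset are hypothetical, and keys are assumed to sort as plain strings (the real ordering would depend on the partitioner).

```python
import bisect

# Hypothetical index: one (first_key, file_offset) entry per Block, sorted by key.
block_index = [("apple", 0), ("mango", 4096), ("peach", 9000)]
first_keys = [key for key, _ in block_index]


def find_block_offset(key):
    """Return the offset of the only Block that could hold `key`, or None."""
    # The candidate Block is the last one whose first key is <= the wanted key.
    i = bisect.bisect_right(first_keys, key) - 1
    if i < 0:
        return None  # key sorts before the first Block in the file
    return block_index[i][1]
```

Because Blocks are opaque, the lookup can only say which Block to open; finding the key inside it still means reading the Block itself.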

The most interesting concepts from this patch are:
 * Block compression is possible (currently using GZIP, which has one bug mentioned in the comments).
 * Compaction involves merging intersecting Slices from input SSTables. Since large rows will be broken down into multiple Slices, only the portions of rows that intersect between tables need to be deserialized, merged and held in memory.
 * Indexes for individual rows are gone, since the global index allows random access to the middle of column families that span Blocks, and Slices allow batches of columns to be skipped within a Block.
 * Bloom filters for individual rows are gone, and the global filter contains ColumnKeys instead, meaning that a query for a nonexistent column in an existing row will often not need to seek to the row.
 * Metadata (deletion/gc time) and ColumnKeys (key, colname1, colname2...) for columns are defined recursively, so deeply nested Slices are possible.
 * Slices representing a single parent (CF, SC, etc) can have different Metadata, meaning that a tombstone Slice from d-f could sit between Slices containing columns a-c and g-h. This allows for eventually consistent range deletes of columns.
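The tombstone-Slice idea in the last bullet can be sketched roughly as follows. This is not the patch's implementation: the Slice fields, the live_columns helper, and the rule that a column written at exactly the tombstone's timestamp counts as deleted are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Slice:
    start: str            # first column name covered (inclusive)
    end: str              # last column name covered (inclusive)
    deleted_at: int = -1  # range-tombstone timestamp; -1 means "not deleted"
    columns: dict = field(default_factory=dict)  # name -> (value, timestamp)


def live_columns(slices):
    """Merge Slices of one parent and drop columns shadowed by a tombstone Slice."""
    merged = {}
    for s in slices:
        for name, (value, ts) in s.columns.items():
            # Keep only the newest version of each column.
            if name not in merged or ts > merged[name][1]:
                merged[name] = (value, ts)
    live = {}
    for name, (value, ts) in merged.items():
        # Newest deletion time among all Slices whose range covers this name.
        deleted_at = max(
            (s.deleted_at for s in slices if s.start <= name <= s.end),
            default=-1,
        )
        if ts > deleted_at:
            live[name] = value
    return live


older = [Slice("a", "h", columns={"b": ("1", 1), "e": ("2", 1), "g": ("3", 1)})]
newer = [
    Slice("a", "c", columns={"a": ("4", 5)}),
    Slice("d", "f", deleted_at=5),            # tombstone Slice covering d-f
    Slice("g", "h", columns={"h": ("5", 5)}),
]
result = live_columns(older + newer)
```

Here the older column "e" falls inside the d-f tombstone and disappears, while columns outside that range survive the merge.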

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (CASSANDRA-674) New SSTable Format

Posted by "David Strauss (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839395#action_12839395 ] 

David Strauss commented on CASSANDRA-674:
-----------------------------------------

This is a good opportunity to improve get_count() performance. Currently, it is O(n) at call-time, where n is the number of columns being counted. I discussed the issue with Stu on IRC, and he mentioned that a "mini-merge" happens at call-time across the SSTables storing data for a column, which makes it difficult to maintain counts.

Instead of counting all columns, we could maintain and use column counts in the oldest SSTable and "repair" the relevant counts at get_count() call-time with the changes found in the newer SSTables. That would allow calls to get_count() to run in O(m) time, where m is the number of columns being counted in *all but the oldest SSTable*. (Granted, m can approach n on high write volume, but m can never exceed n.)

For stable data, this would bring get_count() to near constant-time with performance gradually degrading depending on the number of non-oldest SSTables.

(Note: I'm probably missing a multiplier in my big-O notation for looking up columns in older SSTables to detect intersections.)
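For reference, the call-time "mini-merge" that makes counting O(n) can be sketched like this. The fragment layout (column name mapped to a timestamp and a tombstone flag) and the function name are hypothetical simplifications, not Cassandra's actual structures.

```python
def count_live_columns(fragments):
    """Count live columns in one row by merging the row's fragment from every SSTable."""
    newest = {}
    for fragment in fragments:
        for name, (ts, is_tombstone) in fragment.items():
            # The newest timestamp wins, whether it is a write or a delete.
            if name not in newest or ts > newest[name][0]:
                newest[name] = (ts, is_tombstone)
    return sum(1 for _, is_tombstone in newest.values() if not is_tombstone)


fragments = [
    {"a": (1, False), "b": (1, False)},  # fragment from one SSTable
    {"b": (2, True), "c": (2, False)},   # a later fragment deletes b and adds c
]
```

Every column in every fragment has to be visited before the count is known, which is the cost the proposal above tries to avoid.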

> New SSTable Format
> ------------------
>
>                 Key: CASSANDRA-674
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-674
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Stu Hood
>            Assignee: Stu Hood
>             Fix For: 0.7
>
>         Attachments: 674-v1.diff, perf-674-v1.txt, perf-trunk-2f3d2c0e4845faf62e33c191d152cb1b3fa62806.txt
>
>


[jira] Commented: (CASSANDRA-674) New SSTable Format

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797451#action_12797451 ] 

Jonathan Ellis commented on CASSANDRA-674:
------------------------------------------

What kind of performance do you get on reads and writes with stress.py vs the old code?  (without compression, to compare apples to apples)

Note that stress.py uses very narrow rows so it's pretty much a best-case scenario for this approach; we should test with much wider rows, too.

> New SSTable Format
> ------------------
>
>                 Key: CASSANDRA-674
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-674
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.9
>            Reporter: Stu Hood
>            Assignee: Stu Hood
>             Fix For: 0.9
>
>         Attachments: 674-v1.diff
>
>


[jira] Commented: (CASSANDRA-674) New SSTable Format

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839425#action_12839425 ] 

Jonathan Ellis commented on CASSANDRA-674:
------------------------------------------

There's no such thing as "the oldest sstable," and even if there were, there is no way to know which columns need to increase the count without actually doing the full merge as we do currently.

Consider a hypothetical oldest sstable with a row whose count you have set to 10. There is another sstable fragment with column A in that row. Is A an update to the original 10, or a new insert? You have no way of knowing.

"Count is slow" is one of the tradeoffs we make for having super fast writes (no update-in-place) and snapshotting. It's the right tradeoff, but there's no magic wand to make it a free lunch.
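The ambiguity can be made concrete with a small sketch. The column names and the helper true_count are made up for illustration: both scenarios look identical to a reader that only knows the stored count (10) and the new fragment (column A), yet the correct answers differ.

```python
import string


def true_count(old_columns, new_columns):
    """The correct count requires the full merge: a union of column names."""
    return len(set(old_columns) | set(new_columns))


new_fragment = {"A"}

# Scenario 1: the old row already holds A, so the new A is an update.
old_with_a = set(string.ascii_uppercase[:10])      # A..J, stored count 10
# Scenario 2: the old row does not hold A, so the new A is an insert.
old_without_a = set(string.ascii_uppercase[1:11])  # B..K, stored count 10

count_if_update = true_count(old_with_a, new_fragment)
count_if_insert = true_count(old_without_a, new_fragment)
```

Any "repair" that sees only the number 10 and the fragment {A} must return the same value in both scenarios, so it is wrong in at least one of them unless it looks up A in the older data.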

> New SSTable Format
> ------------------
>
>                 Key: CASSANDRA-674
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-674
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Stu Hood
>            Assignee: Stu Hood
>             Fix For: 0.7
>
>         Attachments: 674-v1.diff, perf-674-v1.txt, perf-trunk-2f3d2c0e4845faf62e33c191d152cb1b3fa62806.txt
>
>


[jira] Commented: (CASSANDRA-674) New SSTable Format

Posted by "Kevin Weil (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804246#action_12804246 ] 

Kevin Weil commented on CASSANDRA-674:
--------------------------------------

I haven't worked enough with Avro to be sure, but my understanding is that the metadata block can be made pretty lightweight.  As I understand it, it exists more for Avro schema resolution than to minimize the number of files.  It'd be nice if you could even instruct Avro not to put the schema in the metadata for known, generated schemas, though I don't know whether that's possible.  Either way, it doesn't mandate that the indices be stored in the metadata.  Agreed that things are nicer when different types of data are in their own files.

> New SSTable Format
> ------------------
>
>                 Key: CASSANDRA-674
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-674
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Stu Hood
>            Assignee: Stu Hood
>             Fix For: 0.7
>
>         Attachments: 674-v1.diff, perf-674-v1.txt, perf-trunk-2f3d2c0e4845faf62e33c191d152cb1b3fa62806.txt
>
>


[jira] Commented: (CASSANDRA-674) New SSTable Format

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835095#action_12835095 ] 

Stu Hood commented on CASSANDRA-674:
------------------------------------

> it would be easier to skip from one group to another w/ the "slice" indexes next to each other
Since the block might be compressed, you can't assume random access to the whole data file: you might have to scan the block from the beginning anyway. So indexing externally to the data file at a resolution higher than blocks is of questionable value.
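A small sketch of why an offset into a compressed block is of limited use: a DEFLATE stream has to be inflated from its start. It uses Python's zlib rather than the GZIP classes in the patch, and the record layout (newline-separated names) and the function read_record are assumptions for illustration.

```python
import zlib

# Hypothetical block: 1000 newline-separated records, compressed as one unit.
records = [f"column-{i}".encode() for i in range(1000)]
raw_block = b"\n".join(records)
compressed_block = zlib.compress(raw_block)


def read_record(compressed, wanted_index):
    """Return (record, bytes_inflated), inflating from the block start as needed."""
    inflater = zlib.decompressobj()
    buffer = b""
    for i in range(0, len(compressed), 64):
        buffer += inflater.decompress(compressed[i:i + 64])
        lines = buffer.split(b"\n")
        # The wanted record is complete once a later record has started.
        if len(lines) > wanted_index + 1:
            return lines[wanted_index], len(buffer)
    buffer += inflater.flush()
    return buffer.split(b"\n")[wanted_index], len(buffer)


record, bytes_inflated = read_record(compressed_block, 900)
```

Reaching record 900 means inflating everything before it, so an external index finer than one entry per block cannot save that work.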

> New SSTable Format
> ------------------
>
>                 Key: CASSANDRA-674
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-674
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Stu Hood
>            Assignee: Stu Hood
>             Fix For: 0.7
>
>         Attachments: 674-v1.diff, perf-674-v1.txt, perf-trunk-2f3d2c0e4845faf62e33c191d152cb1b3fa62806.txt
>
>


[jira] Updated: (CASSANDRA-674) New SSTable Format

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-674:
-------------------------------------

    Affects Version/s:     (was: 0.6)
                       0.7

> New SSTable Format
> ------------------
>
>                 Key: CASSANDRA-674
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-674
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.7
>            Reporter: Stu Hood
>            Assignee: Stu Hood
>             Fix For: 0.6
>
>         Attachments: 674-v1.diff, perf-674-v1.txt, perf-trunk-2f3d2c0e4845faf62e33c191d152cb1b3fa62806.txt
>
>


[jira] Commented: (CASSANDRA-674) New SSTable Format

Posted by "Ryan King (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803386#action_12803386 ] 

Ryan King commented on CASSANDRA-674:
-------------------------------------

This is why the sync markers from Avro would be useful. If you get bitrot, you'll only lose the block with the rot in it.
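To illustrate the recovery property, here is a sketch of a marker-delimited file. The layout (marker, length, CRC32, payload) and the marker value are hypothetical and are not Avro's actual container format; the point is only that a reader can resynchronize on the next marker after a damaged block.

```python
import struct
import zlib

SYNC = bytes(range(16))  # hypothetical fixed 16-byte sync marker


def write_blocks(payloads):
    """Serialize each payload as: SYNC, length, CRC32, payload."""
    out = bytearray()
    for payload in payloads:
        out += SYNC + struct.pack(">II", len(payload), zlib.crc32(payload)) + payload
    return bytes(out)


def read_blocks(data):
    """Return every payload whose checksum verifies, skipping damaged blocks."""
    good = []
    pos = data.find(SYNC)
    while pos != -1:
        header_end = pos + len(SYNC) + 8
        next_search = pos + 1  # on damage, rescan for the next marker
        if header_end <= len(data):
            length, crc = struct.unpack(">II", data[pos + len(SYNC):header_end])
            payload = data[header_end:header_end + length]
            if len(payload) == length and zlib.crc32(payload) == crc:
                good.append(payload)
                next_search = header_end + length
        pos = data.find(SYNC, next_search)
    return good


intact = write_blocks([b"block-0", b"block-1", b"block-2"])
damaged = bytearray(intact)
damaged[intact.index(b"block-1")] ^= 0xFF  # simulate one rotted byte
recovered = read_blocks(bytes(damaged))
```

Only the block containing the flipped byte is lost; the blocks on either side are still readable.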

> New SSTable Format
> ------------------
>
>                 Key: CASSANDRA-674
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-674
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Stu Hood
>            Assignee: Stu Hood
>             Fix For: 0.7
>
>         Attachments: 674-v1.diff, perf-674-v1.txt, perf-trunk-2f3d2c0e4845faf62e33c191d152cb1b3fa62806.txt
>
>


[jira] Commented: (CASSANDRA-674) New SSTable Format

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797450#action_12797450 ] 

Jonathan Ellis commented on CASSANDRA-674:
------------------------------------------

This approach is fine for proof of concept, but be aware that any sstable format change that we actually commit is going to need to support reading the old version.  So ultimately what a patch set like this needs to look like is

 00: provide APIs that CFS et al can use to read data from either old or new versions (e.g. getScanner, getFileDataInput); probably you will end up with an AbstractSSTableReader class with common functionality like getColumnComparator
 01: refactor old SSTR class and callers to use the new API
 02: introduce the new data file format in separate classes

Splitting it up like this will also make it much easier to rebase against the moving target of trunk (and there is enough missing here that it looks like it's going to need to be rebased for a while).
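The shape of that split can be sketched as below. This is a Python analogy rather than the Java classes that would actually be involved: AbstractSSTableReader, getScanner and getColumnComparator are the names from the plan above (rendered in snake_case), while the two concrete readers and their in-memory "formats" are hypothetical stand-ins.

```python
from abc import ABC, abstractmethod


class AbstractSSTableReader(ABC):
    """Common API that callers use regardless of the on-disk version."""

    def __init__(self, comparator=None):
        # Identity sort key by default; stands in for the column comparator.
        self._comparator = comparator or (lambda name: name)

    def get_column_comparator(self):
        return self._comparator

    @abstractmethod
    def get_scanner(self):
        """Yield (key, column, value) tuples in sorted order."""


class LegacySSTableReader(AbstractSSTableReader):
    def __init__(self, rows):
        super().__init__()
        self._rows = rows  # stand-in for the old format: {key: {column: value}}

    def get_scanner(self):
        for key in sorted(self._rows):
            columns = self._rows[key]
            for name in sorted(columns, key=self.get_column_comparator()):
                yield key, name, columns[name]


class BlockSSTableReader(AbstractSSTableReader):
    def __init__(self, blocks):
        super().__init__()
        self._blocks = blocks  # stand-in for the new format: lists of tuples

    def get_scanner(self):
        for block in self._blocks:
            yield from block


def count_columns(reader):
    """A caller written once against the shared API."""
    return sum(1 for _ in reader.get_scanner())


old_reader = LegacySSTableReader({"k1": {"b": 2, "a": 1}, "k2": {"c": 3}})
new_reader = BlockSSTableReader([[("k1", "a", 1), ("k1", "b", 2)], [("k2", "c", 3)]])
```

Callers such as count_columns never learn which format they are reading, which is what lets the new format land in its own classes.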


> New SSTable Format
> ------------------
>
>                 Key: CASSANDRA-674
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-674
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.9
>            Reporter: Stu Hood
>            Assignee: Stu Hood
>             Fix For: 0.9
>
>         Attachments: 674-v1.diff
>
>


[jira] Commented: (CASSANDRA-674) New SSTable Format

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804283#action_12804283 ] 

Jonathan Ellis commented on CASSANDRA-674:
------------------------------------------

But if we're just going to use the Avro format as a "container" for non-Avro data, I don't see the point.

> New SSTable Format
> ------------------
>
>                 Key: CASSANDRA-674
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-674
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Stu Hood
>            Assignee: Stu Hood
>             Fix For: 0.7
>
>         Attachments: 674-v1.diff, perf-674-v1.txt, perf-trunk-2f3d2c0e4845faf62e33c191d152cb1b3fa62806.txt
>
>


[jira] Issue Comment Edited: (CASSANDRA-674) New SSTable Format

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798919#action_12798919 ] 

Stu Hood edited comment on CASSANDRA-674 at 1/11/10 11:15 PM:
--------------------------------------------------------------

I'm marking this one as blocked by 389, because we had a good head start on adding backward compatible sstable versioning there.

Once versioning is merged, next steps will be extracting abstract base classes for SSTableReader and SSTableScanner, and extending them with the SSTable format in trunk, and the format in 674-v1.

      was (Author: stuhood):
    I marking this one as blocked by 389, because we had a good head start on adding backward compatible sstable versioning there.

Once versioning is merged, next steps will be extracting abstract base classes for SSTableReader and SSTableScanner, and extending them with the SSTable format in trunk, and the format in 674-v1.
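The plan described above (one abstract base class, extended once per on-disk format and selected by version) might look roughly like the following sketch. The class names, the single-character version tags "a" and "b", and the map-backed readers are all hypothetical stand-ins; none of this is the real SSTableReader or SSTableScanner API.

```java
import java.util.HashMap;
import java.util.Map;

final class VersionedReaderSketch {
    /** Hypothetical common base class; callers depend only on this type. */
    static abstract class Reader {
        abstract String formatName();

        /** Returns the value stored for key, or null if the key is absent. */
        abstract String get(String key);
    }

    /** Stand-in for the format currently in trunk. */
    static final class TrunkReader extends Reader {
        private final Map<String, String> rows;

        TrunkReader(Map<String, String> rows) { this.rows = rows; }

        String formatName() { return "trunk"; }

        String get(String key) { return rows.get(key); }
    }

    /** Stand-in for the Block/Slice format proposed in 674-v1. */
    static final class BlockReader extends Reader {
        private final Map<String, String> rows;

        BlockReader(Map<String, String> rows) { this.rows = rows; }

        String formatName() { return "674-v1"; }

        String get(String key) { return rows.get(key); }
    }

    /** Picks an implementation from a version tag; the tags are assumptions of this sketch. */
    static Reader open(String version, Map<String, String> rows) {
        switch (version) {
            case "a": return new TrunkReader(rows);
            case "b": return new BlockReader(rows);
            default: throw new IllegalArgumentException("unknown sstable version: " + version);
        }
    }

    public static void main(String[] args) {
        Map<String, String> rows = new HashMap<>();
        rows.put("key1", "value1");
        Reader reader = open("b", rows);
        System.out.println(reader.formatName() + " -> " + reader.get("key1"));
    }
}
```

The point of the split is that compaction and read paths only ever see the base type, so a new format can be added without touching them.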
  
> New SSTable Format
> ------------------
>
>                 Key: CASSANDRA-674
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-674
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.9
>            Reporter: Stu Hood
>            Assignee: Stu Hood
>             Fix For: 0.9
>
>         Attachments: 674-v1.diff, perf-674-v1.txt, perf-trunk-2f3d2c0e4845faf62e33c191d152cb1b3fa62806.txt
>
>


[jira] Commented: (CASSANDRA-674) New SSTable Format

Posted by "Ryan King (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800257#action_12800257 ] 

Ryan King commented on CASSANDRA-674:
-------------------------------------

stu-

I understand that the overlap is coincidental; I'm just hoping to encourage cooperation where possible. I certainly have a personal bias here, because I'd like to move our infrastructure to using a common data serialization across our online (Cassandra) and offline (Hadoop) storage. That's not to say that we couldn't make the integration work, but it seems like some awesome things could happen when everyone is using the same data format.

> New SSTable Format
> ------------------
>
>                 Key: CASSANDRA-674
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-674
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.9
>            Reporter: Stu Hood
>            Assignee: Stu Hood
>             Fix For: 0.9
>
>         Attachments: 674-v1.diff, perf-674-v1.txt, perf-trunk-2f3d2c0e4845faf62e33c191d152cb1b3fa62806.txt
>
>


[jira] Commented: (CASSANDRA-674) New SSTable Format

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797516#action_12797516 ] 

Jonathan Ellis commented on CASSANDRA-674:
------------------------------------------

If you're seeing exactly equal times on insert (is that what this says?), you're probably not doing enough compactions :)

> New SSTable Format
> ------------------
>
>                 Key: CASSANDRA-674
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-674
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.9
>            Reporter: Stu Hood
>            Assignee: Stu Hood
>             Fix For: 0.9
>
>         Attachments: 674-v1.diff, perf-674-v1.txt, perf-trunk-2f3d2c0e4845faf62e33c191d152cb1b3fa62806.txt
>
>


[jira] Commented: (CASSANDRA-674) New SSTable Format

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830504#action_12830504 ] 

Stu Hood commented on CASSANDRA-674:
------------------------------------

I've extracted the current interfaces for SSTableReader and SSTableScanner, and I'm going to start modifying the interfaces to be closer to the original 674-v1 patch, which should take a week or so. Then, if everyone is happy with the outcome and satisfied that we'll be able to maintain the interface for a few versions, we can get that interface merged and start thinking about the format again.

> New SSTable Format
> ------------------
>
>                 Key: CASSANDRA-674
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-674
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Stu Hood
>            Assignee: Stu Hood
>             Fix For: 0.7
>
>         Attachments: 674-v1.diff, perf-674-v1.txt, perf-trunk-2f3d2c0e4845faf62e33c191d152cb1b3fa62806.txt
>
>


[jira] Updated: (CASSANDRA-674) New SSTable Format

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-674:
-------------------------------------

        Fix Version/s:     (was: 0.6)
                       0.7
    Affects Version/s:     (was: 0.7)

> New SSTable Format
> ------------------
>
>                 Key: CASSANDRA-674
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-674
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Stu Hood
>            Assignee: Stu Hood
>             Fix For: 0.7
>
>         Attachments: 674-v1.diff, perf-674-v1.txt, perf-trunk-2f3d2c0e4845faf62e33c191d152cb1b3fa62806.txt
>
>


[jira] Commented: (CASSANDRA-674) New SSTable Format

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804179#action_12804179 ] 

Jonathan Ellis commented on CASSANDRA-674:
------------------------------------------

Unfortunately but not surprisingly, the Avro format is a poor fit for Cassandra.  It appears to be designed for HDFS, where having multiple files is expensive, so metadata (such as Cassandra indexes) is stored in the same file as object data, after the normal blocks it describes.

This is how Cassandra did things back in 0.3, following the Bigtable model, and it is lousy for us because you have to save up the index in memory as you write data out; since Cassandra sstables are not bounded, you can easily OOM doing this, which is why in 0.4 we moved to a separate index file.  (Additionally, the code is simpler and cleaner when you split each type of data into its own file.)
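The memory argument above can be made concrete with a small sketch: when the index goes to its own stream, each entry is written as soon as the row is, so nothing accumulates in memory however large the sstable grows. The `Writer` class and the byte layout (2-byte key length, key, 8-byte offset) are assumptions made for illustration, not Cassandra's actual on-disk format, and this sketch indexes every key rather than one key per Block.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class SeparateIndexSketch {
    /** Hypothetical writer that streams rows and index entries to two separate outputs. */
    static final class Writer {
        private final DataOutputStream data;
        private final DataOutputStream index;
        private long position = 0; // byte offset of the next row in the data stream

        Writer(OutputStream dataOut, OutputStream indexOut) {
            this.data = new DataOutputStream(dataOut);
            this.index = new DataOutputStream(indexOut);
        }

        /** Appends one row and immediately writes its index entry; no per-key state is kept. */
        void append(String key, byte[] value) {
            byte[] k = key.getBytes(StandardCharsets.UTF_8);
            try {
                // Index entry: key length, key bytes, offset of the row in the data stream.
                index.writeShort(k.length);
                index.write(k);
                index.writeLong(position);
                // Data row: key length, key bytes, value length, value bytes.
                data.writeShort(k.length);
                data.write(k);
                data.writeInt(value.length);
                data.write(value);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            position += 2L + k.length + 4L + value.length;
        }

        long position() { return position; }
    }

    public static void main(String[] args) {
        ByteArrayOutputStream dataBytes = new ByteArrayOutputStream();
        ByteArrayOutputStream indexBytes = new ByteArrayOutputStream();
        Writer writer = new Writer(dataBytes, indexBytes);
        writer.append("k1", new byte[3]);
        writer.append("k2", new byte[5]);
        System.out.println("data bytes: " + dataBytes.size() + ", index bytes: " + indexBytes.size());
    }
}
```

With a single-file layout where the index follows the data, the same writer would instead have to hold every (key, offset) pair until the last row was written.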

> New SSTable Format
> ------------------
>
>                 Key: CASSANDRA-674
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-674
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Stu Hood
>            Assignee: Stu Hood
>             Fix For: 0.7
>
>         Attachments: 674-v1.diff, perf-674-v1.txt, perf-trunk-2f3d2c0e4845faf62e33c191d152cb1b3fa62806.txt
>
>


[jira] Commented: (CASSANDRA-674) New SSTable Format

Posted by "Ryan King (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799401#action_12799401 ] 

Ryan King commented on CASSANDRA-674:
-------------------------------------

I haven't had a chance to look through this patch very closely, so forgive me if this is a dumb suggestion, but it seems that there's a degree of overlap with Avro's object container files: http://hadoop.apache.org/avro/docs/current/spec.html#Object+Container+Files. Have we looked at those at all?

> New SSTable Format
> ------------------
>
>                 Key: CASSANDRA-674
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-674
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.9
>            Reporter: Stu Hood
>            Assignee: Stu Hood
>             Fix For: 0.9
>
>         Attachments: 674-v1.diff, perf-674-v1.txt, perf-trunk-2f3d2c0e4845faf62e33c191d152cb1b3fa62806.txt
>
>


[jira] Commented: (CASSANDRA-674) New SSTable Format

Posted by "David Strauss (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839440#action_12839440 ] 

David Strauss commented on CASSANDRA-674:
-----------------------------------------

@jbellis Sorry, it seems that I was confusing the commit logs (where there can only be one receiving writes on each node to avoid seeks) with the SSTable files (where multiple ones may be receiving writes on each node).

> New SSTable Format
> ------------------
>
>                 Key: CASSANDRA-674
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-674
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Stu Hood
>            Assignee: Stu Hood
>             Fix For: 0.7
>
>         Attachments: 674-v1.diff, perf-674-v1.txt, perf-trunk-2f3d2c0e4845faf62e33c191d152cb1b3fa62806.txt
>
>


[jira] Commented: (CASSANDRA-674) New SSTable Format

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800520#action_12800520 ] 

Jonathan Ellis commented on CASSANDRA-674:
------------------------------------------

Mental note: when we change the sstable format, let's take advantage of the opportunity to restrict key lengths to 64K (i.e., a length that fits in 16 bits).
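As a rough illustration of what a 16-bit key length implies, the sketch below writes a key with a two-byte length prefix and rejects anything longer than 65535 bytes. The `encode`/`decode` helpers and the exact layout are hypothetical; only the `java.io` calls are real.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class KeyLengthSketch {
    /** Largest length representable in an unsigned 16-bit field. */
    static final int MAX_KEY_LENGTH = 0xFFFF;

    /** Encodes a key as a 2-byte length prefix followed by the key bytes. */
    static byte[] encode(byte[] key) {
        if (key.length > MAX_KEY_LENGTH) {
            throw new IllegalArgumentException("key too long: " + key.length + " bytes");
        }
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeShort(key.length); // writes the low 16 bits of the length
            out.write(key);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads the length back as an unsigned value so lengths above 32767 survive. */
    static byte[] decode(byte[] encoded) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(encoded))) {
            byte[] key = new byte[in.readUnsignedShort()];
            in.readFully(key);
            return key;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] encoded = encode(new byte[40000]);
        System.out.println("encoded bytes: " + encoded.length);
        System.out.println("decoded key bytes: " + decode(encoded).length);
    }
}
```

Reading with `readUnsignedShort` rather than `readShort` matters here: a signed read would turn any length above 32767 into a negative number.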

> New SSTable Format
> ------------------
>
>                 Key: CASSANDRA-674
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-674
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.9
>            Reporter: Stu Hood
>            Assignee: Stu Hood
>             Fix For: 0.9
>
>         Attachments: 674-v1.diff, perf-674-v1.txt, perf-trunk-2f3d2c0e4845faf62e33c191d152cb1b3fa62806.txt


[jira] Commented: (CASSANDRA-674) New SSTable Format

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835098#action_12835098 ] 

Jonathan Ellis commented on CASSANDRA-674:
------------------------------------------

i could go for a "put all the metadata at the head of the block, rest of the block is just name value timestamp, name value timestamp... " design.  then you'd have a block index file for as-close-to-random-access-as-you're-gonna-get, a duplicate of block headers in a 2nd file for redundancy, and probably a key-oriented BF file.
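
A minimal sketch of what such a header-first block could look like, assuming a single deletion timestamp and a column count as the only metadata. The layout, class name and method names are hypothetical and are not taken from the patch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class HeaderFirstBlockSketch {
    /** A column as a (name, value, timestamp) triple. */
    static final class Column {
        final String name;
        final String value;
        final long timestamp;

        Column(String name, String value, long timestamp) {
            this.name = name;
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    /** Hypothetical layout: [long deletedAt][int columnCount], then the triples back to back. */
    static byte[] write(long deletedAt, List<Column> columns) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeLong(deletedAt);
            out.writeInt(columns.size());
            for (Column c : columns) {
                out.writeUTF(c.name);
                out.writeUTF(c.value);
                out.writeLong(c.timestamp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads only the metadata at the head of the block; the body is not touched. */
    static long readDeletedAt(byte[] block) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(block))) {
            return in.readLong();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Skips the header, then reads name/value/timestamp triples until the count is reached. */
    static List<Column> readColumns(byte[] block) {
        List<Column> columns = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(block))) {
            in.readLong();
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                String name = in.readUTF();
                String value = in.readUTF();
                long timestamp = in.readLong();
                columns.add(new Column(name, value, timestamp));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return columns;
    }
}
```

With the metadata up front, the duplicate "block headers in a 2nd file" mentioned above would simply be a copy of the first few bytes of each block.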

> New SSTable Format
> ------------------
>
>                 Key: CASSANDRA-674
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-674
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Stu Hood
>            Assignee: Stu Hood
>             Fix For: 0.7
>
>         Attachments: 674-v1.diff, perf-674-v1.txt, perf-trunk-2f3d2c0e4845faf62e33c191d152cb1b3fa62806.txt


[jira] Commented: (CASSANDRA-674) New SSTable Format

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835080#action_12835080 ] 

Jonathan Ellis commented on CASSANDRA-674:
------------------------------------------

another argument in favor of using external indexes instead of in-file "slices" would be CASSANDRA-571 -- it would be easier to skip from one group to another w/ the "slice" indexes next to each other on disk instead of scattered through the data file.
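
To illustrate the point about keeping the "slice" indexes together, here is a small in-memory sketch of an external index that maps the first column name of each slice to its offset in the data file. The class and its methods are hypothetical; the idea is only that moving from one group to the next becomes an index lookup rather than a scan through the data file.

```java
import java.util.Map;
import java.util.TreeMap;

final class ExternalSliceIndexSketch {
    // First column name of each slice -> byte offset of that slice in the data file.
    private final TreeMap<String, Long> offsets = new TreeMap<>();

    void add(String firstColumn, long offset) {
        offsets.put(firstColumn, offset);
    }

    /** Offset of the slice that would contain the column, or -1 if it sorts before every slice. */
    long sliceFor(String column) {
        Map.Entry<String, Long> e = offsets.floorEntry(column);
        return e == null ? -1 : e.getValue();
    }

    /** Offset of the slice following the one that would contain the column, or -1 if there is none. */
    long nextSliceAfter(String column) {
        Map.Entry<String, Long> e = offsets.higherEntry(column);
        return e == null ? -1 : e.getValue();
    }
}
```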

> New SSTable Format
> ------------------
>
>                 Key: CASSANDRA-674
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-674
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Stu Hood
>            Assignee: Stu Hood
>             Fix For: 0.7
>
>         Attachments: 674-v1.diff, perf-674-v1.txt, perf-trunk-2f3d2c0e4845faf62e33c191d152cb1b3fa62806.txt


[jira] Commented: (CASSANDRA-674) New SSTable Format

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839160#action_12839160 ] 

Stu Hood commented on CASSANDRA-674:
------------------------------------

> i could go for a "put all the metadata at the head of the block, rest of the block is just name value timestamp, name value timestamp... "
Similarly, the RCFile design from Hive stores all keys at the head of a block: http://hadoop.apache.org/hive/docs/r0.4.0/api/org/apache/hadoop/hive/ql/io/RCFile.html . I don't know if we should go so far as supporting arbitrary compression per column family, but making the data easier for a generic compression algo to squish is a nice side effect.

> New SSTable Format
> ------------------
>
>                 Key: CASSANDRA-674
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-674
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Stu Hood
>            Assignee: Stu Hood
>             Fix For: 0.7
>
>         Attachments: 674-v1.diff, perf-674-v1.txt, perf-trunk-2f3d2c0e4845faf62e33c191d152cb1b3fa62806.txt


[jira] Commented: (CASSANDRA-674) New SSTable Format

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797025#action_12797025 ] 

Stu Hood commented on CASSANDRA-674:
------------------------------------

List of features stubbed as "FIXME: not implemented" in v1:
 1. Reverse slicing within CFs is not implemented (see SSTableSliceIterator),
 2. Reading SuperColumns is disabled (see SSTable(Slice|Names)Iterator),
 3. The recently added MMAP support for data files is disabled until I can port this SSTableScanner interface to use it (see SSTableReader),
 4. AntiEntropyService is not hashing slices (meaning that major compactions always fail),
 5. SSTable(Import|Export) are broken,
 6. BinaryMemtables will crash on flush,
 7. The bytesRead MBean for CompactionManager is disabled,
 8. AntiCompaction is not using the "skip ranges we don't need" optimization.

Also, I lied in the description above: the patch does not have GZIP compression enabled, but you can add two lines to enable it: add a GZIPInputStream to the chain in SSTableReader.Block.stream(), and a GZIPOutputStream to the chain in SSTableWriter.BlockContext.flushSlice(). There is a memory leak related to reading from compressed blocks which will quickly kill the server, but it should be easy to track down.
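
For readers unfamiliar with the stream chaining described above, a self-contained sketch using byte arrays in place of the real block streams. SSTableReader.Block.stream() and SSTableWriter.BlockContext.flushSlice() are not reproduced here; the class and method names below are hypothetical. The sketch closes both GZIP streams with try-with-resources, since an unclosed GZIPInputStream keeps its Inflater's native memory alive. Whether that is the cause of the leak mentioned above is an assumption, not something the ticket confirms.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class GzipBlockSketch {
    /** Write side: wrap the block's output stream in a GZIPOutputStream. */
    static byte[] compress(byte[] raw) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        // Closing the GZIP stream writes the trailer and releases the Deflater.
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Read side: wrap the block's input stream in a GZIPInputStream. */
    static byte[] decompress(byte[] compressed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Closing the GZIP stream releases the Inflater it created.
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = gz.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```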

Finally, there are tons of other TODOs/FIXMEs scattered around, many of which should be tackled in other tickets.

> New SSTable Format
> ------------------
>
>                 Key: CASSANDRA-674
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-674
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.9
>            Reporter: Stu Hood
>            Assignee: Stu Hood
>             Fix For: 0.9
>
>         Attachments: 674-v1.diff


[jira] Issue Comment Edited: (CASSANDRA-674) New SSTable Format

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800520#action_12800520 ] 

Jonathan Ellis edited comment on CASSANDRA-674 at 1/21/10 3:47 PM:
-------------------------------------------------------------------

mental note: when we change sstable format, let's take advantage of the opportunity to restrict key lengths to 64K (i.e., 16 bits)

edit: our use of writeUTF is already silently enforcing this.  I added a check to our thrift validation to raise an intelligible error to the user if a longer key is sent.
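
For context, DataOutputStream.writeUTF stores the encoded length in a two-byte prefix and throws UTFDataFormatException when the modified UTF-8 encoding exceeds 65535 bytes. The sketch below shows an up-front check of the kind described; it is not the actual thrift validation code, and the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.io.UncheckedIOException;

final class KeyLengthCheck {
    // Largest encoded length that fits the unsigned 16-bit prefix written by writeUTF.
    static final int MAX_KEY_BYTES = 0xFFFF;

    /** Encoded size under the modified UTF-8 rules used by writeUTF. */
    static long modifiedUtf8Length(String s) {
        long len = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x0001 && c <= 0x007F) {
                len += 1;
            } else if (c <= 0x07FF) {
                len += 2; // includes U+0000, which modified UTF-8 writes as two bytes
            } else {
                len += 3;
            }
        }
        return len;
    }

    /** Hypothetical validation hook: fail with a readable message instead of a serialization error. */
    static void validateKey(String key) {
        long encoded = modifiedUtf8Length(key);
        if (encoded > MAX_KEY_BYTES) {
            throw new IllegalArgumentException(
                "Key length " + encoded + " bytes exceeds the maximum of " + MAX_KEY_BYTES);
        }
    }

    /** Returns true if writeUTF accepts the key, false if it rejects it as too long. */
    static boolean writeUtfAccepts(String key) {
        try (DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream())) {
            out.writeUTF(key);
            return true;
        } catch (UTFDataFormatException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```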

      was (Author: jbellis):
    mental note: when we change sstable format, let's take advantage of the opportunity to restrict key lengths to 64K (i.e., 16 bits).
  
> New SSTable Format
> ------------------
>
>                 Key: CASSANDRA-674
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-674
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Stu Hood
>            Assignee: Stu Hood
>             Fix For: 0.7
>
>         Attachments: 674-v1.diff, perf-674-v1.txt, perf-trunk-2f3d2c0e4845faf62e33c191d152cb1b3fa62806.txt


[jira] Commented: (CASSANDRA-674) New SSTable Format

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800135#action_12800135 ] 

Stu Hood commented on CASSANDRA-674:
------------------------------------

kingryan:
>... it seems that there's a degree of overlap with Avro's object container files...
Purely coincidental, I assure you... There might be some benefits in conforming to their standard (we would get streaming support in Hadoop for free), but we need versioning at the SSTReader/Writer level anyway, so versioning within the file is overkill, and I'm fairly sure that the binary serialization we do here will be noticeably faster than Avro.

Adding that magic sync marker seems like a good idea though.
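
As background, Avro object container files write a 16-byte, randomly generated sync marker after each block so that a reader dropped at an arbitrary offset can find the next block boundary. A rough sketch of the same idea for a block file follows; the layout (marker before each block) and all names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

final class SyncMarkerSketch {
    /** Concatenates blocks, writing the marker before each one. */
    static byte[] join(byte[] marker, byte[]... blocks) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] block : blocks) {
            out.write(marker, 0, marker.length);
            out.write(block, 0, block.length);
        }
        return out.toByteArray();
    }

    /**
     * Scans forward from an arbitrary offset and returns the offset of the first
     * block payload whose marker starts at or after that offset, or -1 if none.
     */
    static int nextBlockStart(byte[] file, byte[] marker, int from) {
        for (int i = Math.max(from, 0); i + marker.length <= file.length; i++) {
            byte[] window = Arrays.copyOfRange(file, i, i + marker.length);
            if (Arrays.equals(window, marker)) {
                return i + marker.length;
            }
        }
        return -1;
    }
}
```

A payload could in principle contain the marker bytes by chance; with a random 16-byte marker this is very unlikely, but a real reader would probably still validate the block it lands on.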

> New SSTable Format
> ------------------
>
>                 Key: CASSANDRA-674
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-674
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.9
>            Reporter: Stu Hood
>            Assignee: Stu Hood
>             Fix For: 0.9
>
>         Attachments: 674-v1.diff, perf-674-v1.txt, perf-trunk-2f3d2c0e4845faf62e33c191d152cb1b3fa62806.txt


[jira] Commented: (CASSANDRA-674) New SSTable Format

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800132#action_12800132 ] 

Stu Hood commented on CASSANDRA-674:
------------------------------------

jbellis:
> I would rather see the "things with the same parent" concept be an iterator, with metadata from a separate
> file (like the current key index) used to determine begin/end
I couldn't sleep due to timezone changes, and this suggestion kept jumping into my head. While I don't like the idea of a separate file being necessary in order to read from the 'data' file of the sstable (currently, it stands alone: the index and filter files are optimizations), I think moving all of the location information out of the SliceMarks is a good idea.

To imitate the implementation of indexes in trunk, perhaps the Block in 674-v1 becomes the unit that has its own 'index' (so to speak): the first thing you see when you open the block is the list of Slices contained in the block. Naively, this would be a list of SliceMarks with indexes into the block, but because the Slice information is all stored contiguously, you can optimize it considerably (no need for 'nextKey', and all consecutive slices that share any parents only need those parent keys stored once). Then, following the 'index' for the block, the remainder of the block would just be consecutive columns.
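
The "parent keys stored once" optimization can be sketched as a simple prefix-sharing scheme over the per-block list of Slices. This is an illustration under assumed names, not the SliceMark format from the patch: each entry records how many leading parent keys it shares with the previous entry, the keys that differ, and the Slice's offset into the block.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class SliceIndexSketch {
    /** One entry of the hypothetical per-block index. */
    static final class Entry {
        final int shared;          // leading parent keys reused from the previous entry
        final List<String> suffix; // parent keys that differ from the previous entry
        final int offset;          // where the Slice's columns start within the block

        Entry(int shared, List<String> suffix, int offset) {
            this.shared = shared;
            this.suffix = suffix;
            this.offset = offset;
        }
    }

    /** Builds the index, storing each run of common parent keys only once. */
    static List<Entry> encode(List<List<String>> parents, List<Integer> offsets) {
        List<Entry> out = new ArrayList<>();
        List<String> prev = Collections.emptyList();
        for (int i = 0; i < parents.size(); i++) {
            List<String> cur = parents.get(i);
            int shared = 0;
            while (shared < prev.size() && shared < cur.size()
                    && prev.get(shared).equals(cur.get(shared))) {
                shared++;
            }
            out.add(new Entry(shared, new ArrayList<>(cur.subList(shared, cur.size())), offsets.get(i)));
            prev = cur;
        }
        return out;
    }

    /** Rebuilds the full parent key list of every Slice from the compact entries. */
    static List<List<String>> decode(List<Entry> entries) {
        List<List<String>> out = new ArrayList<>();
        List<String> prev = Collections.emptyList();
        for (Entry e : entries) {
            List<String> cur = new ArrayList<>(prev.subList(0, e.shared));
            cur.addAll(e.suffix);
            out.add(cur);
            prev = cur;
        }
        return out;
    }
}
```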

> But that is just a first impression I am throwing out fwiw. :)
Agreed... I always seem to tend toward waterfall, and it hurts in the long run.

> New SSTable Format
> ------------------
>
>                 Key: CASSANDRA-674
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-674
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.9
>            Reporter: Stu Hood
>            Assignee: Stu Hood
>             Fix For: 0.9
>
>         Attachments: 674-v1.diff, perf-674-v1.txt, perf-trunk-2f3d2c0e4845faf62e33c191d152cb1b3fa62806.txt


[jira] Commented: (CASSANDRA-674) New SSTable Format

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800201#action_12800201 ] 

Jonathan Ellis commented on CASSANDRA-674:
------------------------------------------

> While I don't like the idea of a separate file being necessary in order to read from the 'data' file of the sstable

And I don't like the idea of the data file containing lots of weird speed bumps of headers and such that aren't actually data. :)

> New SSTable Format
> ------------------
>
>                 Key: CASSANDRA-674
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-674
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.9
>            Reporter: Stu Hood
>            Assignee: Stu Hood
>             Fix For: 0.9
>
>         Attachments: 674-v1.diff, perf-674-v1.txt, perf-trunk-2f3d2c0e4845faf62e33c191d152cb1b3fa62806.txt
>
>
> Various tickets exist due to limitations in the SSTable file format, including #16, #47 and #328. Attached is a proposed design/implementation of a new file format for SSTables that addresses a few of these limitations. The implementation has a bunch of issues/fixmes, which I'll describe in the comments.
> The file format is described in the javadoc for the o.a.c.io.SSTableWriter class, but briefly:
>  * Blocks are opaque (except for their header) so that they can be compressed. The index file contains an entry for the first key in every Block. Blocks contain Slices.
>  * Slices are series of columns with the same parents and (deletion) metadata. They can be used to represent ColumnFamilies or SuperColumns (or a slice of columns at any other depth). A single CF can be split across multiple Slices, which can be split across multiple blocks.
>  * Neither Slices nor Blocks have a fixed size or maximum length, but they each have target lengths which can be stretched and broken by very large columns.
> The most interesting concepts from this patch are:
>  * Block compression is possible (currently using GZIP, which has one bug mentioned in the comments),
>  * Compaction involves merging intersecting Slices from input SSTables. Since large rows will be broken down into multiple slices, only the portions of rows that intersect between tables need to be deserialized/merged/held-in-memory,
>  * Indexes for individual rows are gone, since the global index allows random access to the middle of column families that span Blocks, and Slices allow batches of columns to be skipped within a Block.
>  * Bloom filters for individual rows are gone, and the global filter contains ColumnKeys instead, meaning that a query for a column that doesn't exist in a row that does will often not need to seek to the row.
>  * Metadata (deletion/gc time) and ColumnKeys (key, colname1, colname2...) for columns are defined recursively, so deeply nested slices are possible,
>  * Slices representing a single parent (CF, SC, etc) can have different Metadata, meaning that a tombstone Slice from d-f could sit between Slices containing columns a-c and g-h. This allows for eventually consistent range deletes of columns.
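
As a rough illustration of the "entry for the first key in every Block" idea in the description above, a lookup can be sketched as a floor search over the first keys. This is only a sketch under assumptions: the class and method names (BlockIndexSketch, addBlock, blockOffsetFor) and the use of String keys with long file offsets are hypothetical and are not taken from the patch.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: an in-memory index holding one entry per Block,
// keyed by the first key stored in that Block.
final class BlockIndexSketch {
    // first key of each Block -> offset of that Block in the data file (assumed layout)
    private final TreeMap<String, Long> firstKeyToOffset = new TreeMap<>();

    void addBlock(String firstKey, long offset) {
        firstKeyToOffset.put(firstKey, offset);
    }

    /**
     * Returns the offset of the only Block that could contain the key:
     * the Block whose first key is the greatest one <= key.
     * Returns -1 if the key sorts before the first key of every Block.
     */
    long blockOffsetFor(String key) {
        Map.Entry<String, Long> entry = firstKeyToOffset.floorEntry(key);
        return entry == null ? -1L : entry.getValue();
    }
}
```

Because only the first key of each Block is indexed, the Block contents can stay opaque (for example, compressed) until a lookup actually lands in them.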

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (CASSANDRA-674) New SSTable Format

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stu Hood updated CASSANDRA-674:
-------------------------------

    Attachment: 674-v1.diff

> New SSTable Format
> ------------------
>
>                 Key: CASSANDRA-674
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-674
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.9
>            Reporter: Stu Hood
>            Assignee: Stu Hood
>             Fix For: 0.9
>
>         Attachments: 674-v1.diff


[jira] Commented: (CASSANDRA-674) New SSTable Format

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803330#action_12803330 ] 

Jonathan Ellis commented on CASSANDRA-674:
------------------------------------------

Another advantage of having an external index-like structure that duplicates the information in the block headers: if bitrot corrupts a block header, we can still recover.

> New SSTable Format
> ------------------
>
>                 Key: CASSANDRA-674
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-674
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Stu Hood
>            Assignee: Stu Hood
>             Fix For: 0.7
>
>         Attachments: 674-v1.diff, perf-674-v1.txt, perf-trunk-2f3d2c0e4845faf62e33c191d152cb1b3fa62806.txt


[jira] Updated: (CASSANDRA-674) New SSTable Format

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stu Hood updated CASSANDRA-674:
-------------------------------

    Attachment: perf-trunk-2f3d2c0e4845faf62e33c191d152cb1b3fa62806.txt
                perf-674-v1.txt

Here are stress.py runs of current trunk (default config), and 674-v1 applied to trunk with data file mmap support disabled. It should be possible to make this code competitive with trunk once mmap is added back.



[jira] Commented: (CASSANDRA-674) New SSTable Format

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800526#action_12800526 ] 

Jonathan Ellis commented on CASSANDRA-674:
------------------------------------------

(Actually, we are already limited to 64K since we are using writeUTF, but I don't think we are enforcing that limit at the Thrift level.)
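
For reference, the 64K limit mentioned here comes from java.io.DataOutputStream.writeUTF, which prefixes the encoded string with a 2-byte length and throws UTFDataFormatException when the modified UTF-8 form exceeds 65535 bytes. The small check below demonstrates that behaviour; the class and helper names (WriteUtfLimit, fitsInWriteUtf, repeat) are made up for illustration and do not exist in Cassandra.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.util.Arrays;

// Hypothetical helper showing where the writeUTF size limit kicks in.
final class WriteUtfLimit {
    /** Returns true if DataOutputStream.writeUTF accepts the string. */
    static boolean fitsInWriteUtf(String s) {
        try (DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream())) {
            out.writeUTF(s);
            return true;
        } catch (UTFDataFormatException e) {
            // The encoded form is longer than 65535 bytes, the most a 2-byte length prefix can describe.
            return false;
        } catch (IOException e) {
            // Not expected for an in-memory stream.
            throw new IllegalStateException(e);
        }
    }

    /** Builds a string consisting of n copies of c. */
    static String repeat(char c, int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, c);
        return new String(chars);
    }

    public static void main(String[] args) {
        System.out.println(fitsInWriteUtf(repeat('a', 65535))); // encodes to exactly 65535 bytes
        System.out.println(fitsInWriteUtf(repeat('a', 65536))); // one byte over the limit
    }
}
```

Note that the limit applies to encoded bytes, not characters, so a key made of multi-byte characters hits it sooner. A check along these lines at the API boundary would be one way to enforce the limit before the write path is reached.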



[jira] Commented: (CASSANDRA-674) New SSTable Format

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797526#action_12797526 ] 

Stu Hood commented on CASSANDRA-674:
------------------------------------

The one major compaction that typically triggers while inserting 1 million items fails immediately with this code: see #4 in the comments. So, if that major compaction succeeded, writes would probably be slower, and reads would be faster.



[jira] Commented: (CASSANDRA-674) New SSTable Format

Posted by "Kevin Weil (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804167#action_12804167 ] 

Kevin Weil commented on CASSANDRA-674:
--------------------------------------

I'm with Ryan.  Clearly there is a huge caveat because I'm much more of a Hadoop dev than a Cassandra dev.  I'm not at all suggesting that Cassandra should bend over backward to fit another system, but if there is a way to nudge things so as to make technologies work together, I think that's to everyone's benefit.  Hadoop users will have a more straightforward path to Cassandra adoption and vice versa.  Allowing the two technologies to leverage each other's strengths would be a great thing.

> New SSTable Format
> ------------------
>
>                 Key: CASSANDRA-674
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-674
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Stu Hood
>            Assignee: Stu Hood
>             Fix For: 0.7
>
>         Attachments: 674-v1.diff, perf-674-v1.txt, perf-trunk-2f3d2c0e4845faf62e33c191d152cb1b3fa62806.txt


[jira] Commented: (CASSANDRA-674) New SSTable Format

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799273#action_12799273 ] 

Jonathan Ellis commented on CASSANDRA-674:
------------------------------------------

ISTM that Slice is trying to solve the problem "how do I avoid repeating the Key/SC name w/ each column entry, now that I have moved to a global index."  This is the central difficulty with this approach.  So, I definitely agree that we need a concept that means "all the columns w/ the same parent" (sort of like the existing IColumnContainer), but I don't think Slice as it exists here is the right one.  I would rather see the "things with the same parent" concept be an iterator, with metadata from a separate file (like the current key index) used to determine begin/end, rather than an object inside a block, several of which you (potentially) need to assemble to get the "things with the same parent" concept.

I also think that if I were doing this myself I would probably make part 1 be a conversion to the global index and just inefficiently repeat the Key/SC data, and then try to make it efficient with the Slice/iterator-thing next.  But that is just a first impression I am throwing out fwiw. :)
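
One way to picture the suggestion above is the sketch below: column entries are stored flat with the parent repeated on each one (the "inefficient" part 1), and a separately kept index supplies the begin/end positions that an iterator uses to walk "things with the same parent". Everything here is an assumption made for illustration: the names (ParentIteratorSketch, Entry, Range, buildIndex, columnsOf), the String parent paths, and the in-memory list standing in for the data file are not from the patch or from trunk.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.TreeMap;

final class ParentIteratorSketch {
    /** Hypothetical flat entry: the parent path is repeated on every column. */
    static final class Entry {
        final String parent; // e.g. "row1/sc1"
        final String name;
        Entry(String parent, String name) { this.parent = parent; this.name = name; }
    }

    /** Hypothetical index record: positions [begin, end) of one parent's columns. */
    static final class Range {
        final int begin;
        final int end;
        Range(int begin, int end) { this.begin = begin; this.end = end; }
    }

    /** Builds the separate index in one pass over data that is already sorted by parent. */
    static Map<String, Range> buildIndex(List<Entry> data) {
        Map<String, Range> index = new TreeMap<>();
        int begin = 0;
        for (int i = 1; i <= data.size(); i++) {
            // Close the current run at the end of the data or when the parent changes.
            if (i == data.size() || !data.get(i).parent.equals(data.get(begin).parent)) {
                index.put(data.get(begin).parent, new Range(begin, i));
                begin = i;
            }
        }
        return index;
    }

    /** Iterates the columns of one parent, using only the index to find begin/end. */
    static Iterator<Entry> columnsOf(final List<Entry> data, Map<String, Range> index, String parent) {
        final Range range = index.get(parent);
        return new Iterator<Entry>() {
            private int pos = (range == null) ? 0 : range.begin;
            private final int end = (range == null) ? 0 : range.end;

            @Override public boolean hasNext() { return pos < end; }

            @Override public Entry next() {
                if (!hasNext()) throw new NoSuchElementException();
                return data.get(pos++);
            }
        };
    }

    /** Convenience for callers and tests: collects the column names under a parent. */
    static List<String> namesOf(List<Entry> data, Map<String, Range> index, String parent) {
        List<String> names = new ArrayList<>();
        Iterator<Entry> it = columnsOf(data, index, parent);
        while (it.hasNext()) names.add(it.next().name);
        return names;
    }

    /** Small sample data set, sorted by parent. */
    static List<Entry> sample() {
        List<Entry> data = new ArrayList<>();
        data.add(new Entry("row1/sc1", "a"));
        data.add(new Entry("row1/sc1", "b"));
        data.add(new Entry("row1/sc2", "c"));
        data.add(new Entry("row2/sc1", "d"));
        return data;
    }
}
```

In this shape the data carries no grouping object of its own; the grouping lives entirely in the index, so nothing inside a block has to be assembled to recover the columns of one parent.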

