Posted to commits@cassandra.apache.org by "Stu Hood (JIRA)" <ji...@apache.org> on 2010/06/18 02:14:22 UTC

[jira] Created: (CASSANDRA-1207) Don't write BloomFilters for skinny rows

Don't write BloomFilters for skinny rows
----------------------------------------

                 Key: CASSANDRA-1207
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1207
             Project: Cassandra
          Issue Type: Improvement
            Reporter: Stu Hood
            Priority: Critical
             Fix For: 0.7


All rows currently contain a serialized BloomFilter, regardless of size. For smaller rows, it is much more efficient in space and CPU time to not write a BloomFilter, and to eagerly perform lookups against the existing columns.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (CASSANDRA-1207) Don't write BloomFilters for skinny rows

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-1207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880551#action_12880551 ] 

Stu Hood commented on CASSANDRA-1207:
-------------------------------------

> I'm skeptical that reading an index block's worth of columns is cheaper than reading a bloom filter, even for skinny rows
Well, maybe "index block" isn't the correct threshold to make this decision at... I'll do some testing.

I marked this as a critical improvement because, for rows with 5 columns, I saw a more than 25% improvement in compaction speed and disk usage.

> Don't write BloomFilters for skinny rows
> ----------------------------------------
>
>                 Key: CASSANDRA-1207
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1207
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Stu Hood
>            Priority: Minor
>             Fix For: 0.7
>
>         Attachments: 0001-Return-alwaysMatchingBloomFilter-for-0-length-filter.patch, 0002-Conditionally-write-the-row-bloom-filter.patch
>
>
> All rows currently contain a serialized BloomFilter, regardless of size. For smaller rows, it is much more efficient in space and CPU time to not write a BloomFilter, and to eagerly perform lookups against the existing columns.

[jira] Updated: (CASSANDRA-1207) Don't write BloomFilters for skinny rows

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-1207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-1207:
--------------------------------------

    Fix Version/s: 0.7.1
                       (was: 0.8)

After thinking about this (and writing CASSANDRA-1338) I think the automatic approach is better than having users specify something in the CF definition.  I do think we need some testing to find out what the right threshold is, though.

> Don't write BloomFilters for skinny rows
> ----------------------------------------
>
>                 Key: CASSANDRA-1207
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1207
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Stu Hood
>            Priority: Minor
>             Fix For: 0.7.1
>
>         Attachments: 0001-Return-alwaysMatchingBloomFilter-for-0-length-filter.patch, 0002-Conditionally-write-the-row-bloom-filter.patch
>
>
> All rows currently contain a serialized BloomFilter, regardless of size. For smaller rows, it is much more efficient in space and CPU time to not write a BloomFilter, and to eagerly perform lookups against the existing columns.

[jira] Commented: (CASSANDRA-1207) Don't write BloomFilters for skinny rows

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-1207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880397#action_12880397 ] 

Jonathan Ellis commented on CASSANDRA-1207:
-------------------------------------------

I'm skeptical that reading an index block's worth of columns is cheaper than reading a bloom filter, even for skinny rows.

(The main reason we have bloom filters is that in update-heavy workloads we will have lots of row versions, most of which only have a few columns, so when we are requesting specific column names we want to reject rows that don't have that column at all as early as we can.)
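To illustrate the early-rejection argument above, here is a minimal sketch in Java. It is not Cassandra's BloomFilter implementation: SimpleBloomFilter, RowVersion, versionsToRead, the bit-array size, and the hash scheme are all hypothetical names and choices made for this example. Each row version carries a small filter over its column names, and a by-name read only has to touch the versions whose filter does not rule the column out.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;
import java.util.List;

// Hypothetical, simplified Bloom filter over column names (not Cassandra's class).
final class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    SimpleBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // Derive the i-th bucket from two simple base hashes (double hashing).
    private int bucket(String key, int i) {
        int h1 = 17;
        int h2 = 31;
        for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
            h1 = h1 * 31 + b;
            h2 = h2 * 131 + b;
        }
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String columnName) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(bucket(columnName, i));
        }
    }

    // false means "definitely absent"; true means "possibly present".
    boolean mightContain(String columnName) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(bucket(columnName, i))) {
                return false;
            }
        }
        return true;
    }
}

// Hypothetical stand-in for one on-disk version of a row.
final class RowVersion {
    final SimpleBloomFilter filter = new SimpleBloomFilter(128, 3);

    RowVersion(String... columnNames) {
        for (String name : columnNames) {
            filter.add(name);
        }
    }
}

final class RowVersionReads {
    // Count the versions that cannot be rejected up front for a by-name read.
    static int versionsToRead(List<RowVersion> versions, String columnName) {
        int count = 0;
        for (RowVersion version : versions) {
            if (version.filter.mightContain(columnName)) {
                count++;
            }
        }
        return count;
    }
}
```

A filter never produces false negatives, so a version that really holds the column is always read; false positives only cost an unnecessary read.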

> Don't write BloomFilters for skinny rows
> ----------------------------------------
>
>                 Key: CASSANDRA-1207
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1207
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Stu Hood
>            Priority: Critical
>             Fix For: 0.7
>
>         Attachments: 0001-Return-alwaysMatchingBloomFilter-for-0-length-filter.patch, 0002-Conditionally-write-the-row-bloom-filter.patch
>
>
> All rows currently contain a serialized BloomFilter, regardless of size. For smaller rows, it is much more efficient in space and CPU time to not write a BloomFilter, and to eagerly perform lookups against the existing columns.

[jira] Updated: (CASSANDRA-1207) Don't write BloomFilters for skinny rows

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-1207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stu Hood updated CASSANDRA-1207:
--------------------------------

    Attachment: 0001-Return-alwaysMatchingBloomFilter-for-0-length-filter.patch
                0002-Conditionally-write-the-row-bloom-filter.patch

Patchset to conditionally write row BloomFilters, and to use alwaysMatchingBloomFilter when a custom filter has not been written for the row.
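The attached patches are not reproduced in this thread, so the following is only a rough sketch in Java of the idea they describe. RowFilter, RowFilterCodec, MIN_COLUMNS_FOR_FILTER, the 256-bit size, and the single-hash filter are all assumptions made for illustration; in particular the threshold value is a placeholder, not a measured number. The writer emits a zero-length filter for rows below the threshold, and the reader maps a zero-length filter to an always-matching one so that lookups fall through to the columns themselves.

```java
import java.nio.ByteBuffer;
import java.util.BitSet;
import java.util.List;

// Hypothetical read-side view of a per-row filter.
interface RowFilter {
    boolean mightContain(String columnName);
}

final class RowFilterCodec {
    // Placeholder threshold (assumption): rows with fewer columns get no filter.
    static final int MIN_COLUMNS_FOR_FILTER = 16;
    static final int BITS = 256;

    // Used when no filter was written: every lookup proceeds to the columns.
    static final RowFilter ALWAYS_MATCHING = columnName -> true;

    // Layout: 4-byte length, then that many filter bytes (none for skinny rows).
    static byte[] serialize(List<String> columnNames) {
        if (columnNames.size() < MIN_COLUMNS_FOR_FILTER) {
            return ByteBuffer.allocate(4).putInt(0).array();
        }
        BitSet bits = new BitSet(BITS);
        for (String name : columnNames) {
            // Single hash per name, purely to keep the sketch short.
            bits.set(Math.floorMod(name.hashCode(), BITS));
        }
        byte[] filterBytes = bits.toByteArray();
        return ByteBuffer.allocate(4 + filterBytes.length)
                .putInt(filterBytes.length)
                .put(filterBytes)
                .array();
    }

    static RowFilter deserialize(byte[] data) {
        ByteBuffer buffer = ByteBuffer.wrap(data);
        int length = buffer.getInt();
        if (length == 0) {
            return ALWAYS_MATCHING;
        }
        byte[] filterBytes = new byte[length];
        buffer.get(filterBytes);
        BitSet bits = BitSet.valueOf(filterBytes);
        return columnName -> bits.get(Math.floorMod(columnName.hashCode(), BITS));
    }
}
```

With this layout a skinny row pays only the 4-byte length prefix, and readers need no special case beyond the zero-length check.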

> Don't write BloomFilters for skinny rows
> ----------------------------------------
>
>                 Key: CASSANDRA-1207
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1207
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Stu Hood
>            Priority: Critical
>             Fix For: 0.7
>
>         Attachments: 0001-Return-alwaysMatchingBloomFilter-for-0-length-filter.patch, 0002-Conditionally-write-the-row-bloom-filter.patch
>
>
> All rows currently contain a serialized BloomFilter, regardless of size. For smaller rows, it is much more efficient in space and CPU time to not write a BloomFilter, and to eagerly perform lookups against the existing columns.

[jira] Commented: (CASSANDRA-1207) Don't write BloomFilters for skinny rows

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-1207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884330#action_12884330 ] 

Jonathan Ellis commented on CASSANDRA-1207:
-------------------------------------------

I think a better solution would be to allow optionally annotating a ColumnFamily with metadata={bloomfilter,index,both}, with both as the default. (This could be changed at runtime, and at the next compaction we would generate whatever was requested.)

This is because typically you will have "original data" CFs whose columns are either accessed by name (you want a BF, index is unnecessary) or all at once (you don't need either), and "relationship/index" CFs whose columns are accessed by range (you want an index, BF is unnecessary).
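As a rough sketch of what such an annotation might look like in Java (the enum name RowMetadataOption, its fields, and the parse helper are hypothetical and do not correspond to any existing Cassandra API), the three proposed values map onto two flags that a compaction could consult when deciding what to write:

```java
import java.util.Locale;

// Hypothetical per-ColumnFamily setting mirroring metadata={bloomfilter,index,both}.
enum RowMetadataOption {
    BLOOMFILTER(true, false),
    INDEX(false, true),
    BOTH(true, true);

    final boolean writeBloomFilter;
    final boolean writeIndex;

    RowMetadataOption(boolean writeBloomFilter, boolean writeIndex) {
        this.writeBloomFilter = writeBloomFilter;
        this.writeIndex = writeIndex;
    }

    // A missing or blank annotation falls back to the proposed default, BOTH.
    static RowMetadataOption parse(String value) {
        if (value == null || value.trim().isEmpty()) {
            return BOTH;
        }
        return valueOf(value.trim().toUpperCase(Locale.ROOT));
    }
}
```

Note that the proposed value set has no way to express "neither", which is the case described for CFs whose columns are read all at once.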

> Don't write BloomFilters for skinny rows
> ----------------------------------------
>
>                 Key: CASSANDRA-1207
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1207
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Stu Hood
>            Priority: Minor
>             Fix For: 0.7
>
>         Attachments: 0001-Return-alwaysMatchingBloomFilter-for-0-length-filter.patch, 0002-Conditionally-write-the-row-bloom-filter.patch
>
>
> All rows currently contain a serialized BloomFilter, regardless of size. For smaller rows, it is much more efficient in space and CPU time to not write a BloomFilter, and to eagerly perform lookups against the existing columns.

[jira] Updated: (CASSANDRA-1207) Don't write BloomFilters for skinny rows

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-1207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stu Hood updated CASSANDRA-1207:
--------------------------------

    Fix Version/s: 0.8
                       (was: 0.7)

> because typically you will have "original data" CFs whose columns are either accessed by name (you want a BF, index is unnecessary)
Depending on the size of the row (the threshold I think we need to find), you don't want the bloom filter here either, since the disk/OS is likely to bring the entire row into memory anyway. Optimizing the deserialization of columns to skip values would push the threshold up even more.

----

I'm removing this one from 0.7, since we are planning to refactor the file format in 0.8 anyway.

> Don't write BloomFilters for skinny rows
> ----------------------------------------
>
>                 Key: CASSANDRA-1207
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1207
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Stu Hood
>            Priority: Minor
>             Fix For: 0.8
>
>         Attachments: 0001-Return-alwaysMatchingBloomFilter-for-0-length-filter.patch, 0002-Conditionally-write-the-row-bloom-filter.patch
>
>
> All rows currently contain a serialized BloomFilter, regardless of size. For smaller rows, it is much more efficient in space and CPU time to not write a BloomFilter, and to eagerly perform lookups against the existing columns.

[jira] Updated: (CASSANDRA-1207) Don't write BloomFilters for skinny rows

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-1207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-1207:
--------------------------------------

    Priority: Minor  (was: Critical)

> Don't write BloomFilters for skinny rows
> ----------------------------------------
>
>                 Key: CASSANDRA-1207
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1207
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Stu Hood
>            Priority: Minor
>             Fix For: 0.7
>
>         Attachments: 0001-Return-alwaysMatchingBloomFilter-for-0-length-filter.patch, 0002-Conditionally-write-the-row-bloom-filter.patch
>
>
> All rows currently contain a serialized BloomFilter, regardless of size. For smaller rows, it is much more efficient in space and CPU time to not write a BloomFilter, and to eagerly perform lookups against the existing columns.
