You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@cassandra.apache.org by "Stu Hood (JIRA)" <ji...@apache.org> on 2010/02/12 23:28:27 UTC

[jira] Created: (CASSANDRA-790) SSTables limited to (2^31)/15 keys

SSTables limited to (2^31)/15 keys
----------------------------------

                 Key: CASSANDRA-790
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-790
             Project: Cassandra
          Issue Type: Bug
    Affects Versions: 0.5, 0.6, 0.7
            Reporter: Stu Hood
            Priority: Blocker
             Fix For: 0.5, 0.6, 0.7


The current BloomFilter implementation requires a BitSet of (bucket_count * num_keys) in size, and that calculation is currently performed in an integer, which causes overflow for around 140 million keys in one SSTable.

Short term fix: perform the calculation in a long, and cap the value to the maximum size of a BitSet.
Long term fix: begin partitioning BitSets, perhaps using Linear Bloom Filters.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (CASSANDRA-790) SSTables limited to (2^31)/15 keys

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stu Hood updated CASSANDRA-790:
-------------------------------

    Attachment: 0002-Add-timeouts-to-forceBlockingFlush-during-tests.patch
                0001-Change-parameters-to-BloomCalculations-in-order-to-c.patch

Better implementation of the short term solution. Rather than capping the total number of buckets, this patch caps the number of buckets per element.

The second patch adds timeouts to blocking flushes, to minimize hanging tests.

> SSTables limited to (2^31)/15 keys
> ----------------------------------
>
>                 Key: CASSANDRA-790
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-790
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 0.5, 0.6, 0.7
>            Reporter: Stu Hood
>            Priority: Blocker
>             Fix For: 0.5, 0.6, 0.7
>
>         Attachments: 0001-Change-parameters-to-BloomCalculations-in-order-to-c.patch, 0002-Add-timeouts-to-forceBlockingFlush-during-tests.patch
>
>
> The current BloomFilter implementation requires a BitSet of (bucket_count * num_keys) in size, and that calculation is currently performed in an integer, which causes overflow for around 140 million keys in one SSTable.
> Short term fix: perform the calculation in a long, and cap the value to the maximum size of a BitSet.
> Long term fix: begin partitioning BitSets, perhaps using Linear Bloom Filters.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (CASSANDRA-790) SSTables limited to (2^31)/15 keys

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12833927#action_12833927 ] 

Jonathan Ellis commented on CASSANDRA-790:
------------------------------------------

patch 01 comments:

refactoring:
 - please split the refactoring into a separate patch; it's hard to tell what is part of the actual fix here
 - BF constructors that do not chain is a design smell; one of them only being called from tests is also a smell
 - instead of using min/max to force values into acceptable ranges, assert that they are sane
 - I feel part of the BF problems here is that BF is trying to be too high-level.  Wouldn't we be better served by having a low-level BF constructor taking hash & bucket counts, and then factories to do the high level things?

fix:
 - this feels like we're trading an obvious problem (BF constructor throws) for a more subtle one (BF is a no-op when we exceed the spec, as noted by TODO).  wouldn't it be better to log a warning, create the largest BF possible, and degrade gracefully?  This would be easier if the BF constructor were sane as mentioned above.

> SSTables limited to (2^31)/15 keys
> ----------------------------------
>
>                 Key: CASSANDRA-790
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-790
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 0.5, 0.6, 0.7
>            Reporter: Stu Hood
>            Priority: Blocker
>             Fix For: 0.5, 0.6, 0.7
>
>         Attachments: 0001-Change-parameters-to-BloomCalculations-in-order-to-c.patch, 0002-Add-timeouts-to-forceBlockingFlush-during-tests.patch
>
>
> The current BloomFilter implementation requires a BitSet of (bucket_count * num_keys) in size, and that calculation is currently performed in an integer, which causes overflow for around 140 million keys in one SSTable.
> Short term fix: perform the calculation in a long, and cap the value to the maximum size of a BitSet.
> Long term fix: begin partitioning BitSets, perhaps using Linear Bloom Filters.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (CASSANDRA-790) SSTables limited to (2^31)/15 keys

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12833930#action_12833930 ] 

Stu Hood commented on CASSANDRA-790:
------------------------------------

> please split the refactoring into a separate patch
Which refactoring? I don't think I did anything that wasn't necessary in order to cap the number of available buckets.

> BF constructors that do not chain is a design smell; one of them only being called from tests is also a smell
The 'maxFalsePosProb' constructor has never been called anywhere but tests, but it was very elegant, and someone spent a lot of time on it, so I wasn't sure whether to remove it.

> low-level BF constructor taking hash & bucket counts, and then factories to do the high level things
Agreed... that would be much better. I'll add factories with warnings.

> SSTables limited to (2^31)/15 keys
> ----------------------------------
>
>                 Key: CASSANDRA-790
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-790
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 0.5, 0.6, 0.7
>            Reporter: Stu Hood
>            Priority: Blocker
>             Fix For: 0.5, 0.6, 0.7
>
>         Attachments: 0001-Change-parameters-to-BloomCalculations-in-order-to-c.patch, 0002-Add-timeouts-to-forceBlockingFlush-during-tests.patch
>
>
> The current BloomFilter implementation requires a BitSet of (bucket_count * num_keys) in size, and that calculation is currently performed in an integer, which causes overflow for around 140 million keys in one SSTable.
> Short term fix: perform the calculation in a long, and cap the value to the maximum size of a BitSet.
> Long term fix: begin partitioning BitSets, perhaps using Linear Bloom Filters.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (CASSANDRA-790) SSTables limited to (2^31)/15 keys

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12833933#action_12833933 ] 

Jonathan Ellis commented on CASSANDRA-790:
------------------------------------------

> Which refactoring

the method renaming and constructor rearrangement

> SSTables limited to (2^31)/15 keys
> ----------------------------------
>
>                 Key: CASSANDRA-790
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-790
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 0.5, 0.6, 0.7
>            Reporter: Stu Hood
>            Priority: Blocker
>             Fix For: 0.5, 0.6, 0.7
>
>         Attachments: 0001-Change-parameters-to-BloomCalculations-in-order-to-c.patch, 0002-Add-timeouts-to-forceBlockingFlush-during-tests.patch
>
>
> The current BloomFilter implementation requires a BitSet of (bucket_count * num_keys) in size, and that calculation is currently performed in an integer, which causes overflow for around 140 million keys in one SSTable.
> Short term fix: perform the calculation in a long, and cap the value to the maximum size of a BitSet.
> Long term fix: begin partitioning BitSets, perhaps using Linear Bloom Filters.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (CASSANDRA-790) SSTables limited to (2^31)/15 keys

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stu Hood updated CASSANDRA-790:
-------------------------------

    Attachment: short-term-790-check-for-overflow-and-cap-bloom-size.txt

Here is the short term solution.

> SSTables limited to (2^31)/15 keys
> ----------------------------------
>
>                 Key: CASSANDRA-790
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-790
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 0.5, 0.6, 0.7
>            Reporter: Stu Hood
>            Priority: Blocker
>             Fix For: 0.5, 0.6, 0.7
>
>         Attachments: short-term-790-check-for-overflow-and-cap-bloom-size.txt
>
>
> The current BloomFilter implementation requires a BitSet of (bucket_count * num_keys) in size, and that calculation is currently performed in an integer, which causes overflow for around 140 million keys in one SSTable.
> Short term fix: perform the calculation in a long, and cap the value to the maximum size of a BitSet.
> Long term fix: begin partitioning BitSets, perhaps using Linear Bloom Filters.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (CASSANDRA-790) SSTables limited to (2^31)/15 keys

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12833212#action_12833212 ] 

Jonathan Ellis commented on CASSANDRA-790:
------------------------------------------

rather than just cap the filter size, you need to reduce bucketsPerElement too or you will get suboptimal number-of-hashes calculation, since you haven't told it it can't have as big a filter as it wanted.

> SSTables limited to (2^31)/15 keys
> ----------------------------------
>
>                 Key: CASSANDRA-790
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-790
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 0.5, 0.6, 0.7
>            Reporter: Stu Hood
>            Priority: Blocker
>             Fix For: 0.5, 0.6, 0.7
>
>         Attachments: short-term-790-check-for-overflow-and-cap-bloom-size.txt
>
>
> The current BloomFilter implementation requires a BitSet of (bucket_count * num_keys) in size, and that calculation is currently performed in an integer, which causes overflow for around 140 million keys in one SSTable.
> Short term fix: perform the calculation in a long, and cap the value to the maximum size of a BitSet.
> Long term fix: begin partitioning BitSets, perhaps using Linear Bloom Filters.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (CASSANDRA-790) SSTables limited to (2^31)/15 keys

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stu Hood updated CASSANDRA-790:
-------------------------------

    Attachment:     (was: short-term-790-check-for-overflow-and-cap-bloom-size.txt)

> SSTables limited to (2^31)/15 keys
> ----------------------------------
>
>                 Key: CASSANDRA-790
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-790
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 0.5, 0.6, 0.7
>            Reporter: Stu Hood
>            Priority: Blocker
>             Fix For: 0.5, 0.6, 0.7
>
>
> The current BloomFilter implementation requires a BitSet of (bucket_count * num_keys) in size, and that calculation is currently performed in an integer, which causes overflow for around 140 million keys in one SSTable.
> Short term fix: perform the calculation in a long, and cap the value to the maximum size of a BitSet.
> Long term fix: begin partitioning BitSets, perhaps using Linear Bloom Filters.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (CASSANDRA-790) SSTables limited to (2^31)/15 keys

Posted by "Ryan King (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12833634#action_12833634 ] 

Ryan King commented on CASSANDRA-790:
-------------------------------------

Thanks for the work on this. I'll try and test it soon in the situation that brought up the problem.

> SSTables limited to (2^31)/15 keys
> ----------------------------------
>
>                 Key: CASSANDRA-790
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-790
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 0.5, 0.6, 0.7
>            Reporter: Stu Hood
>            Priority: Blocker
>             Fix For: 0.5, 0.6, 0.7
>
>         Attachments: 0001-Change-parameters-to-BloomCalculations-in-order-to-c.patch, 0002-Add-timeouts-to-forceBlockingFlush-during-tests.patch
>
>
> The current BloomFilter implementation requires a BitSet of (bucket_count * num_keys) in size, and that calculation is currently performed in an integer, which causes overflow for around 140 million keys in one SSTable.
> Short term fix: perform the calculation in a long, and cap the value to the maximum size of a BitSet.
> Long term fix: begin partitioning BitSets, perhaps using Linear Bloom Filters.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (CASSANDRA-790) SSTables limited to (2^31)/15 keys

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-790:
-------------------------------------

          Component/s: Core
    Affects Version/s:     (was: 0.7)
                           (was: 0.6)
                           (was: 0.5)

> SSTables limited to (2^31)/15 keys
> ----------------------------------
>
>                 Key: CASSANDRA-790
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-790
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Stu Hood
>            Assignee: Stu Hood
>            Priority: Blocker
>             Fix For: 0.5
>
>         Attachments: 0001-Change-parameters-to-BloomCalculations-in-order-to-c.patch, 0001-Change-parameters-to-BloomCalculations-in-order-to-c.patch, 0002-Add-timeouts-to-forceBlockingFlush-during-tests.patch, 0002-Add-timeouts-to-forceBlockingFlush-during-tests.patch
>
>
> The current BloomFilter implementation requires a BitSet of (bucket_count * num_keys) in size, and that calculation is currently performed in an integer, which causes overflow for around 140 million keys in one SSTable.
> Short term fix: perform the calculation in a long, and cap the value to the maximum size of a BitSet.
> Long term fix: begin partitioning BitSets, perhaps using Linear Bloom Filters.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (CASSANDRA-790) SSTables limited to (2^31)/15 keys

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stu Hood updated CASSANDRA-790:
-------------------------------

    Attachment: 0002-Add-timeouts-to-forceBlockingFlush-during-tests.patch
                0001-Change-parameters-to-BloomCalculations-in-order-to-c.patch

Updated to use factory functions rather than constructors. Also, for the factory function that takes a bucket/elem target, if the target can't be achieved, a warning will be logged, and the filter will degrade as gracefully as possible by using a minimum of 1 bucket/elem, and the maximum size filter.

> SSTables limited to (2^31)/15 keys
> ----------------------------------
>
>                 Key: CASSANDRA-790
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-790
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 0.5, 0.6, 0.7
>            Reporter: Stu Hood
>            Priority: Blocker
>             Fix For: 0.5, 0.6, 0.7
>
>         Attachments: 0001-Change-parameters-to-BloomCalculations-in-order-to-c.patch, 0001-Change-parameters-to-BloomCalculations-in-order-to-c.patch, 0002-Add-timeouts-to-forceBlockingFlush-during-tests.patch, 0002-Add-timeouts-to-forceBlockingFlush-during-tests.patch
>
>
> The current BloomFilter implementation requires a BitSet of (bucket_count * num_keys) in size, and that calculation is currently performed in an integer, which causes overflow for around 140 million keys in one SSTable.
> Short term fix: perform the calculation in a long, and cap the value to the maximum size of a BitSet.
> Long term fix: begin partitioning BitSets, perhaps using Linear Bloom Filters.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.