You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@cassandra.apache.org by "Benedict (JIRA)" <ji...@apache.org> on 2016/02/02 13:22:39 UTC
[jira] [Commented] (CASSANDRA-9167) Improve bloom-filter false-positive-ratio

    [ https://issues.apache.org/jira/browse/CASSANDRA-9167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15128161#comment-15128161 ] 

Benedict commented on CASSANDRA-9167:
-------------------------------------

[~snazy]: sorry for completely letting this fester.  It looks good to me, the only question is if we _want_ to trade CPU and memory-indirection costs.  In principle we do, since this is to prevent disk accesses.  However for many of our benchmark in-memory workloads this is probably only increasing our costs.  However the cost increase is likely minimal compared to our other costs, so on balance with the present state of affairs I'd say this is a pretty darn positive change.

That said, I think it would be neater if we could permit a degree of configurability, where we weigh the marginal cost increase versus the marginal false positive gain for any added hash function, and provide users the option of configuring the tradeoff between the two.

That also said, it's perhaps overthinking things for now, and this positive if unglamorous change should have been included a long time ago IMO.

Assuming it goes in roughly as-is, a few nits/suggestions:

# Add a 'd' suffix to the literals
# Re-calculate the buckets per-element after rounding the hash count, instead of adding a fudge-factor (after all, half the time that fudge factor is wasting space, the other half it may be insufficient)?

> Improve bloom-filter false-positive-ratio
> -----------------------------------------
>
>                 Key: CASSANDRA-9167
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9167
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Robert Stupp
>            Assignee: Robert Stupp
>            Priority: Minor
>              Labels: perfomance
>
> {{org.apache.cassandra.utils.BloomCalculations}} performs some table lookups to calculate the bloom filter specification (size, # of hashes). Using the exact maths for that computation brings a better false-positive-ratio (the maths usually returns higher numbers for hash-counts).
> TL;DR increasing the number of hash-rounds brings a nice improvement. Finally it's a trade-off between CPU and I/O.
> ||false-positive-chance||elements||capacity||hash count new||false-positive-ratio new||hash count current||false-positive-ratio current||improvement
> |0.1|10000|50048|3|0.0848|3|0.0848|0
> |0.1|100000|500032|3|0.09203|3|0.09203|0
> |0.1|1000000|5000064|3|0.0919|3|0.0919|0
> |0.1|10000000|50000064|3|0.09182|3|0.09182|0
> |0.1|100000000|500000064|3|0.091874|3|0.091874|0
> |0.01|10000|100032|7|0.0092|5|0.0107|0.1630434783
> |0.01|100000|1000064|7|0.00818|5|0.00931|0.1381418093
> |0.01|1000000|10000064|7|0.008072|5|0.009405|0.1651387512
> |0.01|10000000|100000064|7|0.008174|5|0.009375|0.146929288
> |0.01|100000000|1000000064|7|0.008197|5|0.009428|0.150176894
> |0.001|10000|150080|10|0.0008|7|0.001|0.25
> |0.001|100000|1500032|10|0.0006|7|0.00094|0.5666666667
> |0.001|1000000|15000064|10|0.000717|7|0.000991|0.3821478382
> |0.001|10000000|150000064|10|0.000743|7|0.000992|0.33512786
> |0.001|100000000|1500000064|10|0.000741|7|0.001002|0.3522267206
> |0.0001|10000|200064|13|0|10|0.0002|#DIV/0!
> |0.0001|100000|2000064|13|0.00004|10|0.0001|1.5
> |0.0001|1000000|20000064|13|0.000075|10|0.000091|0.2133333333
> |0.0001|10000000|200000064|13|0.000069|10|0.000087|0.2608695652
> |0.0001|100000000|2000000064|13|0.000068|10|0.00009|0.3235294118
> If we decide to allow more hash-rounds, it could be nicely back-ported even to 2.0 without affecting existing sstables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)