You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/11/22 11:38:20 UTC

[GitHub] [arrow-rs] tustvold commented on a diff in pull request #3119: parquet bloom filter part III: add sbbf writer, remove `bloom` default feature, add reader properties

tustvold commented on code in PR #3119:
URL: https://github.com/apache/arrow-rs/pull/3119#discussion_r1029206419


##########
parquet/src/file/properties.rs:
##########
@@ -272,6 +301,52 @@ pub struct WriterPropertiesBuilder {
     sorting_columns: Option<Vec<SortingColumn>>,
 }
 
+macro_rules! def_opt_field_setter {
+    ($field: ident, $type: ty) => {
+        paste! {
+            pub fn [<set_ $field>](&mut self, value: $type) -> &mut Self {
+                self.$field = Some(value);
+                self
+            }
+        }
+    };
+    ($field: ident, $type: ty, $min_value:expr, $max_value:expr) => {
+        paste! {
+            pub fn [<set_ $field>](&mut self, value: $type) -> &mut Self {
+                if ($min_value..=$max_value).contains(&value) {

Review Comment:
   I would expect this to error or panic, not just ignore a value out of range?



##########
parquet/src/bloom_filter/mod.rs:
##########
@@ -113,7 +118,43 @@ fn read_bloom_filter_header_and_length(
     ))
 }
 
+const BITSET_MIN_LENGTH: usize = 32;
+const BITSET_MAX_LENGTH: usize = 128 * 1024 * 1024;
+
+#[inline]
+fn optimal_num_of_bytes(num_bytes: usize) -> usize {
+    let num_bytes = num_bytes.min(BITSET_MAX_LENGTH);
+    let num_bytes = num_bytes.max(BITSET_MIN_LENGTH);
+    num_bytes.next_power_of_two()
+}
+
+// see http://algo2.iti.kit.edu/documents/cacheefficientbloomfilters-jea.pdf
+// given fpp = (1 - e^(-k * n / m)) ^ k
+// we have m = - k * n / ln(1 - fpp ^ (1 / k))
+// where k = number of hash functions, m = number of bits, n = number of distinct values
+#[inline]
+fn num_of_bits_from_ndv_fpp(ndv: u64, fpp: f64) -> usize {
+    let num_bits = -8.0 * ndv as f64 / (1.0 - fpp.powf(1.0 / 8.0)).ln();
+    num_bits as usize
+}
+
 impl Sbbf {
+    /// Create a new [Sbbf] with given number of distinct values and false positive probability.
+    /// Will panic if `fpp` is greater than 1.0 or less than 0.0.

Review Comment:
   ```suggestion
       /// Will panic if `fpp` is greater than or equal to 1.0 or less than 0.0.
   ```



##########
parquet/src/file/properties.rs:
##########
@@ -272,6 +301,52 @@ pub struct WriterPropertiesBuilder {
     sorting_columns: Option<Vec<SortingColumn>>,
 }
 
+macro_rules! def_opt_field_setter {

Review Comment:
   Just an observation that these macros are potentially more verbose than the alternative. Perhaps I'm old-fashioned but I'm not a massive fan of macros aside from where absolutely necessary, as they complicate debugging and legibility



##########
parquet/src/file/properties.rs:
##########
@@ -272,6 +301,52 @@ pub struct WriterPropertiesBuilder {
     sorting_columns: Option<Vec<SortingColumn>>,
 }
 
+macro_rules! def_opt_field_setter {
+    ($field: ident, $type: ty) => {
+        paste! {
+            pub fn [<set_ $field>](&mut self, value: $type) -> &mut Self {
+                self.$field = Some(value);
+                self
+            }
+        }
+    };
+    ($field: ident, $type: ty, $min_value:expr, $max_value:expr) => {

Review Comment:
   This variant it only used in one place, and so I wonder if it needs to be a macro



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org