You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/11/15 15:06:15 UTC

[GitHub] [arrow-rs] Jimexist opened a new pull request, #3119: parquet bloom filter part III: add sbbf writer

Jimexist opened a new pull request, #3119:
URL: https://github.com/apache/arrow-rs/pull/3119

   # Which issue does this PR close?
   
   - previous PR: #3102
   - related issue: #3023 
   
   # Rationale for this change
    
   add sbbf filter during metadata writing phase
   
   # What changes are included in this PR?
   
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   # Are there any user-facing changes?
   
   
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!---
   If there are any breaking changes to public APIs, please add the `breaking change` label.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] Jimexist commented on a diff in pull request #3119: parquet bloom filter part III: add sbbf writer, remove `bloom` default feature, add reader properties

Posted by GitBox <gi...@apache.org>.

Jimexist commented on code in PR #3119:
URL: https://github.com/apache/arrow-rs/pull/3119#discussion_r1027043345


##########
parquet/src/bloom_filter/mod.rs:
##########
@@ -113,7 +116,48 @@ fn read_bloom_filter_header_and_length(
     ))
 }
 
+const BITSET_MIN_LENGTH: usize = 32;
+const BITSET_MAX_LENGTH: usize = 128 * 1024 * 1024;
+
+#[inline]
+fn optimal_num_of_bytes(num_bytes: usize) -> usize {
+    let num_bytes = num_bytes.min(BITSET_MAX_LENGTH);
+    let num_bytes = num_bytes.max(BITSET_MIN_LENGTH);
+    num_bytes.next_power_of_two()
+}
+
+// see http://algo2.iti.kit.edu/documents/cacheefficientbloomfilters-jea.pdf
+// given fpp = (1 - e^(-k * n / m)) ^ k
+// we have m = - k * n / ln(1 - fpp ^ (1 / k))
+// where k = number of hash functions, m = number of bits, n = number of distinct values
+#[inline]
+fn num_of_bits_from_ndv_fpp(ndv: u64, fpp: f64) -> f64 {
+    -8.0 * ndv as f64 / (1.0 - fpp.powf(1.0 / 8.0)).ln()
+}
+
 impl Sbbf {
+    /// Create a new [Sbbf] with given number of distinct values and false positive probability.
+    /// Will panic if `fpp` is greater than 1.0 or less than 0.0.
+    pub fn new_with_ndv_fpp(ndv: u64, fpp: f64) -> Self {

Review Comment:
   will update once we decided on using fpp or ndv or both:
   - #3138 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] tustvold commented on a diff in pull request #3119: parquet bloom filter part III: add sbbf writer, remove `bloom` default feature, add reader properties

Posted by GitBox <gi...@apache.org>.

tustvold commented on code in PR #3119:
URL: https://github.com/apache/arrow-rs/pull/3119#discussion_r1029226696


##########
parquet/src/file/properties.rs:
##########
@@ -515,27 +595,24 @@ impl Default for EnabledStatistics {
 ///
 /// If a field is `None`, it means that no specific value has been set for this column,
 /// so some subsequent or default value must be used.
-#[derive(Debug, Clone, PartialEq)]
+#[derive(Debug, Clone, Default, PartialEq)]
 struct ColumnProperties {
     encoding: Option<Encoding>,
     codec: Option<Compression>,
     dictionary_enabled: Option<bool>,
     statistics_enabled: Option<EnabledStatistics>,
     max_statistics_size: Option<usize>,
+    /// bloom filter enabled
+    bloom_filter_enabled: Option<bool>,
+    /// bloom filter expected number of distinct values
+    bloom_filter_ndv: Option<u64>,
+    /// bloom filter false positive probability
+    bloom_filter_fpp: Option<f64>,
+    /// bloom filter max number of bytes
+    bloom_filter_max_bytes: Option<u32>,

Review Comment:
   I wonder if it would be simpler to just ask users to specify the bloom filter size, and provide a free function to compute the size based on ndv and fpp?
   
   The interaction of these three properties isn't immediately apparent?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] Jimexist commented on a diff in pull request #3119: parquet bloom filter part III: add sbbf writer, remove `bloom` feature, add reader properties

Posted by GitBox <gi...@apache.org>.

Jimexist commented on code in PR #3119:
URL: https://github.com/apache/arrow-rs/pull/3119#discussion_r1024625516


##########
parquet/Cargo.toml:
##########
@@ -57,7 +57,8 @@ seq-macro = { version = "0.3", default-features = false }
 futures = { version = "0.3", default-features = false, features = ["std"], optional = true }
 tokio = { version = "1.0", optional = true, default-features = false, features = ["macros", "rt", "io-util"] }
 hashbrown = { version = "0.13", default-features = false }
-twox-hash = { version = "1.6", optional = true }
+twox-hash = { version = "1.6", default-features = false }

Review Comment:
   it by default relies on `rand` which breaks `wasm32`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] Jimexist commented on a diff in pull request #3119: parquet bloom filter part III: add sbbf writer, remove `bloom` default feature, add reader properties

Posted by GitBox <gi...@apache.org>.

Jimexist commented on code in PR #3119:
URL: https://github.com/apache/arrow-rs/pull/3119#discussion_r1027038712


##########
parquet/src/file/properties.rs:
##########
@@ -599,17 +681,25 @@ impl ColumnProperties {
     fn max_statistics_size(&self) -> Option<usize> {
         self.max_statistics_size
     }
+
+    def_opt_field_getter!(bloom_filter_enabled, bool);
+    def_opt_field_getter!(bloom_filter_fpp, f64);
+    def_opt_field_getter!(bloom_filter_max_bytes, u32);
+    def_opt_field_getter!(bloom_filter_ndv, u64);
 }
 
 /// Reference counted reader properties.
 pub type ReaderPropertiesPtr = Arc<ReaderProperties>;
 
+const DEFAULT_READ_BLOOM_FILTER: bool = true;

Review Comment:
   changed to false by default



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] tustvold commented on a diff in pull request #3119: parquet bloom filter part III: add sbbf writer, remove `bloom` default feature, add reader properties

Posted by GitBox <gi...@apache.org>.

tustvold commented on code in PR #3119:
URL: https://github.com/apache/arrow-rs/pull/3119#discussion_r1029228086


##########
parquet/src/file/properties.rs:
##########
@@ -622,11 +712,17 @@ impl ReaderProperties {
     pub(crate) fn codec_options(&self) -> &CodecOptions {
         &self.codec_options
     }
+
+    /// Returns whether to read bloom filter
+    pub(crate) fn read_bloom_filter(&self) -> bool {
+        self.read_bloom_filter
+    }
 }
 
 /// Reader properties builder.
 pub struct ReaderPropertiesBuilder {
     codec_options_builder: CodecOptionsBuilder,
+    read_bloom_filter: Option<bool>,

Review Comment:
   ```suggestion
       read_bloom_filter: bool,
   ```



##########
parquet/src/file/properties.rs:
##########
@@ -635,13 +731,17 @@ impl ReaderPropertiesBuilder {
     fn with_defaults() -> Self {
         Self {
             codec_options_builder: CodecOptionsBuilder::default(),
+            read_bloom_filter: None,

Review Comment:
   ```suggestion
               read_bloom_filter: DEFAULT_READ_BLOOM_FILTER,
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] tustvold commented on a diff in pull request #3119: parquet bloom filter part III: add sbbf writer, remove `bloom` default feature, add reader properties

Posted by GitBox <gi...@apache.org>.

tustvold commented on code in PR #3119:
URL: https://github.com/apache/arrow-rs/pull/3119#discussion_r1029224288


##########
parquet/src/column/writer/mod.rs:
##########
@@ -231,6 +238,19 @@ impl<'a, E: ColumnValueEncoder> GenericColumnWriter<'a, E> {
         // Used for level information
         encodings.insert(Encoding::RLE);
 
+        let bloom_filter_enabled = props.bloom_filter_enabled(descr.path());
+        let bloom_filter = if bloom_filter_enabled {
+            if let Some(ndv) = props.bloom_filter_ndv(descr.path()) {
+                let fpp = props.bloom_filter_fpp(descr.path());
+                Some(Sbbf::new_with_ndv_fpp(ndv, fpp))
+            } else {

Review Comment:
   I think it is perhaps a little surprising that `bloom_filter_max_bytes` is ignored if `ndv` is set



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] alamb commented on a diff in pull request #3119: parquet bloom filter part III: add sbbf writer, remove `bloom` default feature, add reader properties

Posted by GitBox <gi...@apache.org>.

alamb commented on code in PR #3119:
URL: https://github.com/apache/arrow-rs/pull/3119#discussion_r1026750561


##########
parquet/Cargo.toml:
##########
@@ -77,7 +78,7 @@ rand = { version = "0.8", default-features = false, features = ["std", "std_rng"
 all-features = true
 
 [features]
-default = ["arrow", "bloom", "snap", "brotli", "flate2", "lz4", "zstd", "base64"]

Review Comment:
   👍 



##########
parquet/src/bloom_filter/mod.rs:
##########
@@ -113,7 +116,48 @@ fn read_bloom_filter_header_and_length(
     ))
 }
 
+const BITSET_MIN_LENGTH: usize = 32;
+const BITSET_MAX_LENGTH: usize = 128 * 1024 * 1024;
+
+#[inline]
+fn optimal_num_of_bytes(num_bytes: usize) -> usize {
+    let num_bytes = num_bytes.min(BITSET_MAX_LENGTH);
+    let num_bytes = num_bytes.max(BITSET_MIN_LENGTH);
+    num_bytes.next_power_of_two()
+}
+
+// see http://algo2.iti.kit.edu/documents/cacheefficientbloomfilters-jea.pdf
+// given fpp = (1 - e^(-k * n / m)) ^ k
+// we have m = - k * n / ln(1 - fpp ^ (1 / k))
+// where k = number of hash functions, m = number of bits, n = number of distinct values
+#[inline]
+fn num_of_bits_from_ndv_fpp(ndv: u64, fpp: f64) -> f64 {
+    -8.0 * ndv as f64 / (1.0 - fpp.powf(1.0 / 8.0)).ln()
+}
+
 impl Sbbf {
+    /// Create a new [Sbbf] with given number of distinct values and false positive probability.
+    /// Will panic if `fpp` is greater than 1.0 or less than 0.0.
+    pub fn new_with_ndv_fpp(ndv: u64, fpp: f64) -> Self {

Review Comment:
   Since this is a function meant for use outside the parquet crate I would prefer it return an error rather than panic with bad input. 



##########
parquet/src/file/properties.rs:
##########
@@ -82,6 +83,9 @@ const DEFAULT_STATISTICS_ENABLED: EnabledStatistics = EnabledStatistics::Page;
 const DEFAULT_MAX_STATISTICS_SIZE: usize = 4096;
 const DEFAULT_MAX_ROW_GROUP_SIZE: usize = 1024 * 1024;
 const DEFAULT_CREATED_BY: &str = env!("PARQUET_CREATED_BY");
+const DEFAULT_BLOOM_FILTER_ENABLED: bool = false;
+const DEFAULT_BLOOM_FILTER_MAX_BYTES: u32 = 1024 * 1024;

Review Comment:
   👍 



##########
parquet/src/file/metadata.rs:
##########
@@ -236,7 +236,7 @@ pub struct RowGroupMetaData {
 }
 
 impl RowGroupMetaData {
-    /// Returns builer for row group metadata.
+    /// Returns builder for row group metadata.

Review Comment:
   👍 



##########
parquet/src/file/properties.rs:
##########
@@ -599,17 +681,25 @@ impl ColumnProperties {
     fn max_statistics_size(&self) -> Option<usize> {
         self.max_statistics_size
     }
+
+    def_opt_field_getter!(bloom_filter_enabled, bool);
+    def_opt_field_getter!(bloom_filter_fpp, f64);
+    def_opt_field_getter!(bloom_filter_max_bytes, u32);
+    def_opt_field_getter!(bloom_filter_ndv, u64);
 }
 
 /// Reference counted reader properties.
 pub type ReaderPropertiesPtr = Arc<ReaderProperties>;
 
+const DEFAULT_READ_BLOOM_FILTER: bool = true;

Review Comment:
   I think this perhaps should default to `false` to maintain the current behavior and avoid (potentially) costly semi-random IO if the reader is not going to read the bloom filter



##########
parquet/src/file/writer.rs:
##########
@@ -212,6 +220,35 @@ impl<W: Write> SerializedFileWriter<W> {
         Ok(())
     }
 
+    /// Serialize all the bloom filter to the file
+    fn write_bloom_filters(&mut self, row_groups: &mut [RowGroup]) -> Result<()> {
+        // iter row group
+        // iter each column
+        // write bloom filter to the file
+        for (row_group_idx, row_group) in row_groups.iter_mut().enumerate() {
+            for (column_idx, column_chunk) in row_group.columns.iter_mut().enumerate() {
+                match &self.bloom_filters[row_group_idx][column_idx] {
+                    Some(bloom_filter) => {

Review Comment:
   Given the interactionwith TCompact protocol for reading is in `Sbff` what would you think about moving the interaction for writing there as well (like `Sbff::write(writer)`)?



##########
parquet/src/file/properties.rs:
##########
@@ -255,6 +279,11 @@ impl WriterProperties {
             .or_else(|| self.default_column_properties.max_statistics_size())
             .unwrap_or(DEFAULT_MAX_STATISTICS_SIZE)
     }
+
+    def_col_property_getter!(bloom_filter_enabled, bool, DEFAULT_BLOOM_FILTER_ENABLED);

Review Comment:
   I think these properties need docstrings -- I am happy to help write them. In particular, I think it should mention that the ndv and fpp are knobs that allow for control over bloom filter accuracy. Also it should mention if these limits are for each column chunk or the entire file (e.g. the ndv value is not the number of distinct values in the entire column)
   
   



##########
parquet/src/bloom_filter/mod.rs:
##########
@@ -292,4 +363,46 @@ mod tests {
         assert_eq!(num_bytes, 32_i32);
         assert_eq!(20, SBBF_HEADER_SIZE_ESTIMATE);
     }
+
+    #[test]
+    fn test_optimal_num_of_bytes() {
+        for (input, expected) in &[
+            (0, 32),
+            (9, 32),
+            (31, 32),
+            (32, 32),
+            (33, 64),
+            (99, 128),
+            (1024, 1024),
+            (999_000_000, 128 * 1024 * 1024),
+        ] {
+            assert_eq!(*expected, optimal_num_of_bytes(*input));
+        }
+    }
+
+    #[test]
+    fn test_num_of_bits_from_ndv_fpp() {
+        for (fpp, ndv, num_bits) in &[
+            (0.1, 10, 57),
+            (0.01, 10, 96),
+            (0.001, 10, 146),
+            (0.1, 100, 577),
+            (0.01, 100, 968),
+            (0.001, 100, 1460),
+            (0.1, 1000, 5772),
+            (0.01, 1000, 9681),
+            (0.001, 1000, 14607),
+            (0.1, 10000, 57725),
+            (0.01, 10000, 96815),
+            (0.001, 10000, 146076),
+            (0.1, 100000, 577254),
+            (0.01, 100000, 968152),
+            (0.001, 100000, 1460769),
+            (0.1, 1000000, 5772541),
+            (0.01, 1000000, 9681526),
+            (0.001, 1000000, 14607697),

Review Comment:
   does this mean a 14MB bloom filter? Its ok as there is a limit in `optimal_num_of_bytes ` but when I saw this it was just 🤯 
   
   It might also be good to pass in some value larger than 2^32 to test there isn't an overflow problem lurking



##########
parquet/src/bloom_filter/mod.rs:
##########
@@ -128,6 +172,33 @@ impl Sbbf {
         Self(data)
     }
 
+    /// Write the bitset in serialized form to the writer.
+    pub fn write_bitset<W: Write>(&self, mut writer: W) -> Result<(), ParquetError> {

Review Comment:
   I think it would be good to write a test for round tripping the bloom filters (as in write a SBFF to a Vec<u8> and then read it back out and verify it is the same). Specifically it would be nice to verify the bytes are not scrambled and the lengths are correct and handle empty bitsets (if that is possible)



##########
parquet/src/bloom_filter/mod.rs:
##########
@@ -113,7 +116,48 @@ fn read_bloom_filter_header_and_length(
     ))
 }
 
+const BITSET_MIN_LENGTH: usize = 32;
+const BITSET_MAX_LENGTH: usize = 128 * 1024 * 1024;
+
+#[inline]
+fn optimal_num_of_bytes(num_bytes: usize) -> usize {
+    let num_bytes = num_bytes.min(BITSET_MAX_LENGTH);
+    let num_bytes = num_bytes.max(BITSET_MIN_LENGTH);
+    num_bytes.next_power_of_two()
+}
+
+// see http://algo2.iti.kit.edu/documents/cacheefficientbloomfilters-jea.pdf

Review Comment:
   Here is the parquet-mr code: https://github.com/apache/parquet-mr/blob/d057b39d93014fe40f5067ee4a33621e65c91552/parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BlockSplitBloomFilter.java#L277-L304
   
   That looks very similar



##########
parquet/src/file/reader.rs:
##########
@@ -145,9 +144,8 @@ pub trait RowGroupReader: Send + Sync {
         Ok(col_reader)
     }
 
-    #[cfg(feature = "bloom")]
     /// Get bloom filter for the `i`th column chunk, if present.

Review Comment:
   ```suggestion
       /// Get bloom filter for the `i`th column chunk, if present and the reader was configured
       /// to read bloom filters.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] XinyuZeng commented on pull request #3119: parquet bloom filter part III: add sbbf writer, remove `bloom` default feature, add reader properties

Posted by "XinyuZeng (via GitHub)" <gi...@apache.org>.

XinyuZeng commented on PR #3119:
URL: https://github.com/apache/arrow-rs/pull/3119#issuecomment-1435577674

   > the arrow parquet C++ writer seems to allow for the fpp setting
   > 
   > https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N5arrow8adapters3orc12WriteOptions16bloom_filter_fppE
   
   nit: this reference is for arrow ORC C++ writer, parquet C++ does not enable write bloom filter yet.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] XinyuZeng commented on a diff in pull request #3119: parquet bloom filter part III: add sbbf writer, remove `bloom` default feature, add reader properties

Posted by "XinyuZeng (via GitHub)" <gi...@apache.org>.

XinyuZeng commented on code in PR #3119:
URL: https://github.com/apache/arrow-rs/pull/3119#discussion_r1110973057


##########
parquet/src/bloom_filter/mod.rs:
##########
@@ -113,7 +116,48 @@ fn read_bloom_filter_header_and_length(
     ))
 }
 
+const BITSET_MIN_LENGTH: usize = 32;
+const BITSET_MAX_LENGTH: usize = 128 * 1024 * 1024;
+
+#[inline]
+fn optimal_num_of_bytes(num_bytes: usize) -> usize {
+    let num_bytes = num_bytes.min(BITSET_MAX_LENGTH);
+    let num_bytes = num_bytes.max(BITSET_MIN_LENGTH);
+    num_bytes.next_power_of_two()
+}
+
+// see http://algo2.iti.kit.edu/documents/cacheefficientbloomfilters-jea.pdf

Review Comment:
   @Jimexist Wondering why here we round num_bytes to next power of two? parquet-mr just rounds to (k * BITS_PER_BLOCK)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] alamb commented on pull request #3119: parquet bloom filter part III: add sbbf writer, remove `bloom` feature, add reader properties

Posted by GitBox <gi...@apache.org>.

alamb commented on PR #3119:
URL: https://github.com/apache/arrow-rs/pull/3119#issuecomment-1319118852

   I believe @tustvold  is away for a few days. I plan to review this PR in more detail tomorrow


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] ursabot commented on pull request #3119: parquet bloom filter part III: add sbbf writer, remove `bloom` default feature, add reader properties

Posted by GitBox <gi...@apache.org>.

ursabot commented on PR #3119:
URL: https://github.com/apache/arrow-rs/pull/3119#issuecomment-1323569190

   Benchmark runs are scheduled for baseline = 004a151e8df711292062236f8a94c09b6e18ef47 and contender = e214ccccc702d0295fbf59258a6a817cd09ac4ea. e214ccccc702d0295fbf59258a6a817cd09ac4ea is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/b0129e68aa1b479eaaef3667f11aaeb2...e7092960bb6940b9ad278dfad0b80959/)
   [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on test-mac-arm] [test-mac-arm](https://conbench.ursa.dev/compare/runs/12078ae5b3f741e39cbc403e9e2f6ea1...ddb988c0d9ff4aca9859940b79077777/)
   [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/b74f243d4f1b4aec9c99157eb9521125...dbd83da27ff345e98db73972aed170d9/)
   [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/b7cae1d80f8f457999a47b54dd83fffa...1ddf1e1873b04797a3f1d1c3fefe19c8/)
   Buildkite builds:
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] Jimexist commented on pull request #3119: parquet bloom filter part III: add sbbf writer, remove `bloom` default feature, add reader properties

Posted by GitBox <gi...@apache.org>.

Jimexist commented on PR #3119:
URL: https://github.com/apache/arrow-rs/pull/3119#issuecomment-1320806707

   thanks @alamb for the detailed comment. i wish to merge as is and then for subsequent steps:
   
   1. add an integration or round trip tests or maybe both
   2. adjust overall configurable parameters and then add docstring after they are settled
   
   do you think this is a good idea given that we are not releasing a new version very soon?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] tustvold merged pull request #3119: parquet bloom filter part III: add sbbf writer, remove `bloom` default feature, add reader properties

Posted by GitBox <gi...@apache.org>.

tustvold merged PR #3119:
URL: https://github.com/apache/arrow-rs/pull/3119


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] tustvold commented on a diff in pull request #3119: parquet bloom filter part III: add sbbf writer, remove `bloom` default feature, add reader properties

Posted by GitBox <gi...@apache.org>.

tustvold commented on code in PR #3119:
URL: https://github.com/apache/arrow-rs/pull/3119#discussion_r1029206419


##########
parquet/src/file/properties.rs:
##########
@@ -272,6 +301,52 @@ pub struct WriterPropertiesBuilder {
     sorting_columns: Option<Vec<SortingColumn>>,
 }
 
+macro_rules! def_opt_field_setter {
+    ($field: ident, $type: ty) => {
+        paste! {
+            pub fn [<set_ $field>](&mut self, value: $type) -> &mut Self {
+                self.$field = Some(value);
+                self
+            }
+        }
+    };
+    ($field: ident, $type: ty, $min_value:expr, $max_value:expr) => {
+        paste! {
+            pub fn [<set_ $field>](&mut self, value: $type) -> &mut Self {
+                if ($min_value..=$max_value).contains(&value) {

Review Comment:
   I would expect this to error or panic, not just ignore a value out of range?



##########
parquet/src/bloom_filter/mod.rs:
##########
@@ -113,7 +118,43 @@ fn read_bloom_filter_header_and_length(
     ))
 }
 
+const BITSET_MIN_LENGTH: usize = 32;
+const BITSET_MAX_LENGTH: usize = 128 * 1024 * 1024;
+
+#[inline]
+fn optimal_num_of_bytes(num_bytes: usize) -> usize {
+    let num_bytes = num_bytes.min(BITSET_MAX_LENGTH);
+    let num_bytes = num_bytes.max(BITSET_MIN_LENGTH);
+    num_bytes.next_power_of_two()
+}
+
+// see http://algo2.iti.kit.edu/documents/cacheefficientbloomfilters-jea.pdf
+// given fpp = (1 - e^(-k * n / m)) ^ k
+// we have m = - k * n / ln(1 - fpp ^ (1 / k))
+// where k = number of hash functions, m = number of bits, n = number of distinct values
+#[inline]
+fn num_of_bits_from_ndv_fpp(ndv: u64, fpp: f64) -> usize {
+    let num_bits = -8.0 * ndv as f64 / (1.0 - fpp.powf(1.0 / 8.0)).ln();
+    num_bits as usize
+}
+
 impl Sbbf {
+    /// Create a new [Sbbf] with given number of distinct values and false positive probability.
+    /// Will panic if `fpp` is greater than 1.0 or less than 0.0.

Review Comment:
   ```suggestion
       /// Will panic if `fpp` is greater than or equal to 1.0 or less than 0.0.
   ```



##########
parquet/src/file/properties.rs:
##########
@@ -272,6 +301,52 @@ pub struct WriterPropertiesBuilder {
     sorting_columns: Option<Vec<SortingColumn>>,
 }
 
+macro_rules! def_opt_field_setter {

Review Comment:
   Just an observation that these macros are potentially more verbose than the alternative. Perhaps I'm old-fashioned but I'm not a massive fan of macros aside from where absolutely necessary, as they complicate debugging and legibility



##########
parquet/src/file/properties.rs:
##########
@@ -272,6 +301,52 @@ pub struct WriterPropertiesBuilder {
     sorting_columns: Option<Vec<SortingColumn>>,
 }
 
+macro_rules! def_opt_field_setter {
+    ($field: ident, $type: ty) => {
+        paste! {
+            pub fn [<set_ $field>](&mut self, value: $type) -> &mut Self {
+                self.$field = Some(value);
+                self
+            }
+        }
+    };
+    ($field: ident, $type: ty, $min_value:expr, $max_value:expr) => {

Review Comment:
   This variant it only used in one place, and so I wonder if it needs to be a macro



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] Jimexist commented on pull request #3119: parquet bloom filter part III: add sbbf writer, remove `bloom` default feature, add reader properties

Posted by GitBox <gi...@apache.org>.

Jimexist commented on PR #3119:
URL: https://github.com/apache/arrow-rs/pull/3119#issuecomment-1323556723

   thanks @tustvold i plan to merge as is and then in subsequent pr adjust:
   1. knobs or params that users can tune
   2. usage of macro
   3. panic behavior
   
   because 2 relies on 1, and 3 relies on 2, so i'd like to clean up 3 of them together


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] XinyuZeng commented on a diff in pull request #3119: parquet bloom filter part III: add sbbf writer, remove `bloom` default feature, add reader properties

Posted by "XinyuZeng (via GitHub)" <gi...@apache.org>.

XinyuZeng commented on code in PR #3119:
URL: https://github.com/apache/arrow-rs/pull/3119#discussion_r1110973057


##########
parquet/src/bloom_filter/mod.rs:
##########
@@ -113,7 +116,48 @@ fn read_bloom_filter_header_and_length(
     ))
 }
 
+const BITSET_MIN_LENGTH: usize = 32;
+const BITSET_MAX_LENGTH: usize = 128 * 1024 * 1024;
+
+#[inline]
+fn optimal_num_of_bytes(num_bytes: usize) -> usize {
+    let num_bytes = num_bytes.min(BITSET_MAX_LENGTH);
+    let num_bytes = num_bytes.max(BITSET_MIN_LENGTH);
+    num_bytes.next_power_of_two()
+}
+
+// see http://algo2.iti.kit.edu/documents/cacheefficientbloomfilters-jea.pdf

Review Comment:
   @Jimexist Wondering why here we round num_bytes to next power of two? parquet-mr just rounds to (k * BITS_PER_BLOCK)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] alamb commented on a diff in pull request #3119: parquet bloom filter part III: add sbbf writer, remove `bloom` default feature, add reader properties

Posted by GitBox <gi...@apache.org>.

alamb commented on code in PR #3119:
URL: https://github.com/apache/arrow-rs/pull/3119#discussion_r1029246855


##########
parquet/src/file/properties.rs:
##########
@@ -515,27 +595,24 @@ impl Default for EnabledStatistics {
 ///
 /// If a field is `None`, it means that no specific value has been set for this column,
 /// so some subsequent or default value must be used.
-#[derive(Debug, Clone, PartialEq)]
+#[derive(Debug, Clone, Default, PartialEq)]
 struct ColumnProperties {
     encoding: Option<Encoding>,
     codec: Option<Compression>,
     dictionary_enabled: Option<bool>,
     statistics_enabled: Option<EnabledStatistics>,
     max_statistics_size: Option<usize>,
+    /// bloom filter enabled
+    bloom_filter_enabled: Option<bool>,
+    /// bloom filter expected number of distinct values
+    bloom_filter_ndv: Option<u64>,
+    /// bloom filter false positive probability
+    bloom_filter_fpp: Option<f64>,
+    /// bloom filter max number of bytes
+    bloom_filter_max_bytes: Option<u32>,

Review Comment:
   That was the conclusion I may have come to as well -- see https://github.com/apache/arrow-rs/issues/3138#issuecomment-1322771730



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] Jimexist commented on a diff in pull request #3119: parquet bloom filter part III: add sbbf writer, remove `bloom` default feature, add reader properties

Posted by GitBox <gi...@apache.org>.

Jimexist commented on code in PR #3119:
URL: https://github.com/apache/arrow-rs/pull/3119#discussion_r1027037998


##########
parquet/src/file/properties.rs:
##########
@@ -599,17 +681,25 @@ impl ColumnProperties {
     fn max_statistics_size(&self) -> Option<usize> {
         self.max_statistics_size
     }
+
+    def_opt_field_getter!(bloom_filter_enabled, bool);
+    def_opt_field_getter!(bloom_filter_fpp, f64);
+    def_opt_field_getter!(bloom_filter_max_bytes, u32);
+    def_opt_field_getter!(bloom_filter_ndv, u64);
 }
 
 /// Reference counted reader properties.
 pub type ReaderPropertiesPtr = Arc<ReaderProperties>;
 
+const DEFAULT_READ_BLOOM_FILTER: bool = true;

Review Comment:
   ```suggestion
   const DEFAULT_READ_BLOOM_FILTER: bool = false;
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org