You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/11/13 13:25:26 UTC

[GitHub] [arrow-rs] Jimexist opened a new pull request, #3102: WIP Add bloom filter part II

Jimexist opened a new pull request, #3102:
URL: https://github.com/apache/arrow-rs/pull/3102

   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   - related #3057
   
   # Rationale for this change
    
   <!--
   Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
   Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.
   -->
   
   # What changes are included in this PR?
   
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   # Are there any user-facing changes?
   
   
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!---
   If there are any breaking changes to public APIs, please add the `breaking change` label.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Jimexist commented on pull request #3102: parquet bloom filter part II: read sbbf bitset from row group reader, update API, and add cli demo

Posted by GitBox <gi...@apache.org>.
Jimexist commented on PR #3102:
URL: https://github.com/apache/arrow-rs/pull/3102#issuecomment-1314625409

   > Once we have read the file metadata we know the byte ranges of the column chunks, and page indexes, as well as the offsets of the bloom filter data for each column chunk. It should therefore be possible to get a fairly accurate overestimate of the length of each bloom filter, simply by process of elimination.
   
   Thanks for the suggestion. I wonder if that is future proof, e.g. if there are more data structure to be added later beside sbbf, page index, etc. would that be a problem? Thinking out loud... that this would just be ballooning the over-estimate and/or make the likelihood of needing to look at both locations before it can correctly locate which was the right one when parquet file was written.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Jimexist commented on a diff in pull request #3102: WIP Add bloom filter part II

Posted by GitBox <gi...@apache.org>.
Jimexist commented on code in PR #3102:
URL: https://github.com/apache/arrow-rs/pull/3102#discussion_r1020917157


##########
parquet/src/bloom_filter/mod.rs:
##########
@@ -79,6 +81,40 @@ fn block_check(block: &Block, hash: u32) -> bool {
 /// A split block Bloom filter
 pub struct Sbbf(Vec<Block>);
 
+// this size should not be too large to not to hit short read too early (although unlikely)
+// but also not to small to ensure cache efficiency, this is essential a "guess" of the header
+// size
+const STEP_SIZE: usize = 32;
+
+/// given an initial offset, and a chunk reader, try to read out a bloom filter header by trying
+/// one or more iterations, returns both the header and the offset after it (for bitset).
+fn chunk_read_bloom_filter_header_and_offset<R: ChunkReader>(
+    offset: usize,
+    reader: Arc<R>,
+) -> Result<(BloomFilterHeader, usize), ParquetError> {
+    // because we do not know in advance what the TCompactInputProtocol will read, we have to
+    // loop read until we can parse the header. Allocate at least 128 bytes to start with
+    let mut buffer = BytesMut::with_capacity(128);
+    let mut start = offset;
+    loop {
+        let batch = reader.get_bytes(offset as u64, STEP_SIZE)?;
+        buffer.put(batch);
+        // need to clone as we read from the very beginning of the buffer each time
+        let buffer = buffer.clone();
+        let mut buf_reader = buffer.reader();
+        // try to deserialize header
+        let mut prot = TCompactInputProtocol::new(&mut buf_reader);
+        if let Ok(h) = BloomFilterHeader::read_from_in_protocol(&mut prot) {
+            let buffer = buf_reader.into_inner();
+            let bitset_offset = start + STEP_SIZE - buffer.remaining();
+            return Ok((h, bitset_offset));
+        } else {
+            // continue to try by reading another batch
+            start += STEP_SIZE;
+        }
+    }
+}

Review Comment:
   @alamb et. al. - sanity check before i move on to implement more API and test cases



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #3102: parquet bloom filter part II: read sbbf bitset from row group reader, update API, and add cli demo

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #3102:
URL: https://github.com/apache/arrow-rs/pull/3102#discussion_r1022016903


##########
parquet/src/bin/parquet-show-bloom-filter.rs:
##########
@@ -0,0 +1,112 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Binary file to read bloom filter data from a Parquet file.
+//!
+//! # Install
+//!
+//! `parquet-show-bloom-filter` can be installed using `cargo`:
+//! ```
+//! cargo install parquet --features=cli
+//! ```
+//! After this `parquet-show-bloom-filter` should be available:
+//! ```
+//! parquet-show-bloom-filter --file-name XYZ.parquet --column id --values a
+//! ```
+//!
+//! The binary can also be built from the source code and run as follows:
+//! ```
+//! cargo run --features=cli --bin parquet-show-bloom-filter -- --file-name XYZ.parquet --column id --values a
+//! ```
+
+extern crate parquet;

Review Comment:
   ```suggestion
   ```



##########
parquet/src/bloom_filter/mod.rs:
##########
@@ -79,6 +81,33 @@ fn block_check(block: &Block, hash: u32) -> bool {
 /// A split block Bloom filter
 pub struct Sbbf(Vec<Block>);
 
+// this size should not be too large to not to hit short read too early (although unlikely)
+// but also not to small to ensure cache efficiency, this is essential a "guess" of the header
+// size. In the demo test the size is 15 bytes.
+const STEP_SIZE: usize = 16;

Review Comment:
   Based on the thrift compact protocol encoding - https://raw.githubusercontent.com/apache/thrift/master/doc/specs/thrift-compact-protocol.md.
   
   The BloomFilterHeader consists of 1 int32 and 3 enums. Enumerations are encoded as int32. Each int32 is encoded as 1 to 5 bytes.
   
   Therefore the maximum header size is 20 bytes. One possibility might be to read 20 bytes, and bail on error?
   
   



##########
parquet/src/bloom_filter/mod.rs:
##########
@@ -79,6 +81,33 @@ fn block_check(block: &Block, hash: u32) -> bool {
 /// A split block Bloom filter
 pub struct Sbbf(Vec<Block>);
 
+// this size should not be too large to not to hit short read too early (although unlikely)
+// but also not to small to ensure cache efficiency, this is essential a "guess" of the header
+// size. In the demo test the size is 15 bytes.
+const STEP_SIZE: usize = 16;
+
+/// given an initial offset, and a chunk reader, try to read out a bloom filter header by trying
+/// one or more iterations, returns both the header and the offset after it (for bitset).
+fn chunk_read_bloom_filter_header_and_offset<R: ChunkReader>(
+    offset: usize,
+    reader: Arc<R>,
+) -> Result<(BloomFilterHeader, usize), ParquetError> {
+    let mut length = STEP_SIZE;
+    loop {
+        let buffer = reader.get_bytes(offset as u64, length)?;
+        let mut buf_reader = buffer.reader();
+        let mut prot = TCompactInputProtocol::new(&mut buf_reader);
+        if let Ok(h) = BloomFilterHeader::read_from_in_protocol(&mut prot) {
+            let buffer = buf_reader.into_inner();
+            let bitset_offset = offset + length - buffer.remaining();
+            return Ok((h, bitset_offset));
+        } else {
+            // continue to try by reading another batch

Review Comment:
   Corrupt data I think will cause this to potentially iterate indefinitely, which would be bad...



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on a diff in pull request #3102: parquet bloom filter part II: read sbbf bitset from row group reader, update API, and add cli demo

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #3102:
URL: https://github.com/apache/arrow-rs/pull/3102#discussion_r1023308450


##########
parquet/src/bin/parquet-show-bloom-filter.rs:
##########
@@ -0,0 +1,110 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Binary file to read bloom filter data from a Parquet file.
+//!
+//! # Install
+//!
+//! `parquet-show-bloom-filter` can be installed using `cargo`:
+//! ```
+//! cargo install parquet --features=cli
+//! ```
+//! After this `parquet-show-bloom-filter` should be available:
+//! ```
+//! parquet-show-bloom-filter --file-name XYZ.parquet --column id --values a
+//! ```
+//!
+//! The binary can also be built from the source code and run as follows:
+//! ```
+//! cargo run --features=cli --bin parquet-show-bloom-filter -- --file-name XYZ.parquet --column id --values a
+//! ```
+
+use clap::Parser;

Review Comment:
   Maybe we could add the ability to dump bloom filters to  to https://github.com/apache/arrow-rs/blob/master/parquet/src/bin/parquet-schema.rs  rather than make a new executable
   
   I don't feel strongly however



##########
parquet/src/bloom_filter/mod.rs:
##########
@@ -125,11 +162,8 @@ impl Sbbf {
         let length: usize = header.num_bytes.try_into().map_err(|_| {

Review Comment:
   it is unfortunate that we need to do more than one read to potentially read a bloom filter (read the bloom header and then read the length). I think @tustvold noted this as a limitation in the parquet file format itself (that the file metadata only has the bloom filter starting offset, but not its length)
   
   Perhaps the reader abstraction can hide most/all of this nonsense from us



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on pull request #3102: parquet bloom filter part II: read sbbf bitset from row group reader, update API, and add cli demo

Posted by GitBox <gi...@apache.org>.
tustvold commented on PR #3102:
URL: https://github.com/apache/arrow-rs/pull/3102#issuecomment-1314627789

   I think it unlikely that a subsequent addition would omit sufficient information to properly identify its location in the file, but as you say, the result would simply be overestimating the length of the final bloom filter which is relatively harmless


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #3102: parquet bloom filter part II: read sbbf bitset from row group reader, update API, and add cli demo

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #3102:
URL: https://github.com/apache/arrow-rs/pull/3102#discussion_r1023393798


##########
parquet/src/file/reader.rs:
##########
@@ -143,6 +145,10 @@ pub trait RowGroupReader: Send + Sync {
         Ok(col_reader)
     }
 
+    #[cfg(feature = "bloom")]

Review Comment:
   I'm suggesting rather than providing a lazy API to read the bloom filter on demand, provide an API to make SerializedReader load blook filters as part of ParquetMetadata if the corresponding feature and ReadOption is enabled.
   
   This is necessary to be able to support object stores, and is generally a good idea to avoid lots of small IO reads.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #3102: parquet bloom filter part II: read sbbf bitset from row group reader, update API, and add cli demo

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #3102:
URL: https://github.com/apache/arrow-rs/pull/3102#discussion_r1023393798


##########
parquet/src/file/reader.rs:
##########
@@ -143,6 +145,10 @@ pub trait RowGroupReader: Send + Sync {
         Ok(col_reader)
     }
 
+    #[cfg(feature = "bloom")]

Review Comment:
   I'm suggesting rather than providing a lazy API to read the bloom filter on demand, provide an API to make SerializedReader load blook filters as part of ParquetMetadata if the corresponding feature and ReadOption is enabled. Similar to how we handle the page index.
   
   This is necessary to be able to support object stores, and is generally a good idea to avoid lots of small IO reads.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold merged pull request #3102: parquet bloom filter part II: read sbbf bitset from row group reader, update API, and add cli demo

Posted by GitBox <gi...@apache.org>.
tustvold merged PR #3102:
URL: https://github.com/apache/arrow-rs/pull/3102


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Jimexist commented on a diff in pull request #3102: parquet bloom filter part II: read sbbf bitset from row group reader, update API, and add cli demo

Posted by GitBox <gi...@apache.org>.
Jimexist commented on code in PR #3102:
URL: https://github.com/apache/arrow-rs/pull/3102#discussion_r1023385471


##########
parquet/src/bloom_filter/mod.rs:
##########
@@ -125,11 +162,8 @@ impl Sbbf {
         let length: usize = header.num_bytes.try_into().map_err(|_| {

Review Comment:
   you can take a look at https://github.com/apache/arrow-rs/pull/3119/files#diff-3b307348aabe465890fa39973e9fda0243bd2344cb7cb9cdf02ac2d39521d7caR232-R236 which should show how it works - similar to writing column offsets and indices



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on pull request #3102: parquet bloom filter part II: read sbbf bitset from row group reader, update API, and add cli demo

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #3102:
URL: https://github.com/apache/arrow-rs/pull/3102#issuecomment-1317465724

   🚄  love to see this -- thank you @tustvold  and @Jimexist 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #3102: parquet bloom filter part II: read sbbf bitset from row group reader, update API, and add cli demo

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #3102:
URL: https://github.com/apache/arrow-rs/pull/3102#discussion_r1023208106


##########
parquet/src/file/reader.rs:
##########
@@ -143,6 +145,10 @@ pub trait RowGroupReader: Send + Sync {
         Ok(col_reader)
     }
 
+    #[cfg(feature = "bloom")]

Review Comment:
   What do you think about handling this in the same eager way that we handle page indexes, namely add an option to `ReadOptions` to enable reading bloom filters, and read this data in `SerializedFileReader`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] ursabot commented on pull request #3102: parquet bloom filter part II: read sbbf bitset from row group reader, update API, and add cli demo

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #3102:
URL: https://github.com/apache/arrow-rs/pull/3102#issuecomment-1316245597

   Benchmark runs are scheduled for baseline = c95eb4c80a532653bc91e04e78814f1282c8d005 and contender = 73d66d837c20e3b80a77fdad5018f7872de4ef9d. 73d66d837c20e3b80a77fdad5018f7872de4ef9d is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/b388fb21598241b4adff106881f89e63...57443d83ca584d3f90a5ad490821b0bf/)
   [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on test-mac-arm] [test-mac-arm](https://conbench.ursa.dev/compare/runs/df31b31f0e964335bf5c994edec4829a...fa12d6b264964f9e9eb122912fb8f7af/)
   [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/1fccdd2d6f7443bc963190b8dcbce078...aefa71b9584c43f2b530ea85ebfcbf70/)
   [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/ad83a5afe04144fd8d8ff22db8eed7c9...513e29fe71b345baa6f56151edbeff8b/)
   Buildkite builds:
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Jimexist commented on a diff in pull request #3102: parquet bloom filter part II: read sbbf bitset from row group reader, update API, and add cli demo

Posted by GitBox <gi...@apache.org>.
Jimexist commented on code in PR #3102:
URL: https://github.com/apache/arrow-rs/pull/3102#discussion_r1023386158


##########
parquet/src/bin/parquet-show-bloom-filter.rs:
##########
@@ -0,0 +1,110 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Binary file to read bloom filter data from a Parquet file.
+//!
+//! # Install
+//!
+//! `parquet-show-bloom-filter` can be installed using `cargo`:
+//! ```
+//! cargo install parquet --features=cli
+//! ```
+//! After this `parquet-show-bloom-filter` should be available:
+//! ```
+//! parquet-show-bloom-filter --file-name XYZ.parquet --column id --values a
+//! ```
+//!
+//! The binary can also be built from the source code and run as follows:
+//! ```
+//! cargo run --features=cli --bin parquet-show-bloom-filter -- --file-name XYZ.parquet --column id --values a
+//! ```
+
+use clap::Parser;

Review Comment:
   I am coding this similar to https://github.com/apache/parquet-mr/blob/master/parquet-cli/src/main/java/org/apache/parquet/cli/commands/ShowBloomFilterCommand.java so I'd like to keep it this way for now. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Jimexist commented on a diff in pull request #3102: parquet bloom filter part II: read sbbf bitset from row group reader, update API, and add cli demo

Posted by GitBox <gi...@apache.org>.
Jimexist commented on code in PR #3102:
URL: https://github.com/apache/arrow-rs/pull/3102#discussion_r1022220568


##########
parquet/src/bloom_filter/mod.rs:
##########
@@ -79,6 +81,33 @@ fn block_check(block: &Block, hash: u32) -> bool {
 /// A split block Bloom filter
 pub struct Sbbf(Vec<Block>);
 
+// this size should not be too large to not to hit short read too early (although unlikely)
+// but also not to small to ensure cache efficiency, this is essential a "guess" of the header
+// size. In the demo test the size is 15 bytes.
+const STEP_SIZE: usize = 16;

Review Comment:
   that's a good idea, let me change it to 20 byte read and bail, before switching to more complicated "guessing" from the page indices gaps



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on pull request #3102: parquet bloom filter part II: read sbbf bitset from row group reader, update API, and add cli demo

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #3102:
URL: https://github.com/apache/arrow-rs/pull/3102#issuecomment-1313764382

   Thank you @Jimexist  -- I have this PR on my review list and hopefully will get to it today. The backlog in DataFusion is substantial, however, so it might not be until tomorrow


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Jimexist commented on a diff in pull request #3102: parquet bloom filter part II: read sbbf bitset from row group reader, update API, and add cli demo

Posted by GitBox <gi...@apache.org>.
Jimexist commented on code in PR #3102:
URL: https://github.com/apache/arrow-rs/pull/3102#discussion_r1023389113


##########
parquet/src/file/reader.rs:
##########
@@ -143,6 +145,10 @@ pub trait RowGroupReader: Send + Sync {
         Ok(col_reader)
     }
 
+    #[cfg(feature = "bloom")]

Review Comment:
   @tustvold are you suggesting dropping the feature gate altogether and enable it by default? I added the feature gate trying to reduce binary size but then if the feature is very likely to be used there's no need for this gate any more.



##########
parquet/src/file/reader.rs:
##########
@@ -143,6 +145,10 @@ pub trait RowGroupReader: Send + Sync {
         Ok(col_reader)
     }
 
+    #[cfg(feature = "bloom")]

Review Comment:
   @tustvold are you also suggesting dropping the feature gate altogether and enable it by default? I added the feature gate trying to reduce binary size but then if the feature is very likely to be used there's no need for this gate any more.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Jimexist commented on a diff in pull request #3102: parquet bloom filter part II: read sbbf bitset from row group reader, update API, and add cli demo

Posted by GitBox <gi...@apache.org>.
Jimexist commented on code in PR #3102:
URL: https://github.com/apache/arrow-rs/pull/3102#discussion_r1023393988


##########
parquet/src/file/reader.rs:
##########
@@ -143,6 +145,10 @@ pub trait RowGroupReader: Send + Sync {
         Ok(col_reader)
     }
 
+    #[cfg(feature = "bloom")]

Review Comment:
   also thanks for the suggestion, I agree with this direction, however I have:
   - #3119 
   coming up, so I'd like to maybe merge this as is and quickly follow up on this after it is merged. do you think that works?



##########
parquet/src/file/reader.rs:
##########
@@ -143,6 +145,10 @@ pub trait RowGroupReader: Send + Sync {
         Ok(col_reader)
     }
 
+    #[cfg(feature = "bloom")]

Review Comment:
   also thanks for the suggestion, I agree with this direction, however I have:
   - #3119 
   
   coming up, so I'd like to maybe merge this as is and quickly follow up on this after it is merged. do you think that works?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org