
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/11/05 10:50:12 UTC

[GitHub] [arrow-rs] alamb opened a new issue, #3023: Support bloom filter reading and writing for parquet

alamb opened a new issue, #3023:
URL: https://github.com/apache/arrow-rs/issues/3023

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   
   There are use cases where one wants to search a large amount of parquet data for a relatively small number of rows. For example, if you have distributed tracing data stored as parquet files and want to find the data for a particular trace.
   
   In general, the pattern is "needle in a haystack type query" -- specifically a very selective predicate (passes on only a few rows) on high cardinality (many distinct values) columns. 
   
   The Rust parquet crate has fairly advanced support for [row group pruning](https://docs.rs/parquet/26.0.0/parquet/arrow/arrow_reader/struct.ArrowReaderBuilder.html#method.with_row_selection), [page level indexes](https://docs.rs/parquet/26.0.0/parquet/arrow/arrow_reader/struct.ArrowReaderBuilder.html#method.with_row_selection), and [filter pushdown](https://docs.rs/parquet/26.0.0/parquet/arrow/arrow_reader/struct.ArrowReaderBuilder.html#method.with_row_filter). These techniques are quite effective when data is sorted and large contiguous ranges of rows can be skipped.
   
   However, needle in the haystack queries still often require substantial amounts of CPU and IO.
   
   One challenge is that typical high cardinality columns such as ids often (by design) span the entire range of values of the data type.
   
   For example, even in the best case, when the data is "optimally sorted" by id within a row group, min/max statistics cannot help skip row groups or pages. Instead, the entire column must be decoded to search for a particular value.
   
   ```
   ┌──────────────────────────┐                WHERE                 
   │            id            │       ┌─────── id = 54322342343      
   ├──────────────────────────┤       │                              
   │       00000000000        │       │                              
   ├──────────────────────────┤       │    Selective predicate on a  
   │       00054542543        │       │    high cardinality column   
   ├──────────────────────────┤       │                              
   │           ...            │       │                              
   ├──────────────────────────┤       │                              
   │        ??????????        │◀──────┘                              
   ├──────────────────────────┤          Can not rule out ranges     
   │           ...            │            using min/max values      
   ├──────────────────────────┤                                      
   │       99545435432        │                                      
   ├──────────────────────────┤                                      
   │       99999999999        │                                      
   └──────────────────────────┘                                      
                                                                     
     High cardinality column:                                        
       many distinct values                                          
             (sorted)                                                
                                                                     
   ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐                                           
      min: 00000000000                                               
   │  max: 99999999999   │                                           
                                                                     
   │       Metadata      │                                           
    ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─                                                                                     
   ```
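   
   As a rough sketch of why the statistics do not help here (the `MinMax` type below is hypothetical and is not the parquet crate's API; the ids are the ones from the diagram above):
   
   ```rust
   /// Min/max statistics for one column chunk.
   /// Hypothetical type for illustration only, not the parquet crate's API.
   #[derive(Debug, Clone, Copy)]
   pub struct MinMax {
       pub min: u64,
       pub max: u64,
   }
   
   impl MinMax {
       /// Computes the statistics a writer would store for `values`.
       /// Returns `None` for an empty column chunk.
       pub fn from_values(values: &[u64]) -> Option<Self> {
           let min = *values.iter().min()?;
           let max = *values.iter().max()?;
           Some(MinMax { min, max })
       }
   
       /// Returns `true` only when `id = needle` cannot match any row,
       /// i.e. when the needle lies outside the `[min, max]` range.
       pub fn can_skip(&self, needle: u64) -> bool {
           needle < self.min || needle > self.max
       }
   }
   ```
   
   With the values from the diagram, `can_skip(54322342343)` returns `false` even though no row has that id, so the row group still has to be decoded.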
   
   **Describe the solution you'd like**
   The parquet file format has support for bloom filters: https://github.com/apache/parquet-format/blob/master/BloomFilter.md
   
   A bloom filter is a space efficient structure that allows quickly determining that a value is not in a set. So for a parquet file with bloom filters for `id` in the metadata, a reader can quickly rule out row groups that cannot contain the requested `id`.
   
   I would like the parquet crate to:
   1. support optionally writing Parquet bloom filters into the metadata
   2. support using Parquet bloom filters during reads to make "needle in the haystack" type queries go quickly by skipping entire row groups if the item is not in the bloom filter.
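   
   To make the two steps concrete, here is a minimal sketch using only the standard library. It is a classic bloom filter, not the split block layout from the Parquet spec, and `BloomFilter`, `RowGroup` and `row_groups_to_scan` are hypothetical names chosen for illustration, not the parquet crate's API. The sizing (16 bits per value, 4 hash functions) is an arbitrary assumption.
   
   ```rust
   use std::collections::hash_map::DefaultHasher;
   use std::hash::{Hash, Hasher};
   
   /// A classic bloom filter (illustrative; not the Parquet on-disk layout).
   pub struct BloomFilter {
       bits: Vec<bool>,
       num_hashes: u64,
   }
   
   impl BloomFilter {
       pub fn new(num_bits: usize, num_hashes: u64) -> Self {
           assert!(num_bits > 0 && num_hashes > 0);
           BloomFilter { bits: vec![false; num_bits], num_hashes }
       }
   
       /// Bit position for `value` under the `seed`-th hash function.
       fn position<T: Hash>(&self, value: &T, seed: u64) -> usize {
           let mut hasher = DefaultHasher::new();
           seed.hash(&mut hasher);
           value.hash(&mut hasher);
           (hasher.finish() % self.bits.len() as u64) as usize
       }
   
       pub fn insert<T: Hash>(&mut self, value: &T) {
           for seed in 0..self.num_hashes {
               let pos = self.position(value, seed);
               self.bits[pos] = true;
           }
       }
   
       /// `false` means definitely absent; `true` means possibly present.
       pub fn might_contain<T: Hash>(&self, value: &T) -> bool {
           (0..self.num_hashes).all(|seed| self.bits[self.position(value, seed)])
       }
   }
   
   /// Hypothetical row group: its ids plus a filter built at write time (step 1).
   pub struct RowGroup {
       pub ids: Vec<u64>,
       pub filter: BloomFilter,
   }
   
   impl RowGroup {
       pub fn new(ids: Vec<u64>) -> Self {
           // Assumed sizing: 16 bits per value, 4 hash functions.
           let mut filter = BloomFilter::new(ids.len().max(1) * 16, 4);
           for id in &ids {
               filter.insert(id);
           }
           RowGroup { ids, filter }
       }
   }
   
   /// Step 2: indexes of the row groups that still must be decoded for `id = needle`.
   pub fn row_groups_to_scan(groups: &[RowGroup], needle: u64) -> Vec<usize> {
       groups
           .iter()
           .enumerate()
           .filter(|(_, group)| group.filter.might_contain(&needle))
           .map(|(index, _)| index)
           .collect()
   }
   ```
   
   A row group that contains the needle is always returned (no false negatives). A row group that does not contain it is usually skipped, but may occasionally be returned because of a false positive, in which case it is decoded as before.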
   
   
   The format support is here
   https://docs.rs/parquet/latest/parquet/format/struct.BloomFilterHeader.html?search=Bloom
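   
   For orientation, the split block layout described in BloomFilter.md (linked above) can be sketched roughly as follows. This assumes the salt constants and block arithmetic from that spec's pseudocode; `SplitBlockFilter` is a hypothetical type, not the crate's API. The 64-bit hash is taken as an input because the spec's hash function (xxHash) is not in the standard library.
   
   ```rust
   /// Salt constants as listed in the Parquet BloomFilter.md spec.
   const SALT: [u32; 8] = [
       0x47b6137b, 0x44974d91, 0x8824ad5b, 0xa2b7289d,
       0x705495c7, 0x2df1424b, 0x9efc4947, 0x5c6bfb31,
   ];
   
   /// One 256-bit block: eight 32-bit words.
   type Block = [u32; 8];
   
   /// Builds a mask with exactly one bit set in each of the eight words,
   /// chosen from the lower 32 bits of the hash.
   fn mask(x: u32) -> Block {
       let mut result = [0u32; 8];
       for (word, salt) in result.iter_mut().zip(SALT.iter()) {
           let y = x.wrapping_mul(*salt);
           *word = 1u32 << (y >> 27);
       }
       result
   }
   
   /// Hypothetical in-memory split block bloom filter (illustration only).
   pub struct SplitBlockFilter {
       blocks: Vec<Block>,
   }
   
   impl SplitBlockFilter {
       pub fn new(num_blocks: usize) -> Self {
           assert!(num_blocks > 0 && num_blocks <= u32::MAX as usize);
           SplitBlockFilter { blocks: vec![[0u32; 8]; num_blocks] }
       }
   
       /// The upper 32 bits of the hash select the block.
       fn block_index(&self, hash: u64) -> usize {
           (((hash >> 32) * self.blocks.len() as u64) >> 32) as usize
       }
   
       /// Sets the masked bits in the selected block.
       pub fn insert_hash(&mut self, hash: u64) {
           let index = self.block_index(hash);
           let m = mask(hash as u32);
           for (word, bit) in self.blocks[index].iter_mut().zip(m.iter()) {
               *word |= *bit;
           }
       }
   
       /// `false` means definitely absent; `true` means possibly present.
       pub fn check_hash(&self, hash: u64) -> bool {
           let index = self.block_index(hash);
           let m = mask(hash as u32);
           self.blocks[index]
               .iter()
               .zip(m.iter())
               .all(|(word, bit)| word & bit != 0)
       }
   }
   ```
   
   Each lookup touches a single 256-bit block, which is what keeps a check cheap once the filter for a column chunk has been loaded.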
   
   
   **Describe alternatives you've considered**
   
   **Additional context**
   There is some code for parquet bloom filters in https://github.com/jorgecarleitao/parquet2/tree/main/src/bloom_filter from @jorgecarleitao
   
   Perhaps we can use/repurpose some of that.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] tustvold closed issue #3023: Support bloom filter reading and writing for parquet

Posted by GitBox <gi...@apache.org>.

tustvold closed issue #3023: Support bloom filter reading and writing for parquet
URL: https://github.com/apache/arrow-rs/issues/3023



[GitHub] [arrow-rs] Jimexist commented on issue #3023: Support bloom filter reading and writing for parquet

Posted by GitBox <gi...@apache.org>.

Jimexist commented on issue #3023:
URL: https://github.com/apache/arrow-rs/issues/3023#issuecomment-1325983591

   @tustvold and @alamb I might not have the bandwidth to dig into how parquet integrates with arrow, so I'd maybe defer this to you or anyone else to follow up on the final piece:
   - #3167 



[GitHub] [arrow-rs] alamb commented on issue #3023: Support bloom filter reading and writing for parquet

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #3023:
URL: https://github.com/apache/arrow-rs/issues/3023#issuecomment-1312817102

   (in case other people have missed it, @Jimexist  has begun work on this feature ❤️ )



[GitHub] [arrow-rs] alamb commented on issue #3023: Support bloom filter reading and writing for parquet

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #3023:
URL: https://github.com/apache/arrow-rs/issues/3023#issuecomment-1328233651

   I wonder how much of this feature is left to complete? I think the parquet reading/writing support may be done -- the next phase will be to add support to query engines like DataFusion to take advantage of these filters. 
   
   I plan to write up a ticket in DataFusion over the course of the coming week to do so.



[GitHub] [arrow-rs] alamb commented on issue #3023: Support bloom filter reading and writing for parquet

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #3023:
URL: https://github.com/apache/arrow-rs/issues/3023#issuecomment-1304496689

   The influxdb_iox project is very interested in this feature and we would love to collaborate with the community to make it happen -- I at least can offer code and design reviews, and blogging about it :)



[GitHub] [arrow-rs] aierui commented on issue #3023: Support bloom filter reading and writing for parquet

Posted by GitBox <gi...@apache.org>.

aierui commented on issue #3023:
URL: https://github.com/apache/arrow-rs/issues/3023#issuecomment-1304583861

   very cool❤️



[GitHub] [arrow-rs] Jimexist commented on issue #3023: Support bloom filter reading and writing for parquet

Posted by GitBox <gi...@apache.org>.

Jimexist commented on issue #3023:
URL: https://github.com/apache/arrow-rs/issues/3023#issuecomment-1312723411

   a note to myself for [this comment][1]
   
   [1]: https://github.com/apache/arrow-rs/pull/3057#discussion_r1020882314
   
   cc @alamb 

