Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/19 17:44:06 UTC

[GitHub] [arrow-datafusion] alamb opened a new issue #363: Reusable "row group pruning" logic

alamb opened a new issue #363:
URL: https://github.com/apache/arrow-datafusion/issues/363


   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   
   DataFusion contains logic (originally contributed by @yordan-pavlov in https://github.com/apache/arrow/pull/9064 🎉 ) to perform Row Group Pruning, which skips scanning of entire row groups within a parquet file, based on pushed down predicates (source link in arrow-datafusion: [parquet.rs](https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/physical_plan/parquet.rs)).
   
   The algorithm behind the Row Group Pruning implementation is general: it can be applied to any storage system that maintains min/max statistics for different sets of files / chunks of the data and would like to quickly rule out chunks which cannot match a predicate.
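
As a rough illustration of the idea (not the actual DataFusion code), the sketch below checks simple single-column predicates against per-chunk min/max statistics. The names `MinMax`, `Predicate`, `might_match`, and `chunks_to_scan` are hypothetical and exist only for this example; it also assumes a single `i64` column for brevity.

```rust
/// Min/max statistics for one column within one chunk.
/// Hypothetical type, for illustration only.
#[derive(Debug, Clone, Copy)]
struct MinMax {
    min: i64,
    max: i64,
}

/// A simple predicate on that column (hypothetical).
#[derive(Debug, Clone, Copy)]
enum Predicate {
    Eq(i64),
    Gt(i64),
    Lt(i64),
}

/// Returns true if some row in the chunk *might* satisfy the predicate.
/// A result of false means no row can match, so the chunk can be skipped.
fn might_match(stats: MinMax, pred: Predicate) -> bool {
    match pred {
        // `col = v` can only match if v lies within [min, max].
        Predicate::Eq(v) => stats.min <= v && v <= stats.max,
        // `col > v` can only match if the largest value exceeds v.
        Predicate::Gt(v) => stats.max > v,
        // `col < v` can only match if the smallest value is below v.
        Predicate::Lt(v) => stats.min < v,
    }
}

/// Returns the indices of the chunks that still need to be scanned.
fn chunks_to_scan(chunks: &[MinMax], pred: Predicate) -> Vec<usize> {
    chunks
        .iter()
        .enumerate()
        .filter(|(_, stats)| might_match(**stats, pred))
        .map(|(index, _)| index)
        .collect()
}
```

For example, with three chunks covering `[0, 9]`, `[10, 19]`, and `[20, 29]`, the predicate `Predicate::Gt(15)` rules out the first chunk and leaves the chunks at indices 1 and 2 to be scanned. Note that the check is conservative: a chunk that is kept may still turn out to contain no matching rows.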
   
   We would like to reuse the row group pruning logic from DataFusion rather than writing our own. To that end, we want to make this logic easier to reuse, both by other parts of DataFusion (e.g. pruning parquet *files* rather than just row groups) and by downstream projects. We also hope to benefit ourselves as the community works to improve this code.
   
   In addition, there are other use cases, such as the one mentioned by @returnString, where you have a set of parquet files in some object store along with min/max statistics for them, and you could skip entire files based on those statistics alone.
   
   **Describe the solution you'd like**
   1. Refactor what is currently called `RowGroupPredicateBuilder` into something more generic related to `Pruning`
   2. Rework the implementation so it is generic over a `Statistics` trait, so that the predicates can be evaluated against any source of statistics (not just the Parquet `RowGroupMetadata`)
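
One possible shape for step 2 is sketched below. This is only a sketch under stated assumptions, not a proposed API: the trait `PruningStatistics`, its methods, the `ColumnPredicate` and `Op` types, the `prune` function, and the `FileStats` implementor are all hypothetical names invented for this example, and values are restricted to `i64` to keep it short.

```rust
use std::collections::HashMap;

/// Hypothetical trait: any source of per-column min/max statistics for a
/// set of "containers" (row groups, files, chunks, ...).
trait PruningStatistics {
    fn num_containers(&self) -> usize;
    /// Minimum value of `column` in `container`, if known.
    fn min_value(&self, column: &str, container: usize) -> Option<i64>;
    /// Maximum value of `column` in `container`, if known.
    fn max_value(&self, column: &str, container: usize) -> Option<i64>;
}

/// Comparison operators supported by this sketch.
enum Op {
    Lt,
    Gt,
}

/// A predicate of the form `column <op> value` (hypothetical).
struct ColumnPredicate {
    column: String,
    op: Op,
    value: i64,
}

/// Returns one bool per container: true means the container might contain
/// matching rows and must be scanned; false means it can be skipped.
fn prune<S: PruningStatistics>(stats: &S, pred: &ColumnPredicate) -> Vec<bool> {
    (0..stats.num_containers())
        .map(|i| {
            let might_match = match pred.op {
                Op::Lt => stats.min_value(&pred.column, i).map(|min| min < pred.value),
                Op::Gt => stats.max_value(&pred.column, i).map(|max| max > pred.value),
            };
            // Missing statistics: conservatively keep the container.
            might_match.unwrap_or(true)
        })
        .collect()
}

/// Example implementor: per-file (min, max) statistics kept outside the
/// files themselves, e.g. in a catalog next to an object store.
struct FileStats {
    files: Vec<HashMap<String, (i64, i64)>>,
}

impl PruningStatistics for FileStats {
    fn num_containers(&self) -> usize {
        self.files.len()
    }
    fn min_value(&self, column: &str, container: usize) -> Option<i64> {
        self.files.get(container)?.get(column).map(|(min, _)| *min)
    }
    fn max_value(&self, column: &str, container: usize) -> Option<i64> {
        self.files.get(container)?.get(column).map(|(_, max)| *max)
    }
}
```

Because `prune` depends only on the trait, the same function could be applied to parquet row group metadata, to whole files, or to a downstream project's own chunk catalog, provided each supplies an implementation of the trait.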
   
   **Additional context**
   
   You can see more about the use case in the IOx ticket https://github.com/influxdata/influxdb_iox/issues/736 and the [design document](https://docs.google.com/document/d/1ulK-jHxYEMTDQT77u0GzCMGRFC5MYQcegztkcnPHYnM/edit#)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb closed issue #363: Reusable "row group pruning" logic

Posted by GitBox <gi...@apache.org>.
alamb closed issue #363:
URL: https://github.com/apache/arrow-datafusion/issues/363


   

