You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/20 15:54:12 UTC

[GitHub] [arrow-datafusion] alamb opened a new pull request #370: Return Vec from PredicateBuilder rather than an `Fn`

alamb opened a new pull request #370:
URL: https://github.com/apache/arrow-datafusion/pull/370


   # Which issue does this PR close?
   
   re https://github.com/apache/arrow-datafusion/issues/363 (leaving as draft until #365 is in)
   
    # Rationale for this change
   As explained on #363 the high level idea goal is to make the parquet row group pruning logic generic to any types of min/max statistics (not just parquet metadata)
   
   
   # What changes are included in this PR?
   1. Changes the *output* of PruningPredicateBuilder to be a `bool` for each of the input statistics
   2. Moves the parquet specific functionality (aka the function signature required for the `ParquetFileReader`) into the parquet.rs module
   3. Returns errors from `build_pruning_predicate` rather than silently ignoring them (though they are still silently ignored in parquet.rs as before)
   4. Improves some docstrings
   
   
   # Are there any user-facing changes?
   No change in parquet functionality is intended in this PR
   
   
   
   # Sequence:
   
   My next PR will change the *input* of the `PruningPredicateBuilder` to be generic
   
   I am trying to do this in a few small PRs to reduce review burden; Here is how I plan that they will connect together:
   
   Planned changes:
   - [x] Refactor code into a new module (https://github.com/apache/arrow-datafusion/pull/365)
   - [x] Return bool rather than parquet specific output (this PR)
   - [ ] Add `PruningStatstics` Trait (forthcoming PR)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #370: Return Vec from PredicateBuilder rather than an `Fn`

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #370:
URL: https://github.com/apache/arrow-datafusion/pull/370#issuecomment-846179431


   (BTW @yordan-pavlov the more I work with this code the cooler I think it (and its algorithm) is. Thank you for contributing it in the first place)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on a change in pull request #370: Return Vec from PredicateBuilder rather than an `Fn`

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #370:
URL: https://github.com/apache/arrow-datafusion/pull/370#discussion_r636236186



##########
File path: datafusion/src/physical_optimizer/pruning.rs
##########
@@ -85,24 +90,24 @@ impl PruningPredicateBuilder {
         })
     }
 
-    /// Generate a predicate function used to filter based on
-    /// statistics
+    /// For each set of statistics, evalates the predicate in this
+    /// builder and returns a `bool` with the following meaning for a
+    /// container with those statistics:
+    ///
+    /// `true`: The container MAY contain rows that match the predicate
     ///
-    /// This function takes a slice of statistics as parameter, so
-    /// that DataFusion's physical expressions can be executed once
-    /// against a single RecordBatch, containing statistics arrays, on
-    /// which the physical predicate expression is executed to
-    /// generate a row group filter array.
+    /// `false`: The container definitely does NOT contain rows that match the predicate
     ///
-    /// The generated filter function is then used in the returned
-    /// closure to filter row groups. NOTE this is parquet specific at the moment
+    /// Note this function takes a slice of statistics as a parameter
+    /// to amortize the cost of the evaluation of the predicate
+    /// against a single record batch.
     pub fn build_pruning_predicate(
         &self,
-        row_group_metadata: &[RowGroupMetaData],
-    ) -> Box<dyn Fn(&RowGroupMetaData, usize) -> bool> {
+        statistics: &[RowGroupMetaData],

Review comment:
       The main change in this PR is this function's signature. The rest of the changes are fallout from doing that

##########
File path: datafusion/src/physical_plan/parquet.rs
##########
@@ -454,6 +457,22 @@ fn send_result(
     Ok(())
 }
 
+fn build_row_group_predicate(
+    predicate_builder: &PruningPredicateBuilder,
+    row_group_metadata: &[RowGroupMetaData],
+) -> Box<dyn Fn(&RowGroupMetaData, usize) -> bool> {
+    let predicate_values = predicate_builder.build_pruning_predicate(row_group_metadata);
+
+    let predicate_values = match predicate_values {
+        Ok(values) => values,
+        // stats filter array could not be built
+        // return a closure which will not filter out any row groups
+        _ => return Box::new(|_r, _i| true),

Review comment:
       The existing code also (silently) ignores error so we continue that tradition in this PR




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb merged pull request #370: Return Vec from PredicateBuilder rather than an `Fn`

Posted by GitBox <gi...@apache.org>.
alamb merged pull request #370:
URL: https://github.com/apache/arrow-datafusion/pull/370


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] yordan-pavlov commented on pull request #370: Return Vec from PredicateBuilder rather than an `Fn`

Posted by GitBox <gi...@apache.org>.
yordan-pavlov commented on pull request #370:
URL: https://github.com/apache/arrow-datafusion/pull/370#issuecomment-846143281


   @alamb thank you for making this more generic so that it can be useful in more cases; I do like how the error case (where a `Box::new(|_r, _i| true)`) is returned is now handled in a single place); overall looks like a good change.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] codecov-commenter commented on pull request #370: Return Vec from PredicateBuilder rather than an `Fn`

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on pull request #370:
URL: https://github.com/apache/arrow-datafusion/pull/370#issuecomment-846142593


   # [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/370?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#370](https://codecov.io/gh/apache/arrow-datafusion/pull/370?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (1db9363) into [master](https://codecov.io/gh/apache/arrow-datafusion/commit/db4f098d38993b96ce1134c4bc7bf5c6579509cf?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (db4f098) will **decrease** coverage by `0.00%`.
   > The diff coverage is `52.63%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow-datafusion/pull/370/graphs/tree.svg?width=650&height=150&src=pr&token=JXwWBKD3D9&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-datafusion/pull/370?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master     #370      +/-   ##
   ==========================================
   - Coverage   74.94%   74.93%   -0.01%     
   ==========================================
     Files         146      146              
     Lines       24314    24318       +4     
   ==========================================
   + Hits        18221    18223       +2     
   - Misses       6093     6095       +2     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-datafusion/pull/370?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [datafusion/src/physical\_optimizer/pruning.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/370/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfb3B0aW1pemVyL3BydW5pbmcucnM=) | `89.73% <38.46%> (-0.88%)` | :arrow_down: |
   | [datafusion/src/physical\_plan/parquet.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/370/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi9wYXJxdWV0LnJz) | `81.76% <83.33%> (+0.49%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/370?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/370?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [db4f098...1db9363](https://codecov.io/gh/apache/arrow-datafusion/pull/370?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #370: Return Vec from PredicateBuilder rather than an `Fn`

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #370:
URL: https://github.com/apache/arrow-datafusion/pull/370#issuecomment-846115376


   fyi @yordan-pavlov and @returnString 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] jorgecarleitao commented on a change in pull request #370: Return Vec from PredicateBuilder rather than an `Fn`

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #370:
URL: https://github.com/apache/arrow-datafusion/pull/370#discussion_r637092118



##########
File path: datafusion/src/physical_optimizer/pruning.rs
##########
@@ -85,24 +90,24 @@ impl PruningPredicateBuilder {
         })
     }
 
-    /// Generate a predicate function used to filter based on
-    /// statistics
+    /// For each set of statistics, evalates the predicate in this
+    /// builder and returns a `bool` with the following meaning for a
+    /// container with those statistics:
+    ///
+    /// `true`: The container MAY contain rows that match the predicate
     ///
-    /// This function takes a slice of statistics as parameter, so
-    /// that DataFusion's physical expressions can be executed once
-    /// against a single RecordBatch, containing statistics arrays, on
-    /// which the physical predicate expression is executed to
-    /// generate a row group filter array.
+    /// `false`: The container definitely does NOT contain rows that match the predicate

Review comment:
       ```suggestion
       /// `false`: The container MUST NOT contain rows that match the predicate
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] yordan-pavlov edited a comment on pull request #370: Return Vec from PredicateBuilder rather than an `Fn`

Posted by GitBox <gi...@apache.org>.
yordan-pavlov edited a comment on pull request #370:
URL: https://github.com/apache/arrow-datafusion/pull/370#issuecomment-846143281


   @alamb thank you for making this more generic so that it can be useful in more cases; I do like how the error case (where a `Box::new(|_r, _i| true)` is returned) is now handled in a single place); overall looks like a good change.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org