You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/06/03 20:56:21 UTC

[GitHub] [arrow-datafusion] alamb opened a new pull request #500: Add support for boolean columns in pruning logic d10273a

alamb opened a new pull request #500:
URL: https://github.com/apache/arrow-datafusion/pull/500


   Closes https://github.com/apache/arrow-datafusion/issues/490
   
   This PR adds support for pruning of boolean predicates such as `flag_col`, and `not flag_col` so that they can be used to prune row groups from parquet files and other predicates
   
   It does *not* add code to handle `flag_col = true` and `flag_col != false` (which currently error and continue to do so) as those are simplified in the ConstantEvaluation pass. 
   
   This ended up being a larger change than I wanted because the logic to create `col_min` and `col_max` references was intertwined in `PruningExpressionBuilder`
   
    # Rationale for this change
   See https://github.com/apache/arrow-datafusion/issues/490
   
   
   # What changes are included in this PR?
   
   Major changes:
   1. Encapsulate `stat_column_req `into a new `RequiredStatColumns` struct
   2. Move expression reference and rewriting logic to `StatisticsColumns`
   3. Add rules for boolean columns
   
   # Are there any user-facing changes?
   Additional predicates can be used to prune
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on a change in pull request #500: Add support for boolean columns in pruning logic

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #500:
URL: https://github.com/apache/arrow-datafusion/pull/500#discussion_r645121166



##########
File path: datafusion/src/physical_optimizer/pruning.rs
##########
@@ -324,42 +417,20 @@ impl<'a> PruningExpressionBuilder<'a> {
         self.scalar_expr
     }
 
-    fn is_stat_column_missing(&self, statistics_type: StatisticsType) -> bool {

Review comment:
       This logic was just refactored into `RequiredStatColumns` so I could reuse it




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] codecov-commenter edited a comment on pull request #500: Add support for boolean columns in pruning logic

Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #500:
URL: https://github.com/apache/arrow-datafusion/pull/500#issuecomment-854184385


   # [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/500?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#500](https://codecov.io/gh/apache/arrow-datafusion/pull/500?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (5ceb541) into [master](https://codecov.io/gh/apache/arrow-datafusion/commit/28b0dad82be302fea240bb6b177ff60abbd0f090?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (28b0dad) will **increase** coverage by `0.09%`.
   > The diff coverage is `93.75%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow-datafusion/pull/500/graphs/tree.svg?width=650&height=150&src=pr&token=JXwWBKD3D9&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-datafusion/pull/500?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master     #500      +/-   ##
   ==========================================
   + Coverage   75.92%   76.02%   +0.09%     
   ==========================================
     Files         154      154              
     Lines       26195    26421     +226     
   ==========================================
   + Hits        19889    20087     +198     
   - Misses       6306     6334      +28     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-datafusion/pull/500?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [datafusion/src/physical\_optimizer/pruning.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/500/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfb3B0aW1pemVyL3BydW5pbmcucnM=) | `91.52% <93.75%> (+1.44%)` | :arrow_up: |
   | [datafusion/src/optimizer/utils.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/500/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvb3B0aW1pemVyL3V0aWxzLnJz) | `48.22% <0.00%> (-1.78%)` | :arrow_down: |
   | [...ta/rust/core/src/serde/physical\_plan/from\_proto.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/500/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YmFsbGlzdGEvcnVzdC9jb3JlL3NyYy9zZXJkZS9waHlzaWNhbF9wbGFuL2Zyb21fcHJvdG8ucnM=) | `38.79% <0.00%> (-0.85%)` | :arrow_down: |
   | [...sta/rust/core/src/serde/logical\_plan/from\_proto.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/500/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YmFsbGlzdGEvcnVzdC9jb3JlL3NyYy9zZXJkZS9sb2dpY2FsX3BsYW4vZnJvbV9wcm90by5ycw==) | `35.96% <0.00%> (-0.22%)` | :arrow_down: |
   | [datafusion/src/logical\_plan/builder.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/500/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvbG9naWNhbF9wbGFuL2J1aWxkZXIucnM=) | `90.04% <0.00%> (-0.05%)` | :arrow_down: |
   | [datafusion/src/physical\_plan/planner.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/500/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi9wbGFubmVyLnJz) | `80.32% <0.00%> (ø)` | |
   | [datafusion/src/optimizer/projection\_push\_down.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/500/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvb3B0aW1pemVyL3Byb2plY3Rpb25fcHVzaF9kb3duLnJz) | `98.46% <0.00%> (+<0.01%)` | :arrow_up: |
   | [datafusion/src/logical\_plan/expr.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/500/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvbG9naWNhbF9wbGFuL2V4cHIucnM=) | `84.60% <0.00%> (+0.07%)` | :arrow_up: |
   | [...lista/rust/core/src/serde/logical\_plan/to\_proto.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/500/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YmFsbGlzdGEvcnVzdC9jb3JlL3NyYy9zZXJkZS9sb2dpY2FsX3BsYW4vdG9fcHJvdG8ucnM=) | `62.48% <0.00%> (+0.15%)` | :arrow_up: |
   | [datafusion/src/sql/planner.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/500/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvc3FsL3BsYW5uZXIucnM=) | `84.37% <0.00%> (+0.26%)` | :arrow_up: |
   | ... and [2 more](https://codecov.io/gh/apache/arrow-datafusion/pull/500/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/500?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/500?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [28b0dad...5ceb541](https://codecov.io/gh/apache/arrow-datafusion/pull/500?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] codecov-commenter commented on pull request #500: Add support for boolean columns in pruning logic

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on pull request #500:
URL: https://github.com/apache/arrow-datafusion/pull/500#issuecomment-854184385


   # [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/500?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#500](https://codecov.io/gh/apache/arrow-datafusion/pull/500?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (94d0062) into [master](https://codecov.io/gh/apache/arrow-datafusion/commit/28b0dad82be302fea240bb6b177ff60abbd0f090?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (28b0dad) will **increase** coverage by `0.06%`.
   > The diff coverage is `92.91%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow-datafusion/pull/500/graphs/tree.svg?width=650&height=150&src=pr&token=JXwWBKD3D9&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-datafusion/pull/500?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master     #500      +/-   ##
   ==========================================
   + Coverage   75.92%   75.99%   +0.06%     
   ==========================================
     Files         154      154              
     Lines       26195    26266      +71     
   ==========================================
   + Hits        19889    19960      +71     
     Misses       6306     6306              
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-datafusion/pull/500?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [datafusion/src/physical\_optimizer/pruning.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/500/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfb3B0aW1pemVyL3BydW5pbmcucnM=) | `91.27% <92.91%> (+1.18%)` | :arrow_up: |
   | [datafusion/src/optimizer/utils.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/500/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvb3B0aW1pemVyL3V0aWxzLnJz) | `50.74% <0.00%> (+0.74%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/500?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/500?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [28b0dad...94d0062](https://codecov.io/gh/apache/arrow-datafusion/pull/500?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on a change in pull request #500: Add support for boolean columns in pruning logic

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #500:
URL: https://github.com/apache/arrow-datafusion/pull/500#discussion_r645121166



##########
File path: datafusion/src/physical_optimizer/pruning.rs
##########
@@ -324,42 +417,20 @@ impl<'a> PruningExpressionBuilder<'a> {
         self.scalar_expr
     }
 
-    fn is_stat_column_missing(&self, statistics_type: StatisticsType) -> bool {

Review comment:
       This logic was just refactored into `RequiredStatColumns` so I could reuse it

##########
File path: datafusion/src/physical_optimizer/pruning.rs
##########
@@ -383,6 +454,46 @@ fn rewrite_column_expr(
     utils::rewrite_expression(&expr, &expressions)
 }
 
+/// Given a column reference to `column_name`, returns a pruning
+/// expression in terms of the min and max that will evaluate to true
+/// if the column may contain values, and false if definitely does not
+/// contain values
+fn build_single_column_expr(
+    column_name: &str,
+    schema: &Schema,
+    required_columns: &mut RequiredStatColumns,
+    is_not: bool, // if true, treat as !col
+) -> Option<Expr> {
+    use crate::logical_plan;
+    let field = schema.field_with_name(column_name).ok()?;
+
+    if matches!(field.data_type(), &DataType::Boolean) {

Review comment:
       Here is the actual logic / rules

##########
File path: datafusion/src/physical_optimizer/pruning.rs
##########
@@ -953,4 +1111,92 @@ mod tests {
 
         assert_eq!(result, expected);
     }
+
+    /// Creates setup for boolean chunk pruning
+    ///
+    /// For predicate "b1" (boolean expr)
+    /// b1 [false, false] ==> no rows can pass (not keep)
+    /// b1 [false, true] ==> some rows could pass (must keep)
+    /// b1 [true, true] ==> all rows must pass (must keep)
+    /// b1 [NULL, NULL]  ==> unknown (must keep)
+    /// b1 [false, NULL]  ==> unknown (must keep)
+    ///
+    /// For predicate "!b1" (boolean expr)
+    /// b1 [false, false] ==> all rows pass (must keep)
+    /// b1 [false, true] ==> some rows could pass (must keep)
+    /// b1 [true, true] ==> no rows can pass (not keep)
+    /// b1 [NULL, NULL]  ==> unknown (must keep)
+    /// b1 [false, NULL]  ==> unknown (must keep)
+    fn bool_setup() -> (SchemaRef, TestStatistics, Vec<bool>, Vec<bool>) {
+        let schema =
+            Arc::new(Schema::new(vec![Field::new("b1", DataType::Boolean, true)]));
+
+        let statistics = TestStatistics::new().with(
+            "b1",
+            ContainerStats::new_bool(
+                vec![Some(false), Some(false), Some(true), None, Some(false)], // min
+                vec![Some(false), Some(true), Some(true), None, None],         // max
+            ),
+        );
+        let expected_true = vec![false, true, true, true, true];
+        let expected_false = vec![true, true, false, true, true];
+
+        (schema, statistics, expected_true, expected_false)
+    }
+
+    #[test]
+    fn prune_bool_column() {
+        let (schema, statistics, expected_true, _) = bool_setup();
+
+        // b1
+        let expr = col("b1");
+        let p = PruningPredicate::try_new(&expr, schema).unwrap();
+        let result = p.prune(&statistics).unwrap();
+        assert_eq!(result, expected_true);
+    }
+
+    #[test]
+    fn prune_bool_not_column() {
+        let (schema, statistics, _, expected_false) = bool_setup();
+
+        // !b1
+        let expr = col("b1").not();
+        let p = PruningPredicate::try_new(&expr, schema).unwrap();
+        let result = p.prune(&statistics).unwrap();
+        assert_eq!(result, expected_false);
+    }
+
+    #[test]
+    fn prune_bool_column_eq_true() {
+        let (schema, statistics, _, _) = bool_setup();
+
+        // b1 = true
+        let expr = col("b1").eq(lit(true));
+        let p = PruningPredicate::try_new(&expr, schema).unwrap();
+        let result = p.prune(&statistics).unwrap_err();
+        assert!(
+            result.to_string().contains(
+                "Data type Boolean not supported for scalar operation on dyn array"

Review comment:
       these aren't great messages, but they are what happens on master today, and I figured I would document them for posterity (and maybe inspire people to help fix them)

##########
File path: datafusion/src/physical_optimizer/pruning.rs
##########
@@ -391,28 +502,50 @@ fn rewrite_column_expr(
 fn build_predicate_expression(
     expr: &Expr,
     schema: &Schema,
-    stat_column_req: &mut Vec<(String, StatisticsType, Field)>,
+    required_columns: &mut RequiredStatColumns,
 ) -> Result<Expr> {
     use crate::logical_plan;
+
+    // Returned for unsupported expressions. Such expressions are
+    // converted to TRUE. This can still be useful when multiple
+    // conditions are joined using AND such as: column > 10 AND TRUE
+    let unhandled = logical_plan::lit(true);
+
     // predicate expression can only be a binary expression
     let (left, op, right) = match expr {
         Expr::BinaryExpr { left, op, right } => (left, *op, right),
+        Expr::Column(name) => {
+            if let Some(expr) =
+                build_single_column_expr(&name, schema, required_columns, false)

Review comment:
       An excellent idea -- I will do so




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on a change in pull request #500: Add support for boolean columns in pruning logic

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #500:
URL: https://github.com/apache/arrow-datafusion/pull/500#discussion_r645136877



##########
File path: datafusion/src/physical_optimizer/pruning.rs
##########
@@ -391,28 +502,50 @@ fn rewrite_column_expr(
 fn build_predicate_expression(
     expr: &Expr,
     schema: &Schema,
-    stat_column_req: &mut Vec<(String, StatisticsType, Field)>,
+    required_columns: &mut RequiredStatColumns,
 ) -> Result<Expr> {
     use crate::logical_plan;
+
+    // Returned for unsupported expressions. Such expressions are
+    // converted to TRUE. This can still be useful when multiple
+    // conditions are joined using AND such as: column > 10 AND TRUE
+    let unhandled = logical_plan::lit(true);
+
     // predicate expression can only be a binary expression
     let (left, op, right) = match expr {
         Expr::BinaryExpr { left, op, right } => (left, *op, right),
+        Expr::Column(name) => {
+            if let Some(expr) =
+                build_single_column_expr(&name, schema, required_columns, false)

Review comment:
       An excellent idea -- I will do so




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] codecov-commenter edited a comment on pull request #500: Add support for boolean columns in pruning logic

Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #500:
URL: https://github.com/apache/arrow-datafusion/pull/500#issuecomment-854184385


   # [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/500?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#500](https://codecov.io/gh/apache/arrow-datafusion/pull/500?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (5ceb541) into [master](https://codecov.io/gh/apache/arrow-datafusion/commit/28b0dad82be302fea240bb6b177ff60abbd0f090?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (28b0dad) will **increase** coverage by `0.09%`.
   > The diff coverage is `93.75%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow-datafusion/pull/500/graphs/tree.svg?width=650&height=150&src=pr&token=JXwWBKD3D9&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-datafusion/pull/500?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master     #500      +/-   ##
   ==========================================
   + Coverage   75.92%   76.02%   +0.09%     
   ==========================================
     Files         154      154              
     Lines       26195    26421     +226     
   ==========================================
   + Hits        19889    20087     +198     
   - Misses       6306     6334      +28     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-datafusion/pull/500?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [datafusion/src/physical\_optimizer/pruning.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/500/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfb3B0aW1pemVyL3BydW5pbmcucnM=) | `91.52% <93.75%> (+1.44%)` | :arrow_up: |
   | [datafusion/src/optimizer/utils.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/500/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvb3B0aW1pemVyL3V0aWxzLnJz) | `48.22% <0.00%> (-1.78%)` | :arrow_down: |
   | [...ta/rust/core/src/serde/physical\_plan/from\_proto.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/500/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YmFsbGlzdGEvcnVzdC9jb3JlL3NyYy9zZXJkZS9waHlzaWNhbF9wbGFuL2Zyb21fcHJvdG8ucnM=) | `38.79% <0.00%> (-0.85%)` | :arrow_down: |
   | [...sta/rust/core/src/serde/logical\_plan/from\_proto.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/500/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YmFsbGlzdGEvcnVzdC9jb3JlL3NyYy9zZXJkZS9sb2dpY2FsX3BsYW4vZnJvbV9wcm90by5ycw==) | `35.96% <0.00%> (-0.22%)` | :arrow_down: |
   | [datafusion/src/logical\_plan/builder.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/500/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvbG9naWNhbF9wbGFuL2J1aWxkZXIucnM=) | `90.04% <0.00%> (-0.05%)` | :arrow_down: |
   | [datafusion/src/physical\_plan/planner.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/500/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi9wbGFubmVyLnJz) | `80.32% <0.00%> (ø)` | |
   | [datafusion/src/optimizer/projection\_push\_down.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/500/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvb3B0aW1pemVyL3Byb2plY3Rpb25fcHVzaF9kb3duLnJz) | `98.46% <0.00%> (+<0.01%)` | :arrow_up: |
   | [datafusion/src/logical\_plan/expr.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/500/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvbG9naWNhbF9wbGFuL2V4cHIucnM=) | `84.60% <0.00%> (+0.07%)` | :arrow_up: |
   | [...lista/rust/core/src/serde/logical\_plan/to\_proto.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/500/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YmFsbGlzdGEvcnVzdC9jb3JlL3NyYy9zZXJkZS9sb2dpY2FsX3BsYW4vdG9fcHJvdG8ucnM=) | `62.48% <0.00%> (+0.15%)` | :arrow_up: |
   | [datafusion/src/sql/planner.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/500/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvc3FsL3BsYW5uZXIucnM=) | `84.37% <0.00%> (+0.26%)` | :arrow_up: |
   | ... and [2 more](https://codecov.io/gh/apache/arrow-datafusion/pull/500/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/500?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/500?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [28b0dad...5ceb541](https://codecov.io/gh/apache/arrow-datafusion/pull/500?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb merged pull request #500: Add support for boolean columns in pruning logic

Posted by GitBox <gi...@apache.org>.
alamb merged pull request #500:
URL: https://github.com/apache/arrow-datafusion/pull/500


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on a change in pull request #500: Add support for boolean columns in pruning logic

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #500:
URL: https://github.com/apache/arrow-datafusion/pull/500#discussion_r645121254



##########
File path: datafusion/src/physical_optimizer/pruning.rs
##########
@@ -383,6 +454,46 @@ fn rewrite_column_expr(
     utils::rewrite_expression(&expr, &expressions)
 }
 
+/// Given a column reference to `column_name`, returns a pruning
+/// expression in terms of the min and max that will evaluate to true
+/// if the column may contain values, and false if definitely does not
+/// contain values
+fn build_single_column_expr(
+    column_name: &str,
+    schema: &Schema,
+    required_columns: &mut RequiredStatColumns,
+    is_not: bool, // if true, treat as !col
+) -> Option<Expr> {
+    use crate::logical_plan;
+    let field = schema.field_with_name(column_name).ok()?;
+
+    if matches!(field.data_type(), &DataType::Boolean) {

Review comment:
       Here is the actual logic / rules

##########
File path: datafusion/src/physical_optimizer/pruning.rs
##########
@@ -953,4 +1111,92 @@ mod tests {
 
         assert_eq!(result, expected);
     }
+
+    /// Creates setup for boolean chunk pruning
+    ///
+    /// For predicate "b1" (boolean expr)
+    /// b1 [false, false] ==> no rows can pass (not keep)
+    /// b1 [false, true] ==> some rows could pass (must keep)
+    /// b1 [true, true] ==> all rows must pass (must keep)
+    /// b1 [NULL, NULL]  ==> unknown (must keep)
+    /// b1 [false, NULL]  ==> unknown (must keep)
+    ///
+    /// For predicate "!b1" (boolean expr)
+    /// b1 [false, false] ==> all rows pass (must keep)
+    /// b1 [false, true] ==> some rows could pass (must keep)
+    /// b1 [true, true] ==> no rows can pass (not keep)
+    /// b1 [NULL, NULL]  ==> unknown (must keep)
+    /// b1 [false, NULL]  ==> unknown (must keep)
+    fn bool_setup() -> (SchemaRef, TestStatistics, Vec<bool>, Vec<bool>) {
+        let schema =
+            Arc::new(Schema::new(vec![Field::new("b1", DataType::Boolean, true)]));
+
+        let statistics = TestStatistics::new().with(
+            "b1",
+            ContainerStats::new_bool(
+                vec![Some(false), Some(false), Some(true), None, Some(false)], // min
+                vec![Some(false), Some(true), Some(true), None, None],         // max
+            ),
+        );
+        let expected_true = vec![false, true, true, true, true];
+        let expected_false = vec![true, true, false, true, true];
+
+        (schema, statistics, expected_true, expected_false)
+    }
+
+    #[test]
+    fn prune_bool_column() {
+        let (schema, statistics, expected_true, _) = bool_setup();
+
+        // b1
+        let expr = col("b1");
+        let p = PruningPredicate::try_new(&expr, schema).unwrap();
+        let result = p.prune(&statistics).unwrap();
+        assert_eq!(result, expected_true);
+    }
+
+    #[test]
+    fn prune_bool_not_column() {
+        let (schema, statistics, _, expected_false) = bool_setup();
+
+        // !b1
+        let expr = col("b1").not();
+        let p = PruningPredicate::try_new(&expr, schema).unwrap();
+        let result = p.prune(&statistics).unwrap();
+        assert_eq!(result, expected_false);
+    }
+
+    #[test]
+    fn prune_bool_column_eq_true() {
+        let (schema, statistics, _, _) = bool_setup();
+
+        // b1 = true
+        let expr = col("b1").eq(lit(true));
+        let p = PruningPredicate::try_new(&expr, schema).unwrap();
+        let result = p.prune(&statistics).unwrap_err();
+        assert!(
+            result.to_string().contains(
+                "Data type Boolean not supported for scalar operation on dyn array"

Review comment:
       these aren't great messages, but they are what happens on master today, and I figured I would document them for posterity (and maybe inspire people to help fix them)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #500: Add support for boolean columns in pruning logic

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #500:
URL: https://github.com/apache/arrow-datafusion/pull/500#issuecomment-854871805


   I rebased this PR and added a few more tests. The code is unchanged


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #500: Add support for boolean columns in pruning logic

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #500:
URL: https://github.com/apache/arrow-datafusion/pull/500#discussion_r645126489



##########
File path: datafusion/src/physical_optimizer/pruning.rs
##########
@@ -391,28 +502,50 @@ fn rewrite_column_expr(
 fn build_predicate_expression(
     expr: &Expr,
     schema: &Schema,
-    stat_column_req: &mut Vec<(String, StatisticsType, Field)>,
+    required_columns: &mut RequiredStatColumns,
 ) -> Result<Expr> {
     use crate::logical_plan;
+
+    // Returned for unsupported expressions. Such expressions are
+    // converted to TRUE. This can still be useful when multiple
+    // conditions are joined using AND such as: column > 10 AND TRUE
+    let unhandled = logical_plan::lit(true);
+
     // predicate expression can only be a binary expression
     let (left, op, right) = match expr {
         Expr::BinaryExpr { left, op, right } => (left, *op, right),
+        Expr::Column(name) => {
+            if let Some(expr) =
+                build_single_column_expr(&name, schema, required_columns, false)

Review comment:
       This kind of pattern probably can be written a bit shorter with some combinators. Something like:
   
   `build_single_column_expr(&name, schema, required_columns).ok().or(Ok(unhandled))`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #500: Add support for boolean columns in pruning logic

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #500:
URL: https://github.com/apache/arrow-datafusion/pull/500#discussion_r645126489



##########
File path: datafusion/src/physical_optimizer/pruning.rs
##########
@@ -391,28 +502,50 @@ fn rewrite_column_expr(
 fn build_predicate_expression(
     expr: &Expr,
     schema: &Schema,
-    stat_column_req: &mut Vec<(String, StatisticsType, Field)>,
+    required_columns: &mut RequiredStatColumns,
 ) -> Result<Expr> {
     use crate::logical_plan;
+
+    // Returned for unsupported expressions. Such expressions are
+    // converted to TRUE. This can still be useful when multiple
+    // conditions are joined using AND such as: column > 10 AND TRUE
+    let unhandled = logical_plan::lit(true);
+
     // predicate expression can only be a binary expression
     let (left, op, right) = match expr {
         Expr::BinaryExpr { left, op, right } => (left, *op, right),
+        Expr::Column(name) => {
+            if let Some(expr) =
+                build_single_column_expr(&name, schema, required_columns, false)

Review comment:
       This kind of pattern probably can be written a bit shorter with some combinators. Something like:
   
   `build_single_column_expr(&name, schema, required_columns).ok().or(Ok(unhandled))`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] codecov-commenter commented on pull request #500: Add support for boolean columns in pruning logic

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on pull request #500:
URL: https://github.com/apache/arrow-datafusion/pull/500#issuecomment-854184385


   # [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/500?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#500](https://codecov.io/gh/apache/arrow-datafusion/pull/500?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (94d0062) into [master](https://codecov.io/gh/apache/arrow-datafusion/commit/28b0dad82be302fea240bb6b177ff60abbd0f090?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (28b0dad) will **increase** coverage by `0.06%`.
   > The diff coverage is `92.91%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow-datafusion/pull/500/graphs/tree.svg?width=650&height=150&src=pr&token=JXwWBKD3D9&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-datafusion/pull/500?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master     #500      +/-   ##
   ==========================================
   + Coverage   75.92%   75.99%   +0.06%     
   ==========================================
     Files         154      154              
     Lines       26195    26266      +71     
   ==========================================
   + Hits        19889    19960      +71     
     Misses       6306     6306              
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-datafusion/pull/500?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [datafusion/src/physical\_optimizer/pruning.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/500/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfb3B0aW1pemVyL3BydW5pbmcucnM=) | `91.27% <92.91%> (+1.18%)` | :arrow_up: |
   | [datafusion/src/optimizer/utils.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/500/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvb3B0aW1pemVyL3V0aWxzLnJz) | `50.74% <0.00%> (+0.74%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/500?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/500?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [28b0dad...94d0062](https://codecov.io/gh/apache/arrow-datafusion/pull/500?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org