Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/06/03 11:00:07 UTC

[GitHub] [arrow-datafusion] alamb opened a new issue #490: Support pruning for `boolean` columns

alamb opened a new issue #490:
URL: https://github.com/apache/arrow-datafusion/issues/490


   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   Pruning containers, such as Parquet row groups, based on boolean columns (e.g. a flag column) does not currently work.
   
   For example, a query like
   ```sql
   select * from my_parquet_based_table where my_flag_column = true
   ```
   will not prune any row groups based on the `my_flag_column` predicate.
   
   **Describe the solution you'd like**
   I would like pruning to occur for boolean columns, i.e. support for them should be added here: https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/physical_optimizer/pruning.rs
   
   Here is an example test that fails:
   ```diff
   diff --git a/datafusion/src/physical_optimizer/pruning.rs b/datafusion/src/physical_optimizer/pruning.rs
   index 3a5a64c6f..d2e93b9b5 100644
   --- a/datafusion/src/physical_optimizer/pruning.rs
   +++ b/datafusion/src/physical_optimizer/pruning.rs
   @@ -508,6 +508,16 @@ mod tests {
                }
            }
    
   +        fn new_bool<'a>(
   +            min: impl IntoIterator<Item = Option<bool>>,
   +            max: impl IntoIterator<Item = Option<bool>>,
   +        ) -> Self {
   +            Self {
   +                min: Arc::new(min.into_iter().collect::<BooleanArray>()),
   +                max: Arc::new(max.into_iter().collect::<BooleanArray>()),
   +            }
   +        }
   +
            fn min(&self) -> Option<ArrayRef> {
                Some(self.min.clone())
            }
   @@ -927,8 +937,8 @@ mod tests {
        #[test]
        fn prune_api() {
            let schema = Arc::new(Schema::new(vec![
   -            Field::new("s1", DataType::Utf8, false),
   -            Field::new("s2", DataType::Int32, false),
   +            Field::new("s1", DataType::Utf8, true),
   +            Field::new("s2", DataType::Int32, true),
            ]));
    
            // Prune using s2 > 5
   @@ -953,4 +963,35 @@ mod tests {
    
            assert_eq!(result, expected);
        }
   +
   +
   +    #[test]
   +    fn prune_api_bool() {
   +        let schema = Arc::new(Schema::new(vec![
   +            Field::new("b1", DataType::Boolean, true),
   +        ]));
   +
   +        let statistics = TestStatistics::new().with(
   +            "b1",
   +            ContainerStats::new_bool(
   +                vec![Some(false), Some(false), Some(true), None, Some(false)], // min
   +                vec![Some(false), Some(true),  Some(true), None, None ], // max
   +            ),
   +        );
   +
   +        // For predicate "b1" (boolean expr)
   +        // b1 [false, false] ==> no rows should pass
   +        // b1 [false, true] ==> some rows could pass
   +        // b1 [true, true] ==> some rows could pass
   +        // b1 [NULL, NULL]  ==> no rows could pass
   +        // b1 [false, NULL]  ==> no rows could pass
   +        let expr = col("b1");
   +        let expected = vec![false, true, true, false, false];
   +
   +        let p = PruningPredicate::try_new(&expr, schema).unwrap();
   +        let result = p.prune(&statistics).unwrap();
   +
   +        assert_eq!(result, expected);
   +    }
   +
    }
   ```
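
   As a rough illustration of the behavior the test above expects, here is a minimal, self-contained sketch that does not use DataFusion. `BoolStats`, `might_be_true`, `might_be_false`, and `prune_for_true` are hypothetical names and are not part of `pruning.rs`. The sketch relies on booleans being ordered `false < true`, so a container can hold a `true` value only if its max is `true`. Treating a missing max as "prune" mirrors the `expected` vector in the test; a more conservative implementation could instead keep containers whose statistics are unknown.

   ```rust
   /// Hypothetical min/max statistics for a boolean column in one container
   /// (e.g. a parquet row group). `None` means the statistic is missing.
   #[derive(Debug, Clone, Copy, PartialEq)]
   struct BoolStats {
       min: Option<bool>,
       max: Option<bool>,
   }

   /// Could any row satisfy the predicate `b1` (i.e. `b1 = true`)?
   /// Since false < true, a `true` value exists only if max == true.
   /// Assumption: a missing max is treated as "no match", as in the test above.
   fn might_be_true(stats: BoolStats) -> bool {
       stats.max == Some(true)
   }

   /// Could any row satisfy the predicate `NOT b1` (i.e. `b1 = false`)?
   /// A `false` value exists only if min == false.
   fn might_be_false(stats: BoolStats) -> bool {
       stats.min == Some(false)
   }

   /// For each container, return true if it must be kept (scanned) for the
   /// predicate `b1`, and false if it can be skipped.
   fn prune_for_true(containers: &[BoolStats]) -> Vec<bool> {
       containers.iter().map(|s| might_be_true(*s)).collect()
   }
   ```

   Fed the same five min/max pairs as `prune_api_bool`, `prune_for_true` returns the same keep/skip vector as `expected` in that test.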


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb closed issue #490: Support pruning for `boolean` columns

Posted by GitBox <gi...@apache.org>.
alamb closed issue #490:
URL: https://github.com/apache/arrow-datafusion/issues/490


   

