You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/01/18 18:39:34 UTC

[GitHub] [arrow-datafusion] alamb commented on a change in pull request #1595: Fix null comparison for Parquet pruning predicate

alamb commented on a change in pull request #1595:
URL: https://github.com/apache/arrow-datafusion/pull/1595#discussion_r786049767



##########
File path: datafusion/src/physical_optimizer/pruning.rs
##########
@@ -421,6 +444,13 @@ impl<'a> PruningExpressionBuilder<'a> {
         &self.scalar_expr
     }
 
+    fn scalar_expr_value(&self) -> Result<&ScalarValue> {

Review comment:
       ```suggestion
       fn scalar_expr_value(&self) -> Option<&ScalarValue> {
   ```
   
   Would save a string creation on error (not that it really matters)

##########
File path: datafusion/src/physical_optimizer/pruning.rs
##########
@@ -657,13 +696,23 @@ fn build_statistics_expr(expr_builder: &mut PruningExpressionBuilder) -> Result<
                     .or(expr_builder.scalar_expr().clone().not_eq(max_column_expr))
             }
             Operator::Eq => {
-                // column = literal => (min, max) = literal => min <= literal && literal <= max
-                // (column / 2) = 4 => (column_min / 2) <= 4 && 4 <= (column_max / 2)
-                let min_column_expr = expr_builder.min_column_expr()?;
-                let max_column_expr = expr_builder.max_column_expr()?;
-                min_column_expr
-                    .lt_eq(expr_builder.scalar_expr().clone())
-                    .and(expr_builder.scalar_expr().clone().lt_eq(max_column_expr))
+                if expr_builder
+                    .scalar_expr_value()
+                    .map(|s| s.is_null())
+                    .unwrap_or(false)
+                {
+                    // column = null => null_count > 0
+                    let null_count_column_expr = expr_builder.null_count_column_expr()?;
+                    null_count_column_expr.gt(lit::<i64>(0))

Review comment:
       I am curious why we use a `i64` here rather than `u64`?

##########
File path: datafusion/src/physical_optimizer/pruning.rs
##########
@@ -430,6 +460,15 @@ impl<'a> PruningExpressionBuilder<'a> {
         self.required_columns
             .max_column_expr(&self.column, &self.column_expr, self.field)
     }
+
+    fn null_count_column_expr(&mut self) -> Result<Expr> {
+        let null_count_field = &Field::new(self.field.name(), DataType::Int64, false);
+        self.required_columns.null_count_column_expr(

Review comment:
       👍 

##########
File path: datafusion/src/physical_plan/file_format/parquet.rs
##########
@@ -757,10 +784,8 @@ mod tests {
             .enumerate()
             .map(|(i, g)| row_group_predicate(g, i))
             .collect::<Vec<_>>();
-        // no row group is filtered out because the predicate expression can't be evaluated
-        // when a null array is generated for a statistics column,
-        // because the null values propagate to the end result, making the predicate result undefined
-        assert_eq!(row_group_filter, vec![true, true]);
+        // First row group was filtered out because it contains no null value on "c2".
+        assert_eq!(row_group_filter, vec![false, true]);

Review comment:
       I actually think this could be `vec![false, false]` as the predicate can never be true (`int > 1 AND bool = NULL` is always `NULL`)

##########
File path: datafusion/src/physical_optimizer/pruning.rs
##########
@@ -75,6 +77,10 @@ pub trait PruningStatistics {
     /// return the number of containers (e.g. row groups) being
     /// pruned with these statistics
     fn num_containers(&self) -> usize;
+
+    /// return the number of null values for the named column.
+    /// Note: the returned array must contain `num_containers()` rows.
+    fn null_counts(&self, column: &Column) -> Option<ArrayRef>;

Review comment:
       ```suggestion
   
       /// return the number of null values for the named column as an 
       /// `Option<Int64Array>`.
       /// 
       /// Note: the returned array must contain `num_containers()` rows.
       fn null_counts(&self, column: &Column) -> Option<ArrayRef>;
   ```
   
   I had to look this up to figure out what type this was required
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org