You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/06/14 18:05:37 UTC

[GitHub] [arrow-datafusion] alamb opened a new pull request #561: Revert pruning on not equal predicate

alamb opened a new pull request #561:
URL: https://github.com/apache/arrow-datafusion/pull/561


   Closes #560 
   
   # Rationale for this change
   Logic is incorrect
   
   # What changes are included in this PR?
   1. Revert https://github.com/apache/arrow-datafusion/commit/2568323dbd85e05f2bf3e6e484f7cc39983ff26c / #544
   2. Add a test with expected results
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on a change in pull request #561: Revert pruning on not equal predicate

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #561:
URL: https://github.com/apache/arrow-datafusion/pull/561#discussion_r651166362



##########
File path: datafusion/src/physical_optimizer/pruning.rs
##########
@@ -1190,6 +1162,34 @@ mod tests {
         assert_eq!(result, expected);
     }
 
+    #[test]
+    fn prune_not_eq_data() {
+        let schema = Arc::new(Schema::new(vec![Field::new("s1", DataType::Utf8, true)]));
+
+        // Prune using s2 != 'M'
+        let expr = col("s1").not_eq(lit("M"));
+
+        let statistics = TestStatistics::new().with(
+            "s1",
+            ContainerStats::new_utf8(
+                vec![Some("A"), Some("A"), Some("N"), None, Some("A")], // min
+                vec![Some("Z"), Some("L"), Some("Z"), None, None],      // max
+            ),
+        );
+
+        // s1 [A, Z] ==> might have values that pass predicate
+        // s1 [A, L] ==> all rows pass the predicate
+        // s1 [N, Z] ==> all rows pass the predicate
+        // No stats for s2 ==> some rows could pass
+        // s2 [3, None] (null max) ==> some rows could pass
+
+        let p = PruningPredicate::try_new(&expr, schema).unwrap();
+        let result = p.prune(&statistics).unwrap();
+        let expected = vec![true, true, true, true, true];

Review comment:
       On master this test fails thusly:
   
   
   ```
   ---- physical_optimizer::pruning::tests::prune_not_eq_data stdout ----
   thread 'physical_optimizer::pruning::tests::prune_not_eq_data' panicked at 'assertion failed: `(left == right)`
     left: `[false, true, true, false, true, true]`,
    right: `[true, true, true, false, true, true]`', datafusion/src/physical_optimizer/pruning.rs:1218:9
   
   ```
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] jgoday commented on pull request #561: Fix pruning on not equal predicate

Posted by GitBox <gi...@apache.org>.
jgoday commented on pull request #561:
URL: https://github.com/apache/arrow-datafusion/pull/561#issuecomment-860937695


   > fyi @jgoday
   
   @alamb, Sorry for introducing the bug, prune_not_eq_data test is much clearer to me now. Thank you!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb merged pull request #561: Fix pruning on not equal predicate

Posted by GitBox <gi...@apache.org>.
alamb merged pull request #561:
URL: https://github.com/apache/arrow-datafusion/pull/561


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #561: Revert pruning on not equal predicate

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #561:
URL: https://github.com/apache/arrow-datafusion/pull/561#issuecomment-860886806


   I actually think I can still support not equals, I just need to make it a bit more restricted
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #561: Fix pruning on not equal predicate

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #561:
URL: https://github.com/apache/arrow-datafusion/pull/561#discussion_r651243037



##########
File path: datafusion/src/physical_optimizer/pruning.rs
##########
@@ -553,12 +553,14 @@ fn build_predicate_expression(
     let corrected_op = expr_builder.correct_operator(op);
     let statistics_expr = match corrected_op {
         Operator::NotEq => {
-            // column != literal => (min, max) = literal => min > literal || literal > max
+            // column != literal => (min, max) = literal =>
+            // !(min != literal && max != literal) ==>
+            // min != literal || literal != max
             let min_column_expr = expr_builder.min_column_expr()?;
             let max_column_expr = expr_builder.max_column_expr()?;
             min_column_expr
-                .gt(expr_builder.scalar_expr().clone())
-                .or(expr_builder.scalar_expr().clone().gt(max_column_expr))
+                .not_eq(expr_builder.scalar_expr().clone())
+                .or(expr_builder.scalar_expr().clone().not_eq(max_column_expr))

Review comment:
       Shouldn't this be `and`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on a change in pull request #561: Fix pruning on not equal predicate

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #561:
URL: https://github.com/apache/arrow-datafusion/pull/561#discussion_r651282910



##########
File path: datafusion/src/physical_optimizer/pruning.rs
##########
@@ -553,12 +553,14 @@ fn build_predicate_expression(
     let corrected_op = expr_builder.correct_operator(op);
     let statistics_expr = match corrected_op {
         Operator::NotEq => {
-            // column != literal => (min, max) = literal => min > literal || literal > max
+            // column != literal => (min, max) = literal =>
+            // !(min != literal && max != literal) ==>
+            // min != literal || literal != max
             let min_column_expr = expr_builder.min_column_expr()?;
             let max_column_expr = expr_builder.max_column_expr()?;
             min_column_expr
-                .gt(expr_builder.scalar_expr().clone())
-                .or(expr_builder.scalar_expr().clone().gt(max_column_expr))
+                .not_eq(expr_builder.scalar_expr().clone())
+                .or(expr_builder.scalar_expr().clone().not_eq(max_column_expr))

Review comment:
       The way I think about it = predicate is effectively `!(col == min && col == max)` which when you apply Demorgan's law you get `(col != min) || (col != max)`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #561: Revert pruning on not equal predicate

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #561:
URL: https://github.com/apache/arrow-datafusion/pull/561#issuecomment-860896807


   fyi @jgoday 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on a change in pull request #561: Revert pruning on not equal predicate

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #561:
URL: https://github.com/apache/arrow-datafusion/pull/561#discussion_r651166601



##########
File path: datafusion/src/physical_optimizer/pruning.rs
##########
@@ -1190,6 +1162,34 @@ mod tests {
         assert_eq!(result, expected);
     }
 
+    #[test]
+    fn prune_not_eq_data() {
+        let schema = Arc::new(Schema::new(vec![Field::new("s1", DataType::Utf8, true)]));
+
+        // Prune using s2 != 'M'
+        let expr = col("s1").not_eq(lit("M"));
+
+        let statistics = TestStatistics::new().with(
+            "s1",
+            ContainerStats::new_utf8(
+                vec![Some("A"), Some("A"), Some("N"), None, Some("A")], // min
+                vec![Some("Z"), Some("L"), Some("Z"), None, None],      // max
+            ),
+        );
+
+        // s1 [A, Z] ==> might have values that pass predicate
+        // s1 [A, L] ==> all rows pass the predicate
+        // s1 [N, Z] ==> all rows pass the predicate
+        // No stats for s2 ==> some rows could pass
+        // s2 [3, None] (null max) ==> some rows could pass
+
+        let p = PruningPredicate::try_new(&expr, schema).unwrap();
+        let result = p.prune(&statistics).unwrap();
+        let expected = vec![true, true, true, true, true];

Review comment:
       (aka it incorrectly prunes out the range of `A -> Z`)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #561: Fix pruning on not equal predicate

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #561:
URL: https://github.com/apache/arrow-datafusion/pull/561#issuecomment-860947331


   > @alamb, Sorry for introducing the bug, prune_not_eq_data test is much clearer to me now. Thank you!
   
   No worries @jgoday  -- both @Dandandan  and I missed it on the review as well! This is tricky stuff


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on a change in pull request #561: Revert pruning on not equal predicate

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #561:
URL: https://github.com/apache/arrow-datafusion/pull/561#discussion_r651166362



##########
File path: datafusion/src/physical_optimizer/pruning.rs
##########
@@ -1190,6 +1162,34 @@ mod tests {
         assert_eq!(result, expected);
     }
 
+    #[test]
+    fn prune_not_eq_data() {
+        let schema = Arc::new(Schema::new(vec![Field::new("s1", DataType::Utf8, true)]));
+
+        // Prune using s2 != 'M'
+        let expr = col("s1").not_eq(lit("M"));
+
+        let statistics = TestStatistics::new().with(
+            "s1",
+            ContainerStats::new_utf8(
+                vec![Some("A"), Some("A"), Some("N"), None, Some("A")], // min
+                vec![Some("Z"), Some("L"), Some("Z"), None, None],      // max
+            ),
+        );
+
+        // s1 [A, Z] ==> might have values that pass predicate
+        // s1 [A, L] ==> all rows pass the predicate
+        // s1 [N, Z] ==> all rows pass the predicate
+        // No stats for s2 ==> some rows could pass
+        // s2 [3, None] (null max) ==> some rows could pass
+
+        let p = PruningPredicate::try_new(&expr, schema).unwrap();
+        let result = p.prune(&statistics).unwrap();
+        let expected = vec![true, true, true, true, true];

Review comment:
       On master this test fails thusly:
   
   
   ```
   ---- physical_optimizer::pruning::tests::prune_not_eq_data stdout ----
   thread 'physical_optimizer::pruning::tests::prune_not_eq_data' panicked at 'assertion failed: `(left == right)`
     left: `[false, true, true, true, true]`,
    right: `[true, true, true, true, true]`', datafusion/src/physical_optimizer/pruning.rs:1220:9
   note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
   ```
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb removed a comment on pull request #561: Revert pruning on not equal predicate

Posted by GitBox <gi...@apache.org>.
alamb removed a comment on pull request #561:
URL: https://github.com/apache/arrow-datafusion/pull/561#issuecomment-860886806


   I actually think I can still support not equals, I just need to make it a bit more restricted
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #561: Fix pruning on not equal predicate

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #561:
URL: https://github.com/apache/arrow-datafusion/pull/561#discussion_r651241366



##########
File path: datafusion/src/physical_optimizer/pruning.rs
##########
@@ -1190,6 +1192,34 @@ mod tests {
         assert_eq!(result, expected);
     }
 
+    #[test]
+    fn prune_not_eq_data() {
+        let schema = Arc::new(Schema::new(vec![Field::new("s1", DataType::Utf8, true)]));
+
+        // Prune using s2 != 'M'
+        let expr = col("s1").not_eq(lit("M"));
+
+        let statistics = TestStatistics::new().with(
+            "s1",
+            ContainerStats::new_utf8(
+                vec![Some("A"), Some("A"), Some("N"), Some("M"), None, Some("A")], // min

Review comment:
       What about M / N ?
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #561: Fix pruning on not equal predicate

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #561:
URL: https://github.com/apache/arrow-datafusion/pull/561#discussion_r651243733



##########
File path: datafusion/src/physical_optimizer/pruning.rs
##########
@@ -553,12 +553,14 @@ fn build_predicate_expression(
     let corrected_op = expr_builder.correct_operator(op);
     let statistics_expr = match corrected_op {
         Operator::NotEq => {
-            // column != literal => (min, max) = literal => min > literal || literal > max
+            // column != literal => (min, max) = literal =>
+            // !(min != literal && max != literal) ==>
+            // min != literal || literal != max
             let min_column_expr = expr_builder.min_column_expr()?;
             let max_column_expr = expr_builder.max_column_expr()?;
             min_column_expr
-                .gt(expr_builder.scalar_expr().clone())
-                .or(expr_builder.scalar_expr().clone().gt(max_column_expr))
+                .not_eq(expr_builder.scalar_expr().clone())
+                .or(expr_builder.scalar_expr().clone().not_eq(max_column_expr))

Review comment:
       No - did some extra check - this seems the right thing.

##########
File path: datafusion/src/physical_optimizer/pruning.rs
##########
@@ -1190,6 +1192,34 @@ mod tests {
         assert_eq!(result, expected);
     }
 
+    #[test]
+    fn prune_not_eq_data() {
+        let schema = Arc::new(Schema::new(vec![Field::new("s1", DataType::Utf8, true)]));
+
+        // Prune using s2 != 'M'
+        let expr = col("s1").not_eq(lit("M"));
+
+        let statistics = TestStatistics::new().with(
+            "s1",
+            ContainerStats::new_utf8(
+                vec![Some("A"), Some("A"), Some("N"), Some("M"), None, Some("A")], // min

Review comment:
       This will also do the right thing 👍 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org