You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "jonahgao (via GitHub)" <gi...@apache.org> on 2023/06/29 15:51:02 UTC

[GitHub] [arrow-datafusion] jonahgao opened a new pull request, #6799: fix: incorrect nullability of `InList` expr

jonahgao opened a new pull request, #6799:
URL: https://github.com/apache/arrow-datafusion/pull/6799

   # Which issue does this PR close?
   
   None
   
   # Rationale for this change
   
   Similar to #6786 .
   
   We can safely assume that the `InList` expr is non-nullalbe only if all of its subexpressions are non-nullable.
   
   An example of a nullable expression is:
   ```
   DataFusion CLI v27.0.0
   
   ❯ select 1 in(2, null);
   +----------------------------------------------------------------------+
   | Int64(1) IN (Map { iter: Iter([Literal(Int64(2)), Literal(NULL)]) }) |
   +----------------------------------------------------------------------+
   |                                                                      |
   +----------------------------------------------------------------------+
   ```
   
   # What changes are included in this PR?
   - fix incorrect nullability of `InList` expr
   - cleaner code for insert.rs
   
   
   # Are these changes tested?
   Yes
   
   # Are there any user-facing changes?
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jonahgao commented on a diff in pull request #6799: fix: incorrect nullability of `InList` expr

Posted by "jonahgao (via GitHub)" <gi...@apache.org>.
jonahgao commented on code in PR #6799:
URL: https://github.com/apache/arrow-datafusion/pull/6799#discussion_r1247318235


##########
datafusion/expr/src/expr_schema.rs:
##########
@@ -173,8 +173,30 @@ impl ExprSchemable for Expr {
             Expr::Alias(expr, _)
             | Expr::Not(expr)
             | Expr::Negative(expr)
-            | Expr::Sort(Sort { expr, .. })
-            | Expr::InList(InList { expr, .. }) => expr.nullable(input_schema),
+            | Expr::Sort(Sort { expr, .. }) => expr.nullable(input_schema),
+
+            Expr::InList(InList { expr, list, .. }) => {
+                // Avoid inspecting too many expressions.
+                const MAX_INSPECT_LIMIT: usize = 6;
+                // Stop if a nullable expression is found or an error occurs.
+                let has_nullable = std::iter::once(expr.as_ref())
+                    .chain(list)
+                    .take(MAX_INSPECT_LIMIT)
+                    .find_map(|e| {
+                        e.nullable(input_schema)
+                            .map(|x| if x { Some(()) } else { None })

Review Comment:
   Thank you @alamb  



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jackwener merged pull request #6799: fix: incorrect nullability of `InList` expr

Posted by "jackwener (via GitHub)" <gi...@apache.org>.
jackwener merged PR #6799:
URL: https://github.com/apache/arrow-datafusion/pull/6799


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jonahgao commented on a diff in pull request #6799: fix: incorrect nullability of `InList` expr

Posted by "jonahgao (via GitHub)" <gi...@apache.org>.
jonahgao commented on code in PR #6799:
URL: https://github.com/apache/arrow-datafusion/pull/6799#discussion_r1247692275


##########
datafusion/expr/src/expr_schema.rs:
##########
@@ -173,8 +173,30 @@ impl ExprSchemable for Expr {
             Expr::Alias(expr, _)
             | Expr::Not(expr)
             | Expr::Negative(expr)
-            | Expr::Sort(Sort { expr, .. })
-            | Expr::InList(InList { expr, .. }) => expr.nullable(input_schema),
+            | Expr::Sort(Sort { expr, .. }) => expr.nullable(input_schema),
+
+            Expr::InList(InList { expr, list, .. }) => {
+                // Avoid inspecting too many expressions.
+                const MAX_INSPECT_LIMIT: usize = 6;

Review Comment:
   > I want to know why use this number? Is it the practice from other systems?
   
   @jackwener No. The `nullable` function may be called multiple times during the optimization phase, 
   So I think adding a limitation would be preferable in order to prevent it from being excessively slow.
   But I'm not quite sure what would be an appropriate number.
   
      
   
   > spark handle it simplified.
   > 
   > ```scala
   > override def nullable: Boolean = children. Exists(_.nullable)
   > ```
   This seems to be a cache style.
   We can implement this by precomputing nullable in the `InList::new()` function.
   But the disadvantages are:
   - We need a new field for the `InList` struct
   - The precomputed nullable may not be used.
   - Calculating nullable requires `input_schema`
   
   @jackwener  Which solution do you prefer?
   
   Update: It seems challenging to accomplish. I need to take a closer look.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jackwener commented on a diff in pull request #6799: fix: incorrect nullability of `InList` expr

Posted by "jackwener (via GitHub)" <gi...@apache.org>.
jackwener commented on code in PR #6799:
URL: https://github.com/apache/arrow-datafusion/pull/6799#discussion_r1247648888


##########
datafusion/expr/src/expr_schema.rs:
##########
@@ -173,8 +173,30 @@ impl ExprSchemable for Expr {
             Expr::Alias(expr, _)
             | Expr::Not(expr)
             | Expr::Negative(expr)
-            | Expr::Sort(Sort { expr, .. })
-            | Expr::InList(InList { expr, .. }) => expr.nullable(input_schema),
+            | Expr::Sort(Sort { expr, .. }) => expr.nullable(input_schema),
+
+            Expr::InList(InList { expr, list, .. }) => {
+                // Avoid inspecting too many expressions.
+                const MAX_INSPECT_LIMIT: usize = 6;

Review Comment:
   I want to know why use this number? Is it the practice from other systems?



##########
datafusion/expr/src/expr_schema.rs:
##########
@@ -173,8 +173,30 @@ impl ExprSchemable for Expr {
             Expr::Alias(expr, _)
             | Expr::Not(expr)
             | Expr::Negative(expr)
-            | Expr::Sort(Sort { expr, .. })
-            | Expr::InList(InList { expr, .. }) => expr.nullable(input_schema),
+            | Expr::Sort(Sort { expr, .. }) => expr.nullable(input_schema),
+
+            Expr::InList(InList { expr, list, .. }) => {
+                // Avoid inspecting too many expressions.
+                const MAX_INSPECT_LIMIT: usize = 6;

Review Comment:
   spark handle it simplified.
   
   ```scala
   override def nullable: Boolean = children. Exists(_.nullable)
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #6799: fix: incorrect nullability of `InList` expr

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #6799:
URL: https://github.com/apache/arrow-datafusion/pull/6799#discussion_r1246980465


##########
datafusion/expr/src/expr_schema.rs:
##########
@@ -173,8 +173,30 @@ impl ExprSchemable for Expr {
             Expr::Alias(expr, _)
             | Expr::Not(expr)
             | Expr::Negative(expr)
-            | Expr::Sort(Sort { expr, .. })
-            | Expr::InList(InList { expr, .. }) => expr.nullable(input_schema),
+            | Expr::Sort(Sort { expr, .. }) => expr.nullable(input_schema),
+
+            Expr::InList(InList { expr, list, .. }) => {
+                // Avoid inspecting too many expressions.
+                const MAX_INSPECT_LIMIT: usize = 6;
+                // Stop if a nullable expression is found or an error occurs.
+                let has_nullable = std::iter::once(expr.as_ref())
+                    .chain(list)
+                    .take(MAX_INSPECT_LIMIT)
+                    .find_map(|e| {
+                        e.nullable(input_schema)
+                            .map(|x| if x { Some(()) } else { None })

Review Comment:
   nit: I think calling this `nullable` would make the code easer to read:
   
   ```suggestion
                               .map(|nullable| if nullable { Some(()) } else { None })
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jonahgao commented on a diff in pull request #6799: fix: incorrect nullability of `InList` expr

Posted by "jonahgao (via GitHub)" <gi...@apache.org>.
jonahgao commented on code in PR #6799:
URL: https://github.com/apache/arrow-datafusion/pull/6799#discussion_r1247692275


##########
datafusion/expr/src/expr_schema.rs:
##########
@@ -173,8 +173,30 @@ impl ExprSchemable for Expr {
             Expr::Alias(expr, _)
             | Expr::Not(expr)
             | Expr::Negative(expr)
-            | Expr::Sort(Sort { expr, .. })
-            | Expr::InList(InList { expr, .. }) => expr.nullable(input_schema),
+            | Expr::Sort(Sort { expr, .. }) => expr.nullable(input_schema),
+
+            Expr::InList(InList { expr, list, .. }) => {
+                // Avoid inspecting too many expressions.
+                const MAX_INSPECT_LIMIT: usize = 6;

Review Comment:
   > I want to know why use this number? Is it the practice from other systems?
   
   @jackwener No. The `nullable` function may be called multiple times during the optimization phase, 
   So I think adding a limitation would be preferable in order to prevent it from being excessively slow.
   But I'm not quite sure what would be an appropriate number.
   
      
   
   > spark handle it simplified.
   > 
   > ```scala
   > override def nullable: Boolean = children. Exists(_.nullable)
   > ```
   This seems to be a cache style.
   We can implement this by precomputing nullable in the `InList::new()` function.
   But the disadvantage are:
   - We need a new field for the `InList` struct
   - The precomputed nullable may not be used.
   
   @jackwener  Which solution do you prefer?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jonahgao commented on a diff in pull request #6799: fix: incorrect nullability of `InList` expr

Posted by "jonahgao (via GitHub)" <gi...@apache.org>.
jonahgao commented on code in PR #6799:
URL: https://github.com/apache/arrow-datafusion/pull/6799#discussion_r1247692275


##########
datafusion/expr/src/expr_schema.rs:
##########
@@ -173,8 +173,30 @@ impl ExprSchemable for Expr {
             Expr::Alias(expr, _)
             | Expr::Not(expr)
             | Expr::Negative(expr)
-            | Expr::Sort(Sort { expr, .. })
-            | Expr::InList(InList { expr, .. }) => expr.nullable(input_schema),
+            | Expr::Sort(Sort { expr, .. }) => expr.nullable(input_schema),
+
+            Expr::InList(InList { expr, list, .. }) => {
+                // Avoid inspecting too many expressions.
+                const MAX_INSPECT_LIMIT: usize = 6;

Review Comment:
   > I want to know why use this number? Is it the practice from other systems?
   
   @jackwener No. The `nullable` function may be called multiple times during the optimization phase, 
   So I think adding a limitation would be preferable in order to prevent it from being excessively slow.
   But I'm not quite sure what would be an appropriate number.
   
      
   
   > spark handle it simplified.
   > 
   > ```scala
   > override def nullable: Boolean = children. Exists(_.nullable)
   > ```
   This seems to be a cache style.
   We can implement this by precomputing nullable in the `InList::new()` function.
   But the disadvantages are:
   - We need a new field for the `InList` struct
   - The precomputed nullable may not be used.
   
   @jackwener  Which solution do you prefer?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jackwener commented on a diff in pull request #6799: fix: incorrect nullability of `InList` expr

Posted by "jackwener (via GitHub)" <gi...@apache.org>.
jackwener commented on code in PR #6799:
URL: https://github.com/apache/arrow-datafusion/pull/6799#discussion_r1248611233


##########
datafusion/expr/src/expr_schema.rs:
##########
@@ -173,8 +173,30 @@ impl ExprSchemable for Expr {
             Expr::Alias(expr, _)
             | Expr::Not(expr)
             | Expr::Negative(expr)
-            | Expr::Sort(Sort { expr, .. })
-            | Expr::InList(InList { expr, .. }) => expr.nullable(input_schema),
+            | Expr::Sort(Sort { expr, .. }) => expr.nullable(input_schema),
+
+            Expr::InList(InList { expr, list, .. }) => {
+                // Avoid inspecting too many expressions.
+                const MAX_INSPECT_LIMIT: usize = 6;

Review Comment:
   thanks @jonahgao @alamb 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jonahgao commented on a diff in pull request #6799: fix: incorrect nullability of `InList` expr

Posted by "jonahgao (via GitHub)" <gi...@apache.org>.
jonahgao commented on code in PR #6799:
URL: https://github.com/apache/arrow-datafusion/pull/6799#discussion_r1247692275


##########
datafusion/expr/src/expr_schema.rs:
##########
@@ -173,8 +173,30 @@ impl ExprSchemable for Expr {
             Expr::Alias(expr, _)
             | Expr::Not(expr)
             | Expr::Negative(expr)
-            | Expr::Sort(Sort { expr, .. })
-            | Expr::InList(InList { expr, .. }) => expr.nullable(input_schema),
+            | Expr::Sort(Sort { expr, .. }) => expr.nullable(input_schema),
+
+            Expr::InList(InList { expr, list, .. }) => {
+                // Avoid inspecting too many expressions.
+                const MAX_INSPECT_LIMIT: usize = 6;

Review Comment:
   > I want to know why use this number? Is it the practice from other systems?
   
   @jackwener No. The `nullable` function may be called multiple times during the optimization phase, 
   So I think adding a limitation would be preferable in order to prevent it from being excessively slow.
   But I'm not quite sure what would be an appropriate number.
   
      
   
   > spark handle it simplified.
   > 
   > ```scala
   > override def nullable: Boolean = children. Exists(_.nullable)
   > ```
   This seems to be a cache style.
   We can implement this by precomputing nullable in the `InList::new()` function.
   But the disadvantages are:
   - We need a new field for the `InList` struct
   - The precomputed nullable may not be used.
   - Calculating nullable requires `input_schema`
   
   @jackwener  Which solution do you prefer?
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #6799: fix: incorrect nullability of `InList` expr

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #6799:
URL: https://github.com/apache/arrow-datafusion/pull/6799#discussion_r1248216604


##########
datafusion/expr/src/expr_schema.rs:
##########
@@ -173,8 +173,30 @@ impl ExprSchemable for Expr {
             Expr::Alias(expr, _)
             | Expr::Not(expr)
             | Expr::Negative(expr)
-            | Expr::Sort(Sort { expr, .. })
-            | Expr::InList(InList { expr, .. }) => expr.nullable(input_schema),
+            | Expr::Sort(Sort { expr, .. }) => expr.nullable(input_schema),
+
+            Expr::InList(InList { expr, list, .. }) => {
+                // Avoid inspecting too many expressions.
+                const MAX_INSPECT_LIMIT: usize = 6;

Review Comment:
   As @jonahgao  hints at , trying to add some sort of cache / memoization is going to be challenging given Rust and how DataFusion is structured
   
   I think the current PR solution is good:
   1. It is well tested 
   2. It is conservative (it might say an InList is nullable that isn't, but I think that will not generate wrong results, only potentially less optimal plans)
   
   Thus I think we should merge this PR as is and then as a follow on PR we can remove the limit or increase it, etc. 
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org