You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "liukuijian8040 (via GitHub)" <gi...@apache.org> on 2023/05/13 09:01:31 UTC

[GitHub] [spark] liukuijian8040 opened a new pull request, #41162: [SPARK-43491] In expression should act as same as EqualTo when elements in IN expression have same DataType.

liukuijian8040 opened a new pull request, #41162:
URL: https://github.com/apache/spark/pull/41162

   ### What changes were proposed in this pull request?
   When dataTypes of elements in In expression are the same, it will behaviour as same as BinaryComparison like EqualTo when the switch is open.
   `
   // test data and content: 
   // test.json
   // {"name":"Michael","age":0}
   
   // test SQL
   spark.read().json("examples/src/main/resources/test.json").createOrReplaceTempView("t");
   spark.sql("select * from t where age in ('00')").explain(true);
   
   // Before change:
   == Parsed Logical Plan ==
   'Project [*]
   +- 'Filter 'age IN (00)
      +- 'UnresolvedRelation [t], [], false
   
   == Analyzed Logical Plan ==
   age: bigint, name: string
   Project [age#7L, name#8]
   +- Filter cast(age#7L as string) IN (cast(00 as string))
      +- SubqueryAlias t
   	  +- Relation[age#7L,name#8] json
   
   == Optimized Logical Plan ==
   Filter (isnotnull(age#7L) AND (cast(age#7L as string) = 00))
   +- Relation[age#7L,name#8] json
   
   == Physical Plan ==
   *(1) Filter (isnotnull(age#7L) AND (cast(age#7L as string) = 00))
   +- FileScan json [age#7L,name#8] Batched: false, DataFilters: [isnotnull(age#7L), (cast(age#7L as string) = 00)], Format: JSON, Location: InMemoryFileIndex[file:/D:/code/spark/examples/src/main/resources/test.json], PartitionFilters: [], PushedFilters: [IsNotNull(age)], ReadSchema: struct<age:bigint,name:string>
   
   +---+----+
   |age|name|
   +---+----+
   +---+----+
   
   
   // After change:
   spark.sql("select * from t where age = '00'").explain(true);
   == Parsed Logical Plan ==
   'Project [*]
   +- 'Filter 'age IN (00)
      +- 'UnresolvedRelation [t], [], false
   
   == Analyzed Logical Plan ==
   age: bigint, name: string
   Project [age#7L, name#8]
   +- Filter cast(age#7L as bigint) IN (cast(00 as bigint))
      +- SubqueryAlias t
   	  +- Relation[age#7L,name#8] json
   
   == Optimized Logical Plan ==
   Filter (isnotnull(age#7L) AND (age#7L = 0))
   +- Relation[age#7L,name#8] json
   
   == Physical Plan ==
   *(1) Filter (isnotnull(age#7L) AND (age#7L = 0))
   +- FileScan json [age#7L,name#8] Batched: false, DataFilters: [isnotnull(age#7L), (age#7L = 0)], Format: JSON, Location: InMemoryFileIndex[file:/D:/code/spark/examples/src/main/resources/test.json], PartitionFilters: [], PushedFilters: [IsNotNull(age), EqualTo(age,0)], ReadSchema: struct<age:bigint,name:string>
   
   +---+-------+
   |age|   name|
   +---+-------+
   |  0|Michael|
   +---+-------+
   `
   
   ### Why are the changes needed?
   The query results of Spark SQL and Hive SQL are inconsistent with same sql. Spark SQL calculates 0 in ('00') as false, which act different from = keyword, but Hive calculates true. Hive is compatible with the in keyword in 3.1.0, but SparkSQL does not.
   for example, this two query sql should have same result, how ever, the query result is different:
   `
   scala> spark.sql("select 1 as test where 0 in ('00')").show;
   +----+
   |test|
   +----+
   +----+
   
   
   scala> spark.sql("select 1 as test where 0 = '00'").show;
   +----+                                                                          
   |test|
   +----+
   |   1|
   +----+
   
   `
   
   ### Does this PR introduce _any_ user-facing change?
   We add a switch to support In expression compatible with EqualTo expression with false as default value, to make sure it will not change default behavior of Spark SQL.
   
   ### How was this patch tested?
   By set spark.sql.legacy.inExpressionCompatibleWithEqualTo.enabled=true/false, to check whether the analyzed logical plan will cast expression as expected. For true, it will generate same Cast logical plan as EqualTo, and false will keep the old Cast logical plan solution.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] liukuijian8040 commented on pull request #41162: [SPARK-43491][SQL] In expression should act as same as EqualTo when elements in IN expression have same DataType.

Posted by "liukuijian8040 (via GitHub)" <gi...@apache.org>.
liukuijian8040 commented on PR #41162:
URL: https://github.com/apache/spark/pull/41162#issuecomment-1622846241

   cc @cloud-fan @srielau please for more other thought of this pr?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] liukuijian8040 commented on pull request #41162: [SPARK-43491][SQL] In expression should act as same as EqualTo when elements in IN expression have same DataType.

Posted by "liukuijian8040 (via GitHub)" <gi...@apache.org>.
liukuijian8040 commented on PR #41162:
URL: https://github.com/apache/spark/pull/41162#issuecomment-1551234335

   gentle ping @cloud-fan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #41162: [SPARK-43491][SQL] In expression should act as same as EqualTo when elements in IN expression have same DataType.

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #41162:
URL: https://github.com/apache/spark/pull/41162#issuecomment-1552504318

   BTW can we also check the behavior in other databases like mysql, postgres, oracle, etc.?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] liukuijian8040 commented on a diff in pull request #41162: [SPARK-43491][SQL] In expression should act as same as EqualTo when elements in IN expression have same DataType.

Posted by "liukuijian8040 (via GitHub)" <gi...@apache.org>.
liukuijian8040 commented on code in PR #41162:
URL: https://github.com/apache/spark/pull/41162#discussion_r1197258991


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala:
##########
@@ -509,16 +509,25 @@ case class In(value: Expression, list: Seq[Expression]) extends Predicate {
     val valueArg = ctx.freshName("valueArg")
     // All the blocks are meant to be inside a do { ... } while (false); loop.
     // The evaluation of variables can be stopped when we find a matching value.
-    val listCode = listGen.map(x =>
-      s"""
-         |${x.code}
-         |if (${x.isNull}) {
-         |  $tmpResult = $HAS_NULL; // ${ev.isNull} = true;
-         |} else if (${ctx.genEqual(value.dataType, valueArg, x.value)}) {
-         |  $tmpResult = $MATCHED; // ${ev.isNull} = false; ${ev.value} = true;
-         |  continue;
-         |}
-       """.stripMargin)
+    val listCode = listGen.map { x =>
+      val codeIf =
+        s"""
+           |${x.code}
+           |if (${x.isNull}) {
+           |  $tmpResult = $HAS_NULL; // ${ev.isNull} = true;
+           |}
+         """.stripMargin
+      val codeElseIf =
+        if (!java.lang.Boolean.parseBoolean(x.isNull.toString)) {

Review Comment:
   `x.null` is a `ExprCode` type. and toString returns `code` string. So, there are three cases:
   a. code is a variable name of the Boolean type, for example, if (varArg_0)
   b. code is false, if (false)
   c. code is true, if (true)
   The `else if` branch does not need to be generated when `isNull` is set to true.
    
   ![image](https://github.com/apache/spark/assets/132866841/9e133749-30f7-4644-8d56-ede17bea7ff4)
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #41162: [SPARK-43491][SQL] In expression should act as same as EqualTo when elements in IN expression have same DataType.

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #41162:
URL: https://github.com/apache/spark/pull/41162#issuecomment-1552411237

   I think this is indeed an issue, but it seems a bit weird to special-case the 1-element-in-list case. Thoughts? @gengliangwang @srielau 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] liukuijian8040 commented on pull request #41162: [SPARK-43491][SQL] In expression should act as same as EqualTo when elements in IN expression have same DataType.

Posted by "liukuijian8040 (via GitHub)" <gi...@apache.org>.
liukuijian8040 commented on PR #41162:
URL: https://github.com/apache/spark/pull/41162#issuecomment-1549567123

   @cloud-fan @wzhfy , please help review this pr, thanks. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-43491][SQL] In expression should act as same as EqualTo when elements in IN expression have same DataType. [spark]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] closed pull request #41162: [SPARK-43491][SQL] In expression should act as same as EqualTo when elements in IN expression have same DataType.
URL: https://github.com/apache/spark/pull/41162


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wzhfy commented on a diff in pull request #41162: [SPARK-43491][SQL] In expression should act as same as EqualTo when elements in IN expression have same DataType.

Posted by "wzhfy (via GitHub)" <gi...@apache.org>.
wzhfy commented on code in PR #41162:
URL: https://github.com/apache/spark/pull/41162#discussion_r1197253055


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala:
##########
@@ -509,16 +509,25 @@ case class In(value: Expression, list: Seq[Expression]) extends Predicate {
     val valueArg = ctx.freshName("valueArg")
     // All the blocks are meant to be inside a do { ... } while (false); loop.
     // The evaluation of variables can be stopped when we find a matching value.
-    val listCode = listGen.map(x =>
-      s"""
-         |${x.code}
-         |if (${x.isNull}) {
-         |  $tmpResult = $HAS_NULL; // ${ev.isNull} = true;
-         |} else if (${ctx.genEqual(value.dataType, valueArg, x.value)}) {
-         |  $tmpResult = $MATCHED; // ${ev.isNull} = false; ${ev.value} = true;
-         |  continue;
-         |}
-       """.stripMargin)
+    val listCode = listGen.map { x =>
+      val codeIf =
+        s"""
+           |${x.code}
+           |if (${x.isNull}) {
+           |  $tmpResult = $HAS_NULL; // ${ev.isNull} = true;
+           |}
+         """.stripMargin
+      val codeElseIf =
+        if (!java.lang.Boolean.parseBoolean(x.isNull.toString)) {

Review Comment:
   better to add a comment here



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wzhfy commented on pull request #41162: [SPARK-43491][SQL] In expression should act as same as EqualTo when elements in IN expression have same DataType.

Posted by "wzhfy (via GitHub)" <gi...@apache.org>.
wzhfy commented on PR #41162:
URL: https://github.com/apache/spark/pull/41162#issuecomment-1552291841

   I also think that the different results between 0 in ('00') and 0 = '00' are confusing, and seems hive already fixed this problem.
   Could you also take a look? @cloud-fan @MaxGekk 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] liukuijian8040 commented on pull request #41162: [SPARK-43491][SQL] In expression should act as same as EqualTo when elements in IN expression have same DataType.

Posted by "liukuijian8040 (via GitHub)" <gi...@apache.org>.
liukuijian8040 commented on PR #41162:
URL: https://github.com/apache/spark/pull/41162#issuecomment-1552572954

   quickly check behavior in mysql, and `in ('00')` has same query result with `= '00'`, but mysql does not support . @cloud-fan 
   ![image](https://github.com/apache/spark/assets/132866841/715bf24b-f700-4110-bc10-27c15a27fc0d)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] liukuijian8040 commented on a diff in pull request #41162: [SPARK-43491][SQL] In expression should act as same as EqualTo when elements in IN expression have same DataType.

Posted by "liukuijian8040 (via GitHub)" <gi...@apache.org>.
liukuijian8040 commented on code in PR #41162:
URL: https://github.com/apache/spark/pull/41162#discussion_r1197258991


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala:
##########
@@ -509,16 +509,25 @@ case class In(value: Expression, list: Seq[Expression]) extends Predicate {
     val valueArg = ctx.freshName("valueArg")
     // All the blocks are meant to be inside a do { ... } while (false); loop.
     // The evaluation of variables can be stopped when we find a matching value.
-    val listCode = listGen.map(x =>
-      s"""
-         |${x.code}
-         |if (${x.isNull}) {
-         |  $tmpResult = $HAS_NULL; // ${ev.isNull} = true;
-         |} else if (${ctx.genEqual(value.dataType, valueArg, x.value)}) {
-         |  $tmpResult = $MATCHED; // ${ev.isNull} = false; ${ev.value} = true;
-         |  continue;
-         |}
-       """.stripMargin)
+    val listCode = listGen.map { x =>
+      val codeIf =
+        s"""
+           |${x.code}
+           |if (${x.isNull}) {
+           |  $tmpResult = $HAS_NULL; // ${ev.isNull} = true;
+           |}
+         """.stripMargin
+      val codeElseIf =
+        if (!java.lang.Boolean.parseBoolean(x.isNull.toString)) {

Review Comment:
   `x.null` is a `ExprCode` type. and toString returns code string. So, there are three cases:
   a. code is a variable name of the Boolean type, for example, if (varArg_0)
   b. code is false, if (false)
   c. code is true, if (true)
   The `else if` branch does not need to be generated when isNull is set to true.
    
   ![image](https://github.com/apache/spark/assets/132866841/9e133749-30f7-4644-8d56-ede17bea7ff4)
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-43491][SQL] In expression should act as same as EqualTo when elements in IN expression have same DataType. [spark]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #41162:
URL: https://github.com/apache/spark/pull/41162#issuecomment-1763550996

   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org