You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "miland-db (via GitHub)" <gi...@apache.org> on 2024/03/20 16:22:28 UTC

[PR] [SPARK-47358][SQL][COLLATION] Improve repeat expression support to return correct datatype [spark]

miland-db opened a new pull request, #45615:
URL: https://github.com/apache/spark/pull/45615

   ### What changes were proposed in this pull request?
   Enable collation support for the StringRepeat built-in string function in Spark.
   
   ### Why are the changes needed?
   So we can propagate input string datatype as a result of repeat expression.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, users will now get different type when using REPEAT on collated strings in Spark SQL queries.
   
   ### How was this patch tested?
   Unit test in `CollationStringExpressionsSuite.scala`
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47358][SQL][COLLATION] Improve repeat expression support to return correct datatype [spark]

Posted by "miland-db (via GitHub)" <gi...@apache.org>.
miland-db commented on code in PR #45615:
URL: https://github.com/apache/spark/pull/45615#discussion_r1535317953


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala:
##########
@@ -70,6 +71,34 @@ class CollationStringExpressionsSuite extends QueryTest with SharedSparkSession
     })
   }
 
+  case class RepeatCollationTestCase[R](s: String, n: Int, collation: String, expectedResult: R)
+
+  test("Support Repeat string expression with Collation") {
+    // Supported collations
+    val checksRepeat = Seq(
+      RepeatCollationTestCase("Spark", 2, "UTF8_BINARY", "SparkSpark"),
+      RepeatCollationTestCase("Spark", 2, "UTF8_BINARY_LCASE", "SparkSpark"),
+      RepeatCollationTestCase("Spark", 2, "UNICODE", "SparkSpark"),
+      RepeatCollationTestCase("Spark", 2, "UNICODE_CI", "SparkSpark")
+    )
+    checksRepeat.foreach(ct => {
+      checkAnswer(sql(s"SELECT repeat(collate('${ct.s}', '${ct.collation}'), ${ct.n})"),
+        Row(ct.expectedResult))
+    })
+
+    // Check return type
+    val checksReturnType = Seq(
+      RepeatCollationTestCase("Spark", 2, "UTF8_BINARY", "UTF8_BINARY"),
+      RepeatCollationTestCase("Spark", 2, "UTF8_BINARY_LCASE", "UTF8_BINARY_LCASE"),
+      RepeatCollationTestCase("Spark", 2, "UNICODE", "UNICODE"),
+      RepeatCollationTestCase("Spark", 2, "UNICODE_CI", "UNICODE_CI")
+    )
+    checksReturnType.foreach(ct => {
+      checkAnswer(sql(s"SELECT collation(repeat(collate('${ct.s}', '${ct.collation}'), ${ct.n}))"),

Review Comment:
   I have changed the tests now. Please check it again



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47358][SQL][COLLATION] Improve repeat expression support to return correct datatype [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk closed pull request #45615: [SPARK-47358][SQL][COLLATION] Improve repeat expression support to return correct datatype
URL: https://github.com/apache/spark/pull/45615


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47358][SQL][COLLATION] Improve repeat expression support to return correct datatype [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45615:
URL: https://github.com/apache/spark/pull/45615#discussion_r1537389338


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala:
##########
@@ -70,6 +73,19 @@ class CollationStringExpressionsSuite extends QueryTest with SharedSparkSession
     })
   }
 
+  test("REPEAT check output type on explicitly collated string") {
+    def testRepeat(expected: String, stringType: Integer, input: String, n: Integer): Unit = {

Review Comment:
   ```suggestion
       def testRepeat(expected: String, collationId: Int, input: String, n: Int): Unit = {
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47358][SQL][COLLATION] Improve repeat expression support to return correct datatype [spark]

Posted by "miland-db (via GitHub)" <gi...@apache.org>.
miland-db commented on PR #45615:
URL: https://github.com/apache/spark/pull/45615#issuecomment-2012987740

   @MaxGekk @cloud-fan could you please check this and merge it if everything is ok?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47358][SQL][COLLATION] Improve repeat expression support to return correct datatype [spark]

Posted by "miland-db (via GitHub)" <gi...@apache.org>.
miland-db commented on code in PR #45615:
URL: https://github.com/apache/spark/pull/45615#discussion_r1533420973


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -71,8 +71,7 @@ import org.apache.spark.util.ArrayImplicits._
   since = "1.5.0",
   group = "string_funcs")
 // scalastyle:on line.size.limit
-case class ConcatWs(children: Seq[Expression])
-  extends Expression with ImplicitCastInputTypes {
+case class ConcatWs(children: Seq[Expression]) extends Expression with ImplicitCastInputTypes {

Review Comment:
   I can do that, but all scala style tests are failing then.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47358][SQL][COLLATION] Improve repeat expression support to return correct datatype [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.
stefankandic commented on code in PR #45615:
URL: https://github.com/apache/spark/pull/45615#discussion_r1532420069


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala:
##########
@@ -70,6 +70,34 @@ class CollationStringExpressionsSuite extends QueryTest with SharedSparkSession
     })
   }
 
+  case class RepeatCollationTestCase[R](s: String, n: Int, collation: String, expectedResult: R)
+
+  test("Support Repeat string expression with Collation") {
+    // Supported collations
+    val checksRepeat = Seq(
+      RepeatCollationTestCase("Spark", 2, "UTF8_BINARY", "SparkSpark"),

Review Comment:
   i see that you're just doing the same thing as the previous test but I'm not a fan of creating a class for each test
   
   something i would recommend would be:
   ```
     checksRepeat.foreach { case (s, n, collation, expectedResult) =>
       ...
     }
   ```
   
   but this is not a big deal, so it's up to you whichever you prefer



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47358][SQL][COLLATION] Improve repeat expression support to return correct datatype [spark]

Posted by "miland-db (via GitHub)" <gi...@apache.org>.
miland-db commented on code in PR #45615:
URL: https://github.com/apache/spark/pull/45615#discussion_r1537382204


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala:
##########
@@ -70,6 +73,19 @@ class CollationStringExpressionsSuite extends QueryTest with SharedSparkSession
     })
   }
 
+  test("REPEAT check output type on explicitly collated string") {
+    def testRepeat(expected: String, stringType: Integer, input: String, n: Integer): Unit = {
+      val s = Literal.create(input, StringType(stringType))
+
+      checkEvaluation(Collation(StringRepeat(s, Literal.create(n))).replacement, expected)

Review Comment:
   Yes, I test if the output type matches the expected result. You can see below that I test if the result has type for example `expected="UNICODE_CI"`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47358][SQL][COLLATION] Improve repeat expression support to return correct datatype [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #45615:
URL: https://github.com/apache/spark/pull/45615#discussion_r1534315850


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala:
##########
@@ -70,6 +71,34 @@ class CollationStringExpressionsSuite extends QueryTest with SharedSparkSession
     })
   }
 
+  case class RepeatCollationTestCase[R](s: String, n: Int, collation: String, expectedResult: R)
+
+  test("Support Repeat string expression with Collation") {
+    // Supported collations
+    val checksRepeat = Seq(
+      RepeatCollationTestCase("Spark", 2, "UTF8_BINARY", "SparkSpark"),
+      RepeatCollationTestCase("Spark", 2, "UTF8_BINARY_LCASE", "SparkSpark"),
+      RepeatCollationTestCase("Spark", 2, "UNICODE", "SparkSpark"),
+      RepeatCollationTestCase("Spark", 2, "UNICODE_CI", "SparkSpark")
+    )
+    checksRepeat.foreach(ct => {
+      checkAnswer(sql(s"SELECT repeat(collate('${ct.s}', '${ct.collation}'), ${ct.n})"),

Review Comment:
   The test suites like "*ExpressionSuite" should test expression "directly", see `StringExpressionsSuite` using `checkEvaluation` (codegen, non-codegen, and so on).



##########
sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala:
##########
@@ -70,6 +71,34 @@ class CollationStringExpressionsSuite extends QueryTest with SharedSparkSession
     })
   }
 
+  case class RepeatCollationTestCase[R](s: String, n: Int, collation: String, expectedResult: R)
+
+  test("Support Repeat string expression with Collation") {
+    // Supported collations
+    val checksRepeat = Seq(
+      RepeatCollationTestCase("Spark", 2, "UTF8_BINARY", "SparkSpark"),
+      RepeatCollationTestCase("Spark", 2, "UTF8_BINARY_LCASE", "SparkSpark"),
+      RepeatCollationTestCase("Spark", 2, "UNICODE", "SparkSpark"),
+      RepeatCollationTestCase("Spark", 2, "UNICODE_CI", "SparkSpark")
+    )
+    checksRepeat.foreach(ct => {
+      checkAnswer(sql(s"SELECT repeat(collate('${ct.s}', '${ct.collation}'), ${ct.n})"),
+        Row(ct.expectedResult))
+    })
+
+    // Check return type
+    val checksReturnType = Seq(
+      RepeatCollationTestCase("Spark", 2, "UTF8_BINARY", "UTF8_BINARY"),
+      RepeatCollationTestCase("Spark", 2, "UTF8_BINARY_LCASE", "UTF8_BINARY_LCASE"),
+      RepeatCollationTestCase("Spark", 2, "UNICODE", "UNICODE"),
+      RepeatCollationTestCase("Spark", 2, "UNICODE_CI", "UNICODE_CI")
+    )
+    checksReturnType.foreach(ct => {
+      checkAnswer(sql(s"SELECT collation(repeat(collate('${ct.s}', '${ct.collation}'), ${ct.n}))"),

Review Comment:
   Similar to @cloud-fan comment at https://github.com/apache/spark/pull/45611#discussion_r1533994036. Let's write unit tests, and don't test entire stack by running end-to-end tests.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47358][SQL][COLLATION] Improve repeat expression support to return correct datatype [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on PR #45615:
URL: https://github.com/apache/spark/pull/45615#issuecomment-2019526503

   The GA's failure is not related to the changes, I believe: [Run / Run Spark on Kubernetes Integration test](https://github.com/miland-db/spark/actions/runs/8422755164/job/23065575631#logs)
   
   +1, LGTM. Merging to master.
   Thank you, @miland-db and @cloud-fan @stefankandic for review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47358][SQL][COLLATION] Improve repeat expression support to return correct datatype [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45615:
URL: https://github.com/apache/spark/pull/45615#discussion_r1537389338


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala:
##########
@@ -70,6 +73,19 @@ class CollationStringExpressionsSuite extends QueryTest with SharedSparkSession
     })
   }
 
+  test("REPEAT check output type on explicitly collated string") {
+    def testRepeat(expected: String, stringType: Integer, input: String, n: Integer): Unit = {

Review Comment:
   ```suggestion
       def testRepeat(expected: String, collationId: Integer, input: String, n: Integer): Unit = {
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47358][SQL][COLLATION] Improve repeat expression support to return correct datatype [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #45615:
URL: https://github.com/apache/spark/pull/45615#discussion_r1533393334


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -71,8 +71,7 @@ import org.apache.spark.util.ArrayImplicits._
   since = "1.5.0",
   group = "string_funcs")
 // scalastyle:on line.size.limit
-case class ConcatWs(children: Seq[Expression])
-  extends Expression with ImplicitCastInputTypes {
+case class ConcatWs(children: Seq[Expression]) extends Expression with ImplicitCastInputTypes {

Review Comment:
   Can we remove those unrelated changes? those changes make it harder to backport, cherry-pick, revert, etc.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47358][SQL][COLLATION] Improve repeat expression support to return correct datatype [spark]

Posted by "miland-db (via GitHub)" <gi...@apache.org>.
miland-db commented on code in PR #45615:
URL: https://github.com/apache/spark/pull/45615#discussion_r1537382204


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala:
##########
@@ -70,6 +73,19 @@ class CollationStringExpressionsSuite extends QueryTest with SharedSparkSession
     })
   }
 
+  test("REPEAT check output type on explicitly collated string") {
+    def testRepeat(expected: String, stringType: Integer, input: String, n: Integer): Unit = {
+      val s = Literal.create(input, StringType(stringType))
+
+      checkEvaluation(Collation(StringRepeat(s, Literal.create(n))).replacement, expected)

Review Comment:
   Yes, I test if the output type matches the expected result. You can see below that I test if the result has type for example `expected="UNICODE_CI` 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47358][SQL][COLLATION] Improve repeat expression support to return correct datatype [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #45615:
URL: https://github.com/apache/spark/pull/45615#discussion_r1537379721


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala:
##########
@@ -70,6 +73,19 @@ class CollationStringExpressionsSuite extends QueryTest with SharedSparkSession
     })
   }
 
+  test("REPEAT check output type on explicitly collated string") {
+    def testRepeat(expected: String, stringType: Integer, input: String, n: Integer): Unit = {
+      val s = Literal.create(input, StringType(stringType))
+
+      checkEvaluation(Collation(StringRepeat(s, Literal.create(n))).replacement, expected)

Review Comment:
   Does `checkEvaluation` check the output type?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org